How can swapoff be that slow?


First, let’s look at what you can expect from your hard drive. Your hard drive can do 200 MB/s sequentially. When you factor seek times in, it can be much slower. To pick an arbitrary example, take a look at the specs for one of Seagate’s modern 3TB disks, the ST3000DM001:

  • Max sustained data rate: 210 MB/s

  • Seek average read: <8.5 ms

  • Bytes per sector: 4,096

If you never need to seek, and if your swap is near the edge of the disk, you can expect to see up to the maximum rate of 210 MB/s.

But if your swap data is entirely fragmented, in the worst case you'd need to seek for every sector you read. That means you only get to read 4 KB every 8.5 ms, or 4 KB / 0.0085 s ≈ 470 KB/s.
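As a quick check of that arithmetic, here is a minimal sketch. The function names are mine, and the model is deliberately crude: it assumes one full average seek per sector and ignores rotational latency and transfer time.

```python
def worst_case_read_rate(bytes_per_sector: int, seek_ms: float) -> float:
    """Bytes per second if every single sector read costs one average seek."""
    return bytes_per_sector / (seek_ms / 1000.0)


def hours_to_read(total_bytes: float, bytes_per_second: float) -> float:
    """Time needed to read total_bytes at a constant rate, in hours."""
    return total_bytes / bytes_per_second / 3600.0


# Figures quoted above for the ST3000DM001.
rate = worst_case_read_rate(bytes_per_sector=4096, seek_ms=8.5)
print(f"worst case: {rate / 1024:.1f} KB/s")

# Hypothetical example: 1 GiB of fully fragmented swap.
print(f"1 GiB at that rate: {hours_to_read(2**30, rate):.2f} hours")
print(f"1 GiB at 210 MB/s:  {hours_to_read(2**30, 210e6) * 3600:.1f} seconds")
```

The gap between the two extremes is more than two orders of magnitude, which is why a sub-1 MB/s swapoff is at least plausible on a spinning disk.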

So right off the bat, it’s not inconceivable that you are in fact running up against hard drive speeds.

That said, it does seem silly that swapoff would run so slowly and have to read pages out of order, especially if they were written quickly (which implies in-order). But that may just be how the kernel works. Ubuntu bug report #486666 discusses the same problem:

The swap is being removed at speed of 0.5 MB/s, while the
hard drive speed is 60 MB/s;
No other programs are using harddrive a lot, system is not under
high load etc.

Ubuntu 9.10 on quad core.

Swap partition is encrypted.
Top (atop) shows near 100% hard drive usage
  DSK | sdc | busy 88% | read 56 | write 0 | avio 9 ms |
but the device transfer is low (kdesysguard)
  0.4 MiB/s on /dev/sdc reads, and 0 on writes

One of the replies was:

It takes a long time to sort out because it has to rearrange and flush the
memory, as well as go through multiple decrypt cycles, etc. This is quite

The bug report was closed unresolved.

Mel Gorman’s book “Understanding the Linux Virtual Memory Manager” is a bit out of date, but agrees that this is a slow operation:

The function responsible for deactivating an area is, predictably
enough, called sys_swapoff(). This function is mainly concerned with
updating the swap_info_struct. The major task of paging in each
paged-out page is the responsibility of try_to_unuse() which is
extremely expensive.

There’s a bit more discussion from 2007 on the linux-kernel mailing list with the subject “speeding up swapoff”, although the speeds they’re discussing there are a bit higher than what you are seeing.

It’s an interesting question that is probably largely ignored, since swapoff is rarely used. I think that if you really wanted to track it down, the first step would be to watch your disk usage patterns more carefully (maybe with atop, iostat, or even more powerful tools like perf or systemtap). Things to look for might be excessive seeking, small I/O operations, constant rewriting and movement of data, etc.
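If you don't have iostat handy, the same information can be derived from two snapshots of /proc/diskstats. The sketch below is only an illustration: the function names are mine, the two sample lines are made up (chosen to resemble the numbers in the bug report), and it only looks at reads. It relies on the documented field order of /proc/diskstats and on the fact that the sector counters there are in 512-byte units.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors, whatever the device uses


def parse_diskstats_line(line: str) -> dict:
    """Pick out the read counters and busy time from one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),         # reads completed
        "sectors_read": int(f[5]),  # sectors read
        "io_ms": int(f[12]),        # milliseconds spent doing I/O
    }


def read_pattern(before: str, after: str, interval_s: float) -> dict:
    """Compare two snapshots of the same device taken interval_s seconds apart."""
    a, b = parse_diskstats_line(before), parse_diskstats_line(after)
    reads = b["reads"] - a["reads"]
    bytes_read = (b["sectors_read"] - a["sectors_read"]) * SECTOR_BYTES
    return {
        "reads_per_s": reads / interval_s,
        "mib_per_s": bytes_read / interval_s / 2**20,
        "avg_read_kib": bytes_read / reads / 1024 if reads else 0.0,
        "busy_pct": 100 * (b["io_ms"] - a["io_ms"]) / (interval_s * 1000),
    }


# Made-up snapshots of sdc, one second apart.
BEFORE = "8 32 sdc 1000 0 8000 9000 0 0 0 0 0 9000 9000"
AFTER = "8 32 sdc 1100 0 8800 9880 0 0 0 0 0 9880 9880"

stats = read_pattern(BEFORE, AFTER, interval_s=1.0)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

A disk that is nearly 100% busy while averaging 4 KiB per read and well under 1 MiB/s is seek-bound: lots of tiny scattered reads rather than a slow sequential stream.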

I’ve been experiencing the same problem with my laptop, which has an SSD, so seek times shouldn’t be a problem.

I found an alternative explanation. Here is an excerpt:

The way it works now, swapoff looks at each swapped out memory page in
the swap partition, and tries to find all the programs that use it. If
it can’t find them right away, it will look at the page tables of
every program that’s running to find them. In the worst case, it will
check all the page tables for every swapped out page in the partition.
That’s right–the same page tables get checked over and over again.

So it is a kernel problem rather than anything else.
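To get a feel for the difference, here is a toy model. It is not the kernel's code, just a worst-case caricature of the behaviour described in the excerpt. Each "page table" is a plain list whose entries are either None (page is resident) or a swap slot number, and both function names are made up. One function rescans every table for every swap slot; the other walks each table once.

```python
def scan_per_swap_slot(page_tables, used_slots):
    """Worst case from the excerpt: rescan every page table for every slot."""
    examined = 0
    owners = {}
    for slot in used_slots:
        for pid, table in page_tables.items():
            for entry in table:
                examined += 1
                if entry == slot:
                    owners.setdefault(slot, []).append(pid)
    return owners, examined


def scan_per_process(page_tables, used_slots):
    """Alternative: walk each process's page table exactly once."""
    wanted = set(used_slots)
    examined = 0
    owners = {}
    for pid, table in page_tables.items():
        for entry in table:
            examined += 1
            if entry in wanted:
                owners.setdefault(entry, []).append(pid)
    return owners, examined


# Three hypothetical processes, four entries each; slot 1 is shared by two of them.
page_tables = {
    101: [None, 0, None, 1],
    102: [2, None, None, None],
    103: [None, 3, 1, None],
}
used_slots = [0, 1, 2, 3]

slow_owners, slow_cost = scan_per_swap_slot(page_tables, used_slots)
fast_owners, fast_cost = scan_per_process(page_tables, used_slots)
print(slow_cost, fast_cost)
```

Both scans find the same owners, but the first one's cost grows with (swap slots × page table entries) while the second grows with page table entries alone. With millions of swapped pages, that factor is the whole problem.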

Yup, the swapoff mechanism is horrendously inefficient. The workaround is easy: iterate over processes instead of iterating over the swapped pages. Use this python script (I am not affiliated):

git clone https://github.com/wiedemannc/deswappify-auto
cd ./deswappify-auto
sudo python3 deswappify_auto.py -d -v info

Note that the daemon mode of operation is only for desktops/laptops that are often hibernated. I wouldn’t run it as a daemon on a server system – just run it for a while, wait until it reports it took care of some processes then stop it and try:

swapoff /dev/x

Since most of the pages are now present both in swap and in memory, swapoff has very little to do and should now be blazingly fast (I saw hundreds of MB/s).

History section ahead

The aforementioned python script is based on the rest of this answer, which in turn was my improvement of this older answer authored by jlong. As the script is much, much safer, I recommend trying the rest of my answer only as a last resort:

perl -we 'for(`ps -e -o pid,args`) { if(m/^ *(\d+) *(.{0,40})/) { $pid=$1; $desc=$2; if(open F, "/proc/$pid/smaps") { while(<F>) { if(m/^([0-9a-f]+)-([0-9a-f]+) /si){ $start_adr=$1; $end_adr=$2; }  elsif(m/^Swap:\s*(\d\d+) *kB/s){ print "SSIZE=$1_kB\t gdb --batch --pid $pid -ex \"dump memory /dev/null 0x$start_adr 0x$end_adr\"\t2>&1 >/dev/null |grep -v debug\t### $desc \n" }}}}}' | sort -Vr | head

This runs for maybe 2 seconds and won’t actually change anything; it just lists the top 10 memory segments by swap usage. More precisely, what it prints is more one-liners (yes, I do love one-liners): examine the commands, accept the risk, then copy and paste them into your shell. Those are what actually read from swap.

...Paste the generated one-liners...
swapoff /your/swap    # much faster now

The main one-liner is safe (for me), except it reads a lot of /proc.

The sub-commands prepared for your manual examination are not safe. Each command will hang one process for the duration of reading a memory segment from swap. So it’s unsafe with processes that don’t tolerate any pauses. The transfer speeds I saw were on the order of 1 gigabyte per minute. (The aforementioned python script removed that deficiency).

Another danger is putting too much memory pressure on the system, so check with the usual free -m first.
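If you'd rather have that check in script form, here is a rough sketch that reads the same numbers from /proc/meminfo. The function names and the sample text are mine, and the rule it applies (MemAvailable must cover everything currently in swap) is only a rule of thumb, not something swapoff itself checks this way.

```python
def parse_meminfo(text: str) -> dict:
    """Turn /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def swap_fits_in_ram(meminfo_text: str, safety_margin_kb: int = 0) -> bool:
    """True if the swap currently in use would fit into MemAvailable."""
    info = parse_meminfo(meminfo_text)
    swap_used_kb = info["SwapTotal"] - info["SwapFree"]
    return info["MemAvailable"] - safety_margin_kb >= swap_used_kb


# Made-up sample: 3 GiB of swap in use, only 2 GiB of RAM available.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2097152 kB
SwapTotal:       4194304 kB
SwapFree:        1048576 kB
"""

print(swap_fits_in_ram(SAMPLE))
```

On a real system you would pass in open("/proc/meminfo").read(). If the answer is False, pulling everything back from swap will push something else out or wake the OOM killer, so free up memory first.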

What does it do?

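for(`ps -e -o pid,args`) {                      # every process: pid and command line

  if(m/^ *(\d+) *(.{0,40})/) {
    $pid=$1; $desc=$2;                          # pid, plus the first 40 chars of the command

    if(open F, "/proc/$pid/smaps") {            # per-mapping memory statistics

      while(<F>) {

        if(m/^([0-9a-f]+)-([0-9a-f]+) /si){     # mapping header: remember its address range
          $start_adr=$1; $end_adr=$2;
        } elsif( m/^Swap:\s*(\d\d+) *kB/s ){    # this mapping has at least 10 kB in swap
          print "SSIZE=$1_kB\t gdb --batch --pid $pid -ex \"dump memory /dev/null 0x$start_adr 0x$end_adr\"\t2>&1 >/dev/null |grep -v debug\t### $desc \n"
        }
      }
    }
  }
}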

The output of this perl script is a series of gdb "dump memory (range)" commands, each of which recalls the swapped pages of one memory segment into memory.

The output starts with the size, so it’s easy enough to pass it through | sort -Vr | head to get the top 10 largest segments by size (SSIZE). The -V stands for version-number-suitable sorting, but it works for my purpose. I couldn’t figure out how to make numerical sort work.
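For comparison, here is roughly the same parsing step in Python, where sorting numerically is not an issue because the size is kept as an integer. This is a sketch, not a replacement for the one-liner or for the deswappify script: it only lists segments and doesn't touch gdb. The function name, the min_kb parameter (10 by default, to mirror the two-digit \d\d+ in the perl regex) and the sample smaps text are mine.

```python
import re

HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) ")  # "start-end perms ..." mapping line
SWAP = re.compile(r"^Swap:\s*(\d+) kB")            # does not match "SwapPss:"


def swapped_segments(smaps_text: str, min_kb: int = 10) -> list:
    """Return (swap_kb, start, end) per mapping, largest swap usage first."""
    segments = []
    start = end = None
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            start, end = header.group(1), header.group(2)
            continue
        swap = SWAP.match(line)
        if swap and start is not None and int(swap.group(1)) >= min_kb:
            segments.append((int(swap.group(1)), start, end))
    # Integer key, so 4096 sorts above 96 (a plain string sort would not).
    return sorted(segments, key=lambda seg: seg[0], reverse=True)


# Made-up, heavily shortened smaps content for one process.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat
Size:                 44 kB
Swap:                  0 kB
01a2b000-01a4c000 rw-p 00000000 00:00 0 [heap]
Size:                132 kB
Swap:                 96 kB
7f1c00000000-7f1c00800000 rw-p 00000000 00:00 0
Size:               8192 kB
Swap:               4096 kB
SwapPss:            4096 kB
"""

for swap_kb, seg_start, seg_end in swapped_segments(SAMPLE)[:10]:
    print(f"{swap_kb} kB  0x{seg_start}-0x{seg_end}")
```

To use it for real, feed it open(f"/proc/{pid}/smaps").read() for each pid you're interested in (you'll need to be root, or own the process).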
