The State of Linux IO Scheduling For the Desktop?
pinkeen writes "I've used Linux as my work & play OS for 5+ years. The one thing that constantly drives me mad is its IO scheduling. When I'm copying a large amount of data in the background, everything else slows down to a crawl while the CPU utilization stays at 1-2%. The process which does the actual copying is highly prioritized in terms of I/O. This is completely unacceptable for a desktop OS. I've heard about the efforts of Con Kolivas and his Brainfuck Scheduler, but it's unsupported now and probably incompatible with the latest kernels. Is there any way to fix this? How do you deal with this? I have a feeling that if this issue were fixed, the whole desktop would become much snappier, even if you're not doing any heavy IO in the background."
Update: 10/23 22:06 GMT by T : As reader ehntoo points out in the discussion below, contrary to the submitter's impression, "Con Kolivas is still actively working on BFS, it's not unsupported. He's even got a patch for 2.6.36, which was only released on the 20th. He's also got a patchset out that I use on all my desktops which includes a bunch of tweaks for desktop use." Thanks to ehntoo, and hat tip to Bill Huey.
It sucks I agree (Score:4, Interesting)
This issue got so bad for me I switched to FreeBSD.
Re:what about servers? (Score:2, Interesting)
On IO-intensive servers this is also a real issue. 20-30% of processors and cores stuck with a 99% iowait for hours, while the rest tries to cope. Total CPU load does not go above 20%. No solution yet after months of study and experimenting. Linux is indeed really bad at IO scheduling in general, it seems.
Now think of that situation and a heavy database system. A no-no solution.
Is it really only a matter of scheduling? (Score:5, Interesting)
Re:Perhaps you should.. (Score:4, Interesting)
There's a bug in Chrome that usually makes it unable to paste into Slashdot's comment box once you've placed a < character in the box. (Slashdot, specifically. It does fine on all sorts of other sites with even fancier ajaxy textareas like the stackoverflow sites.)
Re:what about servers? (Score:3, Interesting)
How does this happen? Every year it seems I read about how this problem has been fixed in the latest kernel, and then it's like those fixes mysteriously vanish?
Re:what about servers? (Score:2, Interesting)
This problem is highly visible in VMs. When you have one VM doing write-heavy disk IO, the other VMs suffer.
I don't think it's a Linux problem as much as a general problem of the compromises that must be made by any scheduling algorithm.
What about you Linux mainframe guys? You have unbeatable IO subsystems. Do you see the same problems?
Switch to Deadline (Score:1, Interesting)
I ran into the same problems and ended up switching to the "deadline" scheduler. Haven't had a single problem since. I changed it via the "elevator=deadline" on the kernel boot prompt, but you can change it on the fly for individual devices. See Configuring and Optimizing Your I/O Scheduler [devshed.com] to see how.
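For anyone who wants to check which scheduler a device is using before switching, here is a minimal sketch. It assumes the usual sysfs layout, where /sys/block/<device>/queue/scheduler holds a line such as "noop deadline [cfq]" with the active scheduler in brackets. The function names and the "sda" default are made up for illustration.

```python
from pathlib import Path


def parse_schedulers(text):
    """Split sysfs scheduler text like 'noop deadline [cfq]' into (available, active)."""
    available, active = [], None
    for token in text.split():
        # The kernel marks the scheduler currently in use with square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return available, active


def current_scheduler(device="sda"):
    """Read the scheduler line for one block device ('sda' is only a placeholder)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())
```

Switching on the fly is then a matter of writing one of the listed names into that same file as root, for example "echo deadline > /sys/block/sda/queue/scheduler".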
OS/2 (Score:2, Interesting)
I remember using OS/2 (IBM's desktop OS) and I was always amazed that you could format a floppy and do other tasks like nothing else was going on. I never did understand why that never seemed to make it into the mainstream.
Wrong Question (Score:3, Interesting)
This is not a case of Linux IO schedulers being unsuitable for the desktop, but more a case of desktop applications being written in a horrendous way in terms of data access. The general pattern is to open up a file object, load in a few hundred kilobytes, process them, and then ask the operating system for more. This is a small inefficiency when the disk is otherwise idle, but if the disk is actually busy, then it will probably be doing something else by the time you ask it to read a little bit more. Not to mention the habit of reading through a few hundred resource files one at a time in seemingly random order, and blocking on every read, because the application programmer is too lazy to think about what resources the app is using.
Linux has such a nice implementation of mmap, which works by letting Linux actually know ahead of time what files you are interested in and managing them itself, without the application programmer worrying his pretty little head over it. Other options are running multiple non-blocking reads at the same time and loading the right amount of data and the right files to begin with.
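To make the mmap suggestion concrete, here is a rough sketch of scanning a file through a mapping instead of a loop of small read() calls. The function name and the newline-counting task are only illustrative, and the madvise hint is optional: it needs Python 3.8+ and a platform that exposes MADV_SEQUENTIAL.

```python
import mmap


def count_newlines_mmap(path):
    """Count newline bytes by mapping the file rather than issuing many small reads."""
    with open(path, "rb") as f:
        # mmap cannot map an empty file, so handle that case up front.
        if f.seek(0, 2) == 0:
            return 0
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Optional hint that the mapping will be read front to back.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count
```

The kernel sees the whole mapped range at once and can decide for itself how far to read ahead, which is the point being made above.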
The best thing about a simple CSCAN algorithm is that it gives applications what they asked for and if the application doesn't know what it wants, well, that's hardly a system issue.
Re:It sucks I agree (Score:5, Interesting)
This is the number one problem with all Linux installations I have ever used. The problem is most noticeable in Ubuntu where, any time one of the frequent update/tracker programs runs, the entire system will become all but unusable for several minutes.
I don't know if it's all that related, but swap slowdown is an appalling issue as well. If a single program spikes in RAM usage, I often have to reboot the whole system as it hangs indefinitely. As I work with Octave a lot, often a script will gobble up a few hundred megs of memory and push the system into swap. Once that happens, it's often too late to do anything about it as programs simply will not respond.
Re:It sucks I agree (Score:4, Interesting)
That's exactly why I stopped using swap a couple of years ago. On my main machine I have 3 GB and I feel that if I reach the limit on that, then whatever program is running is probably a lost cause anyway. The next malloc/new causes the program to crash, saving the system.
Re:what about servers? (Score:3, Interesting)
It's been a big issue for me. Go to a directory with a couple of large files (say a dvd rip) and do a "cat * > newfile". Watch your system come to a crawl.
Re:easy solution: (Score:1, Interesting)
This is actually one of the very reasons (the other being multithreaded performance) why many of us use Windows Server 2003/2008 sometimes in preference to Linux.
Re:BFS Isn't Unsupported (Score:2, Interesting)
I had been wondering about this myself, for some reason I was under the impression that the BFS was no longer being maintained.
It turns out there is an up-to-date package for Ubuntu (I'm running 10.10) as well: http://launchpad.net/~chogydan/+archive/ppa [launchpad.net]
I thought I'd try it out as the installation was much more straightforward than I'd expected.
'uname -r' now reveals "2.6.35-22ck-generic" and, while this is just my subjective assessment, a few of the quirks I had noticed before on my own system, where things would get sluggish when switching between apps or opening/closing apps while running things that read/write to the disk, seem to have been ironed out.
I would love to test this in a more empirical manner, as I can now boot into either kernel to do comparisons, but I don't know of any software that would allow me to benchmark performance in a way that is sensitive to the optimizations the BFS allegedly implements.
Re:what about servers? (Score:4, Interesting)
Re:Is it really only a matter of scheduling? (Score:3, Interesting)
Wow, that's a new one ?
Perform a non-necessary fget on a file already known to be zero bytes, just so we can get a result "this fget failed because the file is zero bytes".
while (!eof()) {
readsomething();
}
is something I learnt perhaps 20 years ago, and it's never failed me yet. Why must people always try reinventing the wheel, just to end up with an octagon ?
It isn't only IO scheduling (Score:4, Interesting)
I've encountered situations where I'm trying to do something online and a task starts up due to a cron job that builds some kind of index. The index building should be in the background but somehow takes priority over what I'm doing on the desktop. Those kinds of cron jobs should be default scheduled in the background, not take priority over what is happening on the desktop.
Re:easy solution: (Score:4, Interesting)
That's great that you post your experiences with server scheduling in a topic about desktop scheduling. It's so relevant. No wait, it's not.
The boundary between the desktop space and the server space is rather fluid, and many of the problems visible on servers are also visible on desktops - and vice versa.
For example 'copying a large amount of data' on a server is similar to 'copying a big ISO on the desktop'. If the kernel sucks doing one then it will likely suck when doing the other as well.
So both cases should be handled by the kernel in an excellent fashion - with an optimization/tuning focus on desktop workloads, because they are almost always the more diverse ones, and hence are generally the technically more challenging cases as well.
Thanks,
Ingo
Re:It sucks I agree (Score:5, Interesting)
MythTV added a feature a while back to work around this issue. IIRC, they now keep a handle open to video files while they delete them. This causes the kernel to not actually do the delete, then over a span of about 10 minutes MythTV repeatedly shaves chunks off the end using truncate() until the file reaches 0 bytes.
Prior to this, the system could get really bogged down right after deleting shows. I was careful not to delete too many shows at once; I had actually seen the back end lock up after telling it to delete a bunch of shows.
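The trick described above can be sketched roughly like this. It is not MythTV's actual code; the function name, the 64 MB chunk size and the one-second pause are assumptions picked for illustration, and it relies on Unix semantics where an unlinked file stays allocated while a descriptor is open.

```python
import os
import time


def slow_delete(path, chunk_bytes=64 * 1024 * 1024, pause_seconds=1.0):
    """Unlink a file, then shrink it step by step through a still-open descriptor."""
    fd = os.open(path, os.O_WRONLY)
    steps = 0
    try:
        os.unlink(path)  # the name disappears; the blocks stay while fd is open
        size = os.fstat(fd).st_size
        while size > 0:
            size = max(0, size - chunk_bytes)
            os.ftruncate(fd, size)  # release one chunk from the end of the file
            steps += 1
            if size > 0:
                time.sleep(pause_seconds)  # give other IO a chance between chunks
    finally:
        os.close(fd)
    return steps
```

Each ftruncate() frees a bounded amount of space, so the filesystem never has to release a multi-gigabyte file in one go.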
Re:It sucks I agree (Score:2, Interesting)
Re:Is it really only a matter of scheduling? (Score:3, Interesting)
Yeah, that's certainly a possibility.
This is also the goal of most heuristics in the kernel: to figure out a hidden piece of information that the application (and user) has not passed to the kernel explicitly.
The problem comes when the kernel gets it wrong - the kernel and applications can easily get into a feedback loop / arms race of who knows how to trick the other one into doing what the app writer (or kernel writer) thinks is best. In such cases we get the worst of both worlds: we get the bad case and we get the cost of heuristics.
(Heuristic and predictive systems also tend to be complex and hard to analyze: you can rarely reproduce bugs without having the exact same filesystem layout and usage pattern as the user experienced, etc.)
What we found is that in terms of default behavior it's a bit better to keep things simple and predictable/deterministic and then give apps the way to inject extra information into the kernel. We have the fadvise/madvise calls which can be used with the POSIX_FADV_DONTNEED flag to drop cached content from the page cache.
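As an illustration of an application injecting that kind of hint, here is a minimal sketch of a file copy that asks the kernel to drop the cached pages afterwards. The function name and block size are made up; os.posix_fadvise is Python's wrapper for the fadvise call and is only available on some Unix platforms, hence the guard.

```python
import os


def copy_without_caching(src, dst, block_size=1024 * 1024):
    """Copy src to dst, then ask the kernel to drop both files' cached pages."""
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(block_size)
            if not block:
                break
            fout.write(block)
            copied += len(block)
        fout.flush()
        os.fsync(fout.fileno())  # dirty pages cannot be dropped, so write them out first
        if hasattr(os, "posix_fadvise"):
            # Offset 0 with length 0 means "the whole file".
            os.posix_fadvise(fin.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
            os.posix_fadvise(fout.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return copied
```

The hint is advisory, so the kernel is free to ignore it; the copy itself behaves the same either way.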
Heuristics and predictive techniques are done when we can be reasonably sure that we get the decisions right: for example there's a piece of fairly advanced code in the Linux page cache trying to figure out whether to pre-fetch data or not.
The large file copy interactivity problems some have mentioned here were most likely real kernel bugs (in the filesystem, IO scheduling and VM subsystems) and were hopefully fixed in the v2.6.33 - v2.6.36 timeframe.
If you can still reproduce any such problems then please report them to linux-kernel@vger.kernel.org so we can fix it ASAP.
In any case, we could all be wrong about it, so if you have a good implementation of more aggressive predictive algorithms i'm sure a lot of people would try them out - me included. We kernel developers want a better desktop just as much as you want it.
Re:It sucks I agree (Score:5, Interesting)
There's also the VM fix from Wu Fengguang [lkml.org], included in v2.6.36, which addresses similar "slowdown while copying large amounts of data" bugs.
There were about a dozen kernel bugs causing similar symptoms, which we fixed over the course of several kernel releases. They were almost evenly spread out between filesystem code, the VM and the IO scheduler. And yes, i agree that it took too long to acknowledge and address them - these problems have been going on for several years. It's a serious kernel development process failure.
If anyone here still experiences bad desktop stalls while handling big files with v2.6.36 too then we'd appreciate a quick bug report sent to linux-kernel@vger.kernel.org.
Thanks,
Ingo
ionice -c 3 can improve performance (Score:3, Interesting)
I often note that multiple simultaneous low-priority file copies implemented as:
run faster than multiple simultaneous high-priority copies implemented as:
If the copies are run one at a time, the higher priority rsync runs faster. For multiple copies, often the lower priority rsyncs run faster. Also, desktop usability is much improved with the lower priority rsyncs.
I suspect a priority inversion occurs inside the file system's write-back cache. At regular priority levels, data is not written back to disk in a timely manner. The ionice -c 3 gives the disk caches a higher priority than the rsync I/O commands, preventing the I/O commands from filling the cache and creating a priority inversion.
The Gnome GUI in Ubuntu is particularly vulnerable to this priority inversion, as by default it does multiple copies simultaneously inside a separate window. Ubuntu usually performs better than Windows however. Between the A-V software in Windows, and the tendency to swap applications out of memory to maximize disk cache, Windows usually performs the same copy operations more slowly than Ubuntu and with less system responsiveness.
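The exact commands from the comment above did not survive in this copy, so the following is only a guess at the general shape: an rsync wrapped in ionice's idle class next to a plain rsync. The -a flag, the paths and the function names are assumptions, not the poster's actual invocation.

```python
import shlex


def idle_copy_command(src, dst):
    """Build an rsync command line wrapped in ionice's idle class (-c 3)."""
    return ["ionice", "-c", "3", "rsync", "-a", src, dst]


def normal_copy_command(src, dst):
    """Build the same rsync command line at the default IO priority."""
    return ["rsync", "-a", src, dst]


def as_shell(argv):
    """Render an argument list as a single shell-quoted string."""
    return shlex.join(argv)
```

Either list can be passed straight to subprocess.run(); starting several of the idle-class variant at once is the case reported above to be faster and kinder to the desktop.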