Is ext4 Stable For Production Systems?
dr_dracula writes "Earlier this year, the ext4 filesystem was accepted into the Linux kernel. Shortly thereafter, it was discovered that some applications, such as KDE, were at risk of losing files when used on top of ext4. This was diagnosed as a rift between the design of the ext4 filesystem and the design of applications running on top of ext4. The crux of the problem was that applications were relying on ext3-specific behavior for flushing data to disk, which ext4 was not following. Recent kernel releases include patches to address these issues. My questions to the early adopters of ext4 are about whether the patches have performed as expected. What is your overall feeling about ext4? Do you think it is solid enough for most users to trust it with their data? Did you find any significant performance improvements compared to ext3? Is there any incentive to move to ext4, other than sheer curiosity?"
Risk Vs Benefits Analysis (Score:5, Insightful)
Is ext4 Stable For Production Systems?
Probably.
Is there any incentive to move to ext4, other than sheer curiosity?
OK, so I'm guessing production = income = your ass? Let me turn your question back to you by asking, "What is driving this need to move to ext4?" Because so far, all you've told me is that you are considering risking your ass for sheer curiosity.
I may be grossly misinformed but that is how the question sounds to me. And by "your ass" I don't mean oh-no-we-had-a-service-outage-for-five-minutes ... no, we could have a customer on the phone saying, "You mean to tell me that the modifications being made to my site for the past 24 hours are gone?!"
If it ain't broke, don't fix it!
I don't know about you but I'm too busy dealing with shit like this [youtube.com] to ponder new potential problems I can put into play.
Look through this page [wikipedia.org] for a rough comparison of ext4 with other file systems. There's a better list of features for ext4 here [wikipedia.org] that will tell you why you might need to switch to it. It is backward compatible with ext3 and ext2, so moving to it may be trivial. If you're dealing with more than 32000 subdirectories or need to partition some major petabytes/exabytes, then you might not have a choice. Some of these benefits are probably not worth risking your ass for, but if there's a business need that cannot be met any easier way, then back your shit up and do rigorous testing before you go live with it. If you're using Slashdot to feel out whether the majority of users scream OMGNOES so you don't waste your time doing that, then that's fine. Just don't do this if you don't have to.
I tell you what, there's a $288 desktop computer at Dell today [hot-deals.org] that you can buy, put ext4 on it with your OS of choice and your application(s), and whipping-boy it into next century without risking anything. Where I work we have two servers in addition to our production servers. I don't think this is an uncommon scheme, so if you have a development server, throw it on there and poke it with a stick. Then move it to the testing server and let your testers grape it [youtube.com] for two weeks. Then you'll know.
Wrong question (Score:5, Insightful)
You are asking the wrong question. Ext4 does not need fixing, the apps do.
Are your apps patched yet?
Re:Risk Vs Benefits Analysis (Score:5, Insightful)
> If it ain't broke, don't fix it!
This.
Re:Wrong question (Score:5, Insightful)
There was no single loser here.
Ext4 should handle the case gracefully, but the apps will fail on other filesystems, and they *will* be run on those filesystems, so they should fix the bugs.
No (Score:3, Insightful)
We avoid anything that has less than 24 months of wide deployment unless there is some absolute pressing need to move to an unstable/untested product.
We have test and development systems where we run latest and greatest, but generally they are used in sync with the existing system. We don't switch over until we're damn sure there aren't any unforeseen consequences. That typically means 12 months without any major hiccups and 3 months without minor ones.
Re:Wrong question (Score:3, Insightful)
You are asking the wrong question. Ext4 does not need fixing, the apps do.
Are your apps patched yet?
At the risk of revealing just how incredibly inept I am about file systems ... shouldn't your "apps" (and by apps I am guessing you mean applications) be calling the operating system to do anything to the file system? I mean, isn't the point of operating systems to create or contain APIs and the like that allow you to interface with any file system type that the OS supports?
I guess what I'm asking is whether, technically, only his operating system needs to be patched and tested for it?
Again, I don't really do this type of coding, and in all the C programming I've done, I've never seen a need, or even a way, to get down and dirty with the file system. I can dream up cases (like Google's Bigtable) where that may be desirable, with benefits if well planned, but I would imagine most of the time it would be unwise and unsafe and would make you dependent on a particular type of file system.
You're Asking Slashdot? (Score:2, Insightful)
Re:Wrong question (Score:5, Insightful)
Only on Linux is it the user's fault that apps have data loss because the Linux kernel people changed filesystem semantics. At least Microsoft takes some responsibility for their mistakes :-/
I did follow the ext4 debate. Here's my quick synopsis.
I do have a moral to this story. Filesystems have one cardinal, inviolable rule. DO NOT CORRUPT THE USER'S DATA. The guarantee is that if a user makes a read, the user will get back either good data OR an error (or explicit indication of no data). Google likes filesystems that lose data - but they don't ever give back corrupt search results. Ext3 can reorder writes - but defaults to a safe 5-second flush rate to keep the window of unexpected corruptions small. Ext4 ignored this rule and allows silent data corruption so that this filesystem can be the best at certain microbenchmarks, and instead of accepting responsibility, the kernel hacker in question blames everybody else.
The greatest danger to Linux's success is not Microsoft. It's the hubris of many Linux developers, users, and advocates, who are too busy disavowing responsibility and blaming everybody else to fix real users' problems. (And yes, I'm a follower of the Raymond Chen philosophy.)
wait until at least 2.6.30+ (Score:2, Insightful)
Last I checked, some patches for the delalloc empty-file problem were being merged in 2.6.30. If you want to avoid it now but still want some of the other advantages, like faster fscks, you can use data=journal on your filesystems, which is a bit slower but also disables delayed allocation, while still giving you extents, barriers, and other ext4 benefits. I've been using data=journal on my /home partition without a single problem.
It also depends a lot on what you have in 'production'. A web server that's mostly doing reads should be fine. A heavy email server... well... can you afford to lose email on a crash? I think it might be alright for a server that just does MTA duty, but not for the filesystem holding the actual mailboxes (with delalloc, anyway). A database server should be fine, because the database's job is to make sure data hits the disk, among other things. DNS servers are very read-heavy, so again I would think it'd be fine. So basically you need to watch anything that's write-heavy and not going through a database, and even then only with delalloc.
Still, as I'm sure others have said, it's a good idea to wait on new tech like this. Some tools don't yet recognize that ext4 is not ext3.
Re:Risk Vs Benefits Analysis (Score:3, Insightful)
What do I gain by running with ext4?
Is that gain worth the time spent changing what I've got?
If the answer to the first question is that ext4 is cool and shiny, and the answer to the second is unknown, the OP has his answer.
Filesystems are one thing we need to be VERY conservative about. We need to be certain that it works reliably, because we do not need to find our work disappearing out the end of our backup cycle after having discovered problems too late. (Yes, I know, what is this "backup" of which I speak?)
I still have drives running ReiserFS, and I still use ext2 for boot partitions mounted readonly. I pretty much trust those systems, but even so, I still take backups and test them when I can.
Re:Wrong question (Score:1, Insightful)
By your logic, web standards should be changed to match the behavior of Microsoft IE. Since IE is the most popular browser, it should not be forced to conform to the incompatible ("faulty") web standards.
This is exactly why we need precise interface specifications, along with powerful tools for checking against those specs. Otherwise, application developers will find some idiom that appears to work without regard to whether they are assuming more than the spec guarantees. As a result, their code is broken. The current OS code might not expose the error, but a future one will. The OS code should not have to include hacks for every possible interface error that could be present in application code.
Re:Wrong question (Score:5, Insightful)
This statement is incorrect. Suppose you want to atomically replace the contents of file "foo". Your application will write a file "foo.tmp", then call rename("foo.tmp", "foo"). At no time on a running system does any process observe a file called "foo" that does not have either the new or the old contents, and this invariant holds true whether or not "foo", "foo.tmp", or any other file has been flushed to the disk.
On the filesystem level, the kernel can actually write the contents of foo.tmp to disk whenever is convenient. The only constraint is that the on-disk name record for "foo" must be updated to point to the new data blocks from foo.tmp only after these data blocks have themselves been written to disk. That's the issue here: without that ordering guarantee, the kernel can write a file's name record before its data blocks. If the system crashes after the name record is written but before the data blocks are, what's observed on the recovered system is a zero-length file.
That's the problem here: the kernel is conjuring out of thin air a zero-length file that never actually existed on a running system.
Forcing applications to call fsync is not only an onerous burden on application developers, but it also reduces performance because it gives the filesystem less freedom than the much looser constraint on rename above.
Flushing on close is the wrong thing: it far exceeds the minimum requirements that most applications actually need, which will substantially reduce performance.
Not reassuring (Score:4, Insightful)
He presents three common cases for 'quickie' file modifications:
-Modify-in-place. Yes, this logically cannot be expected to leave the content intact across an unexpected interruption. You ask the OS to blow away data, then send it new data; there is a logically indeterminate state in the middle where doing things in the order you specified leaves you exposed.
-Write new file, use rename, using fsync to ensure a low exposure of data. This forces data to disk so it's coherent.
-Write new file and then use rename without fsync:
*This* he claims should easily be expected to corrupt the contents. I take issue with this. The fact that this occurs is because ext4 commits the rename out-of-order ahead of the data commit. I don't understand why the rename operation cannot also be delayed until after the data has been written out. I've seen several people ask 'I don't care that the change happens *now*, but I want the changes to occur in the order I specified', and thus far have seen Ts'o miss that point (intentionally or unintentionally). I have not read any explanation of why changing hardlinks should logically be an operation to jump ahead of pending data writeout. I could be missing something, but I'm not the only one with these questions.
fsync gives a relatively expensive guarantee above and beyond what people require to behave sanely. He says it's inexpensive 'now' relative to the past. However, 'now' in this context only applies to ext4 users, so the operation degrades performance on other filesystems, and fsync remains an expensive operation relative to not doing it at all.
In terms of the general attitude of filesystems shrugging off data consistency so long as their indexes are intact, I find myself agreeing with Torvalds' comments on the debacle:
http://thread.gmane.org/gmane.linux.kernel/811167/focus=811700 [gmane.org]
Re:I think it's "safe enough" (Score:3, Insightful)
Re:Ye (Score:5, Insightful)
So you used the "riskier" fs for / where you don't actually need the features it provides and used the "more stable" fs where features could actually be useful because app/fs developers couldn't agree on semantics?
Only on Linux...
Re:Wrong question (Score:5, Insightful)
POSIX is a red herring here. It covers the behavior of a running system, and makes no guarantees about atomicity or durability following a crash. After a crash and as far as POSIX goes, it's perfectly legitimate to overwrite the entire disk with hentai. Every crash recovery technique goes beyond POSIX because POSIX says nothing about crashes.
It most certainly does! On a running system, if you rename B over A, at no point does any process on the system observe a file called "A" that does not have either the contents of the old A or the contents of B. THIS ATOMICITY IS A FUNDAMENTAL POSIX GUARANTEE.
Filesystems should do their best to honor this guarantee (which always applies on a running system, remember) even when the system crashes. Filesystems don't have to do that according to POSIX. Instead, they should do it because it's a sane thing to do, and doesn't violate anything POSIX guarantees. POSIX is not the arbiter of what a good system should be. It's perfectly reasonable to make guarantees that go beyond POSIX, and every real-world operating system does precisely that. POSIX guarantees are necessary but insufficient for a reasonable system in 2009.
Re:ext4 is buggy (Score:5, Insightful)
But he uses R-A-I-D! R-A-I-D magically makes data bulletproof and immune to disaster as we all know.
Seriously, running a 3TB RAID with a buggy fs and applauding faster fsck times instead of wondering why the fs gets fucked up constantly must be the peak of idiocy.
Re:Wrong question (Score:3, Insightful)
Performance optimization. You can get much higher write rates if you can reorder the writes to be sequential on disk, starting with whichever one the disk head can get to first.
Re:Wrong question (Score:1, Insightful)
Unfortunately, "fixing" apps to work around ext4's brokenness means you have to fsync the new version of a file before renaming it over the old one. So instead of having KDE's 500 config files being lazily flushed to disk in a single 10-millisecond disk write, each one gets written synchronously, hanging your system for 5 whole seconds. Brilliant.
Or, I could just use ext3, which gives sane behavior (preserving either the old or new version of a file, don't care which) and doesn't require apps to be written in a way that makes you feel like you're running DOS on floppy disks.
EXT4 is not broken? (Score:3, Insightful)
Re:EXT4 is not broken? (Score:5, Insightful)
It's working exactly as designed. It's the applications that need fixing, no?
Does it matter whose fault it is when users are losing config files? It worked fine before, and now one of my basic expectations concerning Linux is broken: that no matter what happens short of hardware failure, I will not lose the files I already have. We're disappointed, and pointing fingers does not help.
Re:Wrong question (Score:2, Insightful)
At least Microsoft takes some responsibility for their mistakes
Actually, I'll take the process you described above over what occurs at Microsoft or other closed-source shops any day. They also have their fair share of stubborn, arrogant developers with the kind of attitude displayed above. The reason you don't see the kind of detailed analysis of what happened all the time like the one above is simply that it all occurs behind closed doors. Oh, and because of that, you don't see the kind of outcry that eventually leads to patches until after the product ships, if ever. Microsoft can say "we can't help it if a hardware crash corrupts your data" as well as anyone else.
Everything else in the post is right; it's just wrong in implying that this is somehow unique to Linux, when in fact it's substantially less common in Linux than at Microsoft or other such corporate development communities. Frankly, it's more common where it's less likely to result in a public airing like the one above.
Too cheap of an excuse.. (Score:3, Insightful)
His point was that POSIX doesn't speak to crash behavior. As such, if a system detects a crash and zeroes the MBR and nearby blocks, it would still be POSIX compliant, but no one would plausibly be mollified by that.
The application isn't making a complex assertion based on undocumented behavior not contained in a spec; it's making a very simple assumption that if it writes data to a file and then calls rename when those calls complete, those two operations will proceed in order. They do proceed in order on the running system, and the desire expressed is that the same ordering guarantee apply to persistent storage (it is acceptable for it to be stale/lagged, so long as the second operation doesn't jump in front of the first).
Re:Wrong question (Score:4, Insightful)
Yes.
Precisely.
NO, NO, NO. write, fsync, close, rename is how you spell "atomically replace this file" in terms of system calls. It does precisely the correct thing on a running system. You yourself admit that it "used to work". It has worked for decades, in fact. (Though before journaling filesystems, all bets were off after a crash.)
That sequence of system calls is how applications tell the kernel to replace the given file. There is no useful interpretation of those system calls that doesn't involve an atomic replacement of the whole file. We don't need a separate system call: we already have the system calls. Nobody executing those system calls wants the dangerous interpretation of rename. At no time did an application developer sit down and think to himself, "I want to tell the kernel to perform an atomic rename, except when the system crashes. In that case, I want a zero-length file." Gods, no. Obviously, the application developer wanted to atomically replace the named file. Filesystems just need to honor the obvious intent of application developers.
Re:EXT4 is not broken? (Score:3, Insightful)
But is the design any good? If the advantage of EXT4 is better performance, how much of that performance improvement will be lost once the applications are fixed?
Re:EXT4 is not broken? (Score:3, Insightful)
Does it matter whose fault it is when users are losing config files?
Finding out where the problem lies is a pre-requisite for fixing it.
It worked fine before, and now one of my basic expectations concerning Linux is broken: that no matter what happens short of hardware failure, I will not lose the files I already have.
The out-of-spec-apps-saving-files-on-ext4-loses-files bug only bites when the machine crashes or loses power, which is exactly the kind of hardware failure you excluded.
We're disappointed, and pointing fingers does not help.
Well, sure, it doesn't help now. ext4 was quickly amended to behave more like ext3, and there is no reason to bitch about the past.
Re:EXT4 is not broken? (Score:3, Insightful)
If you want to ensure your data makes it to disk, use fsync() like the specs say. If you won't use fsync(), don't complain when the FS loses your data; the specs say it MAY randomly lose for any reason, unless you fsync(). If you just want Consistency and not necessarily Durability, just make a foo.new file and rename over foo.
Re:EXT4 is not broken? (Score:4, Insightful)
Didn't your mom teach you not to forcefully shut down any operating system with any file system? Just because it has measures to reduce the damage doesn't mean you can abuse it. So in this case, it is your fault.
And here I was going around all this time, feeling sorry for ext4 users who actually experienced system crashes due to bad graphics chip drivers or other similarly silly problems. But no, it turns out that the people who complain most are those who rely on the operating system being able to resuscitate itself.
There's a reason why the filesystem syncs itself at the end of the shutdown process, and why it is expected that you follow the process to the end. There's a reason why the shutdown process exists in the first place. Throwing out cheap insults like "ext4 ranks with Windows 95" (perhaps you mean Win95's implementation of FAT?) doesn't help. Sure, it shouldn't lose stuff when the unexpected happens ... but you shouldn't rely on that or expect that it will. The unexpected is just that -- unexpected -- and you'd better be prepared for it the next time your desktop falls over while it's turned off and your drive dies a horrible death. Because God, Buddha, Allah, Shiva or someone else will make sure that happens to you, if you've raised yourself to expect that the FS will survive being constantly forcefully turned off.
kthxbye.
That is something I find peculiar... (Score:4, Insightful)
When they went to journalling filesystems, by and large a simple mount operation turned into a mini-recovery operation, a pseudo-fsck if you will. This would even happen on read-only mounts, which to me violates the expectation that no disk data will be modified.
JFS had one 'quirk' that I think they got right: journal replay was an fsck-level event. A filesystem with a dirty journal could only be mounted read-only, and the journal replay code was in fsck and had to be run to enable remounting read-write. There are numerous reasons why I stopped using JFS, but that is one point I kinda agreed with their quirkiness on.
To actually try to answer the question... (Score:2, Insightful)
...the three reasons are performance, performance and performance.
Ext4 has extents (and therefore loses indirect blocks), a better on-disk layout policy and generally better algorithms in its allocation code. Of course, performance varies depending on the app in question but we've found that it beats ext2 in almost every respect in our environment. (We don't run ext3 because journals cost performance [by buying reliability] and that's all ext3 gets you: a journal. This is why we wrote and submitted the no-journal hack for ext4.) In particular, ext4 beats ext2 for write-heavy loads by, well, lots. Yes, we've measured this stuff.
So why would one go to ext4 over ext3? Because it's a better file system, not to mention one that's actually (a) being developed and (b) past pre-alpha.
Of course, our environment is a tad different from most. We have *ahem* more than a _couple_ of servers.
Re:EXT4 is not broken? (Score:4, Insightful)
EXT4 is broken.
POSIX requires that writing a file and then renaming it over another behaves as an ordered, atomic operation. Say file B already exists. You write file A, then close it, then rename (mv) it to B. Another program running at the same time opens B and reads it. It will get one of these two results, and NO OTHER RESULT:
1. It sees the old contents of B
2. It sees what was written to A.
EXT4 (before these patches) could produce the following result if your machine crashes and you start it again and look at B:
3. B is empty (B may also be a partially-written version of A, but empty is the most common).
Now it is true that POSIX says that if the machine crashes, all bets are off. So yes, EXT4 is being technically correct. But it would be equally technically correct if all the files on the disk were empty, so this defense is pointless.
EXT4 promises to make crashes recoverable. That implies to me that after you recover from a crash, you will be left in a state allowed by POSIX: either the old contents of B or the new full contents of A. By allowing a different result, EXT4 is breaking its own design and promise.
Re:most apps already did the 2nd; still failed (Score:3, Insightful)
Ah, the sync should come before the rename. I understood the problem as KDE truncating the old file before the sync. If you have the above system, why wouldn't you copy foo to foo.old before working on foo.new? At worst, the user can copy foo.old back to foo, assuming there has been a crash between the foo.new rename and the sync. I thought this was the standard practice that the apps forgot to do.