Minimum Seek Hard Disk Drivers for Unix?
Jonathan Andrews asks: "I remember back in the old days reading about a filesystem/device driver that had almost no seeks of the physical disk. It worked by scanning the heads of the disk from track 0 to the end and then back again in a constant motion. Disk reads and writes were cached so that they got written to disk only when the heads were over that part of the platter. My question is simple: now that disks are IDE, have lots of heads, and (even worse) differing Heads/Cylinder/Sector translation schemes, is this type of system even possible? Would you have to fight the disk cache on the drive? I seem to recall it giving real throughput advantages; if the cache was large enough to hold one sweep's worth of data, then the cache almost never blocked and disk reads/writes sustained the maximum throughput all the time. Best of all, it gets rid of that blasted seeking, chewing, seeking noise!"
Applications block on writes a lot (Score:5, Informative)
The other problem is that applications block on reads. When an application wants to read, it can't do anything else until the blocks come off the disk, so there is contention to get reads done quickly. As a general rule, another read is going to happen very soon, very near the last one; it might be earlier on the disk, it might be later. Linus proposed an interesting optimization for this on the LKML: create an mmap call that says, "I'm interested in this page of this file, please start putting it here." Then go off and do other work, and when you access that page, if it isn't yet loaded from disk, the pointer access blocks until it is available. These kinds of reads are less pressing than the reads that are blocking the progress of a process.
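A rough sketch of that "start the read now, block only when you touch it" idea. The mmap call Linus described doesn't exist; posix_fadvise(WILLNEED) is the closest standard hint, and the file, size, and offsets here are made up for illustration.

```python
import os
import tempfile

# Create a 1 MiB scratch file to stand in for real data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1 << 20))
    path = f.name

fd = os.open(path, os.O_RDONLY)
# Hint: we'll want the first 64 KiB soon; the kernel may start readahead now.
os.posix_fadvise(fd, 0, 65536, os.POSIX_FADV_WILLNEED)
# ... the application goes off and does other work here ...
data = os.pread(fd, 65536, 0)   # with luck, served straight from the page cache
os.close(fd)
os.unlink(path)
print(len(data))
```

The hint is advisory: the read still works without it, it just blocks longer.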
The reason most people changed the scheduler is that it is faster for a known workload that matters to them, so schedulers got tweaked for that. Buffering a full write cycle is easy, but fsync means some writes are more important than others. Reads are almost always blocking a process, so seeking matters. Now throw mirrors into the mix: having two copies means you can have two independent heads searching for the same information. Oh, but they have to sync every time there is a write.
The other thing is that prior to the Linux 2.5 kernel, Linux spent all its time assuming that a disk had no C/H/S; it was one big flat row of sectors. It explicitly ignored C/H/S in its filesystems and disk-head algorithms. 2.5 is supposed to introduce the concept of a spindle, spindles have C/H/S, and then those can be optimized for.
Now remember that most of these write concerns will become unimportant as soon as battery-backed RAM becomes a standard part of every computer. Then everyone will run journaling filesystems where the journal is stored in the battery-backed RAM (so there are no seeks for writes); you only queue writes to disk when there is pressure on the battery-backed RAM, and you try to do reads as fast as possible. When the battery-backed RAM is nearly full, you start from the inside moving outward, writing every track as you go. Then you can use n-way mirrors to speed up reads.
Kirby
Re:battery backed up ram standard? - not (Score:3, Interesting)
hahaha. tell another one.
Re:battery backed up ram standard? - not (Score:1)
Re:battery backed up ram standard? - not (Score:4, Interesting)
There was a story on Slashdot about exactly that type of card being sold by a company. Google for them; they aren't hard to find.
They will become a standard component in machines when the time is right. I'd pay $1,000 for a 512MB one any day of the week if it had a driver under Linux that made it look like a reliable block device. No questions asked. Having a pair of those and putting a copy of Oracle's redo logs on them would probably double the speed of my Oracle database, and I just paid $15K to have 3 guys come in and get me a 20% increase in speed. If you could make a pair of highly reliable versions of those, I bet I could sell $1 million worth of them at 80% profit in less than 3 months, as soon as word got out about what they can do for database and filesystem performance. I don't have the personal capital to do it, or the technical skills to pull it off; I'm a programmer. And just as soon as I figured it out, somebody in Taiwan would put me out of business in a week, because that is the land of faster, better, cheaper for computer components.
High-end permanent storage with no seek time will become a standard feature on high-end servers just as soon as journaling filesystems become capable of putting their journals on separate devices. Right now ext3 can do that; I'm not sure about the others. Right now the only other real use is for Oracle databases. Everyone's current opinion is to just throw more disks at it. It's cute, but someday they'll figure out that it's highly cost-effective to point filesystems that do lots of fsyncs at an SRAM-based device. It's just another layer of caching, but the cache is permanent storage.
Eventually, they'll become incredibly cheap. Battery-backed RAM or SRAM isn't that expensive, and in volume the price would come down. It's a lot more useful than those damn USB drives that hold 64MB of data, and those things sell in volume. I'm not sure they'll be built into the motherboard, but I'd be surprised if they aren't available for sale within 2 years as an expansion card.
Kirby
Re:battery backed up ram standard? - not (Score:3, Informative)
Re:battery backed up ram standard? - not (Score:2)
Re:battery backed up ram standard? - not (Score:1)
Your original comment concerned mission critical servers. Hopefully these servers are not built on the cheap with "home PC" quality components. Many commercial RAID boxes (Fujitsu/Clariion 4900, e.g.) use battery backed-up cache RAM to ensure data integrity whilst eliminating the sync delay for critical writes. The algorithm for performing the delayed writes efficiently is in the RAID box, not in the OS. The only thing the OS needs to be aware of is that the RAID subsystem is battery backed, so the OS can bypass the syncs.
Re:battery backed up ram standard? - not (Score:2)
We have a pair of redundant ServeRAID 4 cards that have 64MB of battery-backed RAM. Unfortunately, I can't control how much of it is used for what. A good RAM disk that I could tune specifically for my environment would be much better than the caching the ServeRAID cards do.
Want to know why Oracle is the world's fastest database? Because it gives you an incredible amount of flexibility to tailor your solution to your problem. If I could dedicate 1GB of RAM to the problem, I'd easily triple my throughput. No problem.
Kirby
Re:battery backed up ram standard? - not (Score:3, Informative)
So I yanked it out and turned the battery off. But my point is, it's been around for ages, and if I could use it I would.
Re:Applications block on writes a lot (Score:2, Informative)
That depends on the operating system. Operating systems with proper asynchronous I/O support allow applications to issue non-blocking reads. I used to make extensive use of non-blocking reads and writes when I wrote applications for DEC operating systems.
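The pattern is the same regardless of the mechanism: issue the read, keep working, and block only when you actually need the result. Here's a minimal sketch using a thread pool to stand in for real async I/O (POSIX aio, DEC-style QIO); the file name and contents are invented.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Scratch file standing in for real data on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"payload")
    path = f.name

def read_all(p):
    with open(p, "rb") as fh:
        return fh.read()

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(read_all, path)   # the read starts in the background
    busy_work = sum(range(1000))           # the application keeps computing
    data = future.result()                 # block here only if the read isn't done

os.unlink(path)
print(data, busy_work)
```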
Re:Applications block on writes a lot (Score:3, Informative)
Kirby
Lag! (Score:3, Insightful)
With 50% of the disk cached in memory, half the time you'd still have to wait an average of half a sweep cycle to get your data. And it's still an expensive hard drive.
With 0% of the disk cached in memory, you'd have to wait an average of half a cycle for every disk request.
So with any amount of cache in between, you get an overpriced drive and/or poor performance.
Re:Lag! (Score:2, Informative)
maybe for write... (Score:3, Interesting)
That would work fine for writes -- but we already have write-behind caching. We also have read-ahead caching, so that once you've sought to the proper location, the first read pulls in that whole general section in anticipation of future reads from the area -- and if it turns out not to be necessary, it'll eventually be overwritten by later read-ahead.
The problem with what you're proposing, of course, is that there's still the initial seek time to that location.
Why would you defer your read until the head got where it was going "naturally," instead of seeking immediately? It would increase the total time until the read completes.
For example, suppose you are trying to read some data that's almost at the outer edge of the platter, but you issue your request immediately after the arm has hit the outer edge and started moving inward, having already passed the data you need. At this point, a simple seek would be almost instantaneous, since you could just move back to where you needed to be -- but under your "continual motion" scheme, you would have to wait until the arm travelled all the way to the inside of the platter, and then all the way back to where it needed to be.
Of course, your plan would work fine if you had a cache the size of the whole damn platter -- then you could continuously read the whole platter in one sweeping motion and write back to it only when necessary. That is not, however, what I think you meant.
So the take-home lesson: we already have more than adequate write caches (dangerously so -- sometimes a power loss means that megabytes and megabytes of data reported as written to disk are really only waiting to be written, and if you don't power the drive back up before the battery protecting the cache runs out, you risk corrupting your data).
As for "read-behind" caching (i.e., caching reads for requests you're going to receive, rather than requests you've already received), that isn't really feasible.
Note: feel free to correct me, I'm no hard drive expert.
*WHY* elevator seeking (Score:1)
For example, suppose you are trying to read some data that's almost at the outer edge of the platter, but you issue your request immediately after the arm has hit the outer edge and started moving inward, having already passed the data you need. At this point, a simple seek would be almost instantaneous, since you could just move back to where you needed to be -- but under your "continual motion" scheme, you would have to wait until the arm travelled all the way to the inside of the platter, and then all the way back to where it needed to be.
Which pretty much sums up why elevator seeking would be a huge waste for a single process. Meanwhile, other people remember that Novell had (has?) it, and that it did improve average seek times. The key is that elevator seeking maximizes overall I/O throughput when you have a whole lot of processes independently accessing different areas of the disk. That's why it was added to NetWare, back at the beginning of PC time when Novell was the best dang file server on the planet but nothing more than a file/print server.
I can also imagine it being useful for a database, where you have the log on a separate volume, either with no elevator seeking or with it effectively defeated by constant fsyncs, and the data and indices on volumes with elevator seeking turned on. Remember, writing data into the log is typically faster than writing it into the database precisely because it requires less seeking...
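The multi-process win is easy to see with a little arithmetic. This sketch compares total head travel for servicing a queue in arrival order versus one elevator sweep; the cylinder numbers are made up.

```python
def travel(start, order):
    """Total head movement (in cylinders) when servicing `order` in sequence."""
    pos, total = start, 0
    for cyl in order:
        total += abs(cyl - pos)
        pos = cyl
    return total

def elevator_order(start, requests):
    """One elevator sweep: everything at or above the head going up, then back down."""
    up = sorted(r for r in requests if r >= start)
    down = sorted((r for r in requests if r < start), reverse=True)
    return up + down

queue = [10, 95, 22, 60, 85, 5]   # pending cylinder requests from several processes
head = 50
print(travel(head, queue))                        # arrival order: lots of back-and-forth
print(travel(head, elevator_order(head, queue)))  # elevator order: one sweep each way
```

The elevator order cuts total travel by more than half here; with one lone process issuing one request at a time there is no queue to sort, which is exactly the single-process objection above.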
Elevator seeking (Score:1)
Re:Elevator seeking (Score:2)
Done. (Score:5, Informative)
Re:yep (Score:2)
No, ./'s posting scripts rip all of the slashes out of a file:blahblahblah URL. So file:///etc/resolv.conf would become file:etcresolv.conf - not particularly useful, but I guess it protects some Windows users against a future hole in their OS's poor excuse for MIME.
Elevator Seeking (Score:2)
--Snipped from the Novell knowledge base [novell.com]
Elevator seeks reduce the average wait for data to be returned and therefore increase performance. But as you already realized, it has to understand the underlying geometry of the drive.
Disk scheduling (Score:1)
Silberschatz and Galvin's "Operating System Concepts", 5th ed, covers this topic in detail in section 13.2 ("Disk Scheduling"). There, they cover FCFS, SSTF, SCAN, C-SCAN, and LOOK hard disk scheduling.
It was my impression that most newer hard drives actually have this capability built into their firmware, but an OS can manage it too, as S. & G. suggest. I don't know much more about it than that, but this may give you some inroads on where to search for more info, if you're interested.
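For a feel of how those algorithms differ, here's a sketch of SSTF (shortest-seek-time-first) against first-come-first-served, using the kind of request queue that chapter works through; the numbers are a textbook-style example, not real disk data.

```python
def sstf_order(start, requests):
    """Greedy SSTF: always service the pending request nearest the head."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda r: abs(r - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

def total_travel(start, order):
    """Sum of head movements (in cylinders) over a service order."""
    pos, total = start, 0
    for cyl in order:
        total += abs(cyl - pos)
        pos = cyl
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
head = 53
print(total_travel(head, queue))                    # FCFS: arrival order
print(total_travel(head, sstf_order(head, queue)))  # SSTF: far less travel
```

SSTF can starve far-away requests under constant load, which is why SCAN/LOOK variants (the elevator family) are usually preferred in practice.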
Re:Disk scheduling (Score:2)
(After reading the first several chapters I stopped, though, because I felt I was having negative knowledge flow.)
Yeah I remember Novell Netware (Score:2)
Then I got old and cynical and stopped wondering...
Re:Yeah I remember Novell Netware (Score:2)
Re:Yeah I remember Novell Netware (Score:2)
When I turned that off I got a certain Clarion report to go from 45 minutes of thrashing to 5 SECONDS of purring...
Re:Yeah I remember Novell Netware (Score:2)
Can you provide some more info on this? I'm interested in what you did.
Re:Yeah I remember Novell Netware (Score:2)
Check the box and reboot - see? Under the default it tries to write multiple files at once, each fragment requiring a FAT update; without it, it writes one file, updates the FAT, writes another, and so on...
Re:Yeah I remember Novell Netware (Score:1)
Re:Yeah I remember Novell Netware (Score:2)
I remember BSD 4.3 (Score:2, Informative)
The theory is that you divide the disk into cylinder groups (their size depending on your cylinder size). When you write files out to disk, the directory and the files in that directory are all aimed at the same group, or the closest adjacent/available one.
The idea was to use people's logical organization as a guideline for physical layout on the disk. That way seeking is minimized to within some small number of cylinders.
You may have to hold off on writes to let them group efficiently, and in the worst case writes involve a lot of seeking if different processes are each writing single blocks to different directories, but reads go much faster...
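The placement policy itself is tiny. This is an illustrative sketch of the cylinder-group idea, not the actual BSD FFS allocator (the group count and free-block table are invented):

```python
NGROUPS = 8   # hypothetical number of cylinder groups on the disk

def pick_group(dir_group, free_blocks):
    """Place a new file's blocks in its directory's cylinder group if there's
    room; otherwise fall back to the nearest group with free space, so the
    seek from directory to file stays short."""
    if free_blocks[dir_group] > 0:
        return dir_group
    candidates = [g for g in range(NGROUPS) if free_blocks[g] > 0]
    return min(candidates, key=lambda g: abs(g - dir_group))

free = [0, 4, 0, 2, 0, 0, 1, 0]   # free blocks per group (made up)
print(pick_group(3, free))         # directory's own group still has space
print(pick_group(0, free))         # group 0 is full; spill to the nearest group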
Maybe you'd wait around at the same area of the disk after a read while the process just awakened by the finishing I/O uses its timeslice. If it doesn't get scheduled next, or isn't likely to within "X" time, or if other read requests accumulate that are more than "Y" time old, then you service the other processes.
Values of "X" and "Y" depend on disk seek times/latency, timeslice length, process priority, number of disk-blocked processes... etc. They'd have to be adjusted dynamically based on load.
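The "X"/"Y" decision above boils down to one predicate. All names and thresholds here are made up for illustration; real anticipatory schedulers estimate these values from observed per-process behavior.

```python
def should_wait(expected_next_ms, seek_cost_ms, oldest_pending_ms, max_age_ms):
    """Keep the head where it is only if the awakened process is likely to
    issue a nearby request sooner than a seek would cost ("X"), and no
    other pending request has already been starved past "Y"."""
    return expected_next_ms < seek_cost_ms and oldest_pending_ms < max_age_ms

print(should_wait(2, 8, 100, 500))   # cheap wait, nobody starving: wait
print(should_wait(9, 8, 100, 500))   # waiting costs more than the seek: go
print(should_wait(2, 8, 600, 500))   # someone has waited too long: go
```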
But that's what...circa 1989 tech? Good thing we've come so much farther than those primitive algorithms.
Those who do not learn from history are doomed to repeat it.
It's amazing how, even now, many "modern" security concepts are being "discovered" that were already known back in the 60-70's.
Maybe it's the industry's tendency to discard experienced programmers for cheaper meat, so the generational knowledge that used to be passed on over a generation (~20-30 years?) is lost as the "generation" time is shortened to 10-15 years. Dunno. Maybe modern programmers don't have time to read books on what's already been done, because they know that whatever was done in the past, they can do better (i.e., egotistical ignorance)?
hmmm
Dumb question (Score:1)
Re:Dumb question (Score:1)
As for the spindle in the drive wearing out, it doesn't make much difference either way. The platters are spinning at all times anyway unless you specifically spin down the drive.
Tagged Command Queueing? (Score:3, Informative)
Re:Tagged Command Queueing? (Score:3, Insightful)
Re:Tagged Command Queueing? (Score:2)
One thing (the only thing, AFAICT) about Novell: I don't know of any other OS that could provide multiple filesystem mounts to hundreds of PCs from a 386 with 128MB of RAM. Novell did this routinely.
Re:Tagged Command Queueing? (Score:2)
Netware 6 still kicks some serious ass - it's easy to get limited-user eval versions to mess around with... it's still the only server out there designed from the ground up specifically to be JUST A SERVER.
Supports any type of client these days, too. Comes w/ Apache preinstalled, even... the list goes on.
I still think it's fun to play with.
Geometry (Score:5, Informative)
I wouldn't say that disks have lots of heads - 2 to 4 is probably typical, and there are models that only have 1 (e.g. Maxtor "Fireball 3" series).
Don't pay any attention to the BIOS geometry that claims 16 heads - those numbers are pure fiction. In fact the whole concept of the C/H/S geometry is obsolete. It assumes that the number of sectors per track is constant, when in fact it varies. Outer tracks have a larger circumference, so they can hold more sectors. A drive might have 30 different zones, each with a different number of sectors per track.
Tools like "zcav" (bundled with the "bonnie++" benchmark utility) will show this quite clearly, because the sustained transfer rate of a disk is proportional to the number of sectors per track. The rotation rate is constant, so more sectors-per-track means more sectors-per-second passing under the head.
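The arithmetic behind zcav's sloping graph is simple: bytes per second = sectors per track x bytes per sector x revolutions per second. The zone table below is invented for illustration; real drives publish nothing like it.

```python
RPM = 7200
revs_per_sec = RPM / 60           # 120 revolutions per second
SECTOR_BYTES = 512

# Hypothetical zone table: (fraction of LBA range where zone starts, sectors/track).
zones = [(0.0, 900), (0.5, 700), (0.8, 550)]

for start, spt in zones:
    mb_per_s = spt * SECTOR_BYTES * revs_per_sec / 1e6
    print(f"zone at {start:.0%} of disk: {spt} sectors/track -> {mb_per_s:.1f} MB/s")
```

Since the rotation rate never changes, the outer zones (more sectors per track) transfer proportionally faster, which is exactly the downward slope zcav plots from the outer tracks inward.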
Drives tend not to expose the information you need to translate logical block addresses into physical locations. Apart from the zones, you also have the problem of re-mapped sectors (where a spare physical sector is substituted at the logical address of a failed one).
p.s. "drivers/block/elevator.c" in the Linux kernel might be of interest.
Freeblock scheduling (Score:2)
Surely possible, especially if you only go forward (Score:2)
By the way, remember that drives don't actually have as many heads as the BIOS shows. The values the BIOS reports rarely have anything to do with the real hardware anymore; they're left over from the days of relatively low track-count limits, when drives would lie, doubling or quadrupling the head count and cutting the track count in half or quarter.
on-drive translation defeats elevator sorting (Score:2, Insightful)
All the drives on the market today do internal translation from the C/H/S numbers they advertise to their own internal structure. So I think that implementing elevator sorting or other such algorithms at the OS level is a waste of time unless you know the mapping method (which isn't going to be the same on all drives). You can guess that increasing linear addresses correspond to higher cylinders, but even if that were true, you can't tell where the boundaries are, since not all cylinders are created equal.
Furthermore, drives have large caches today which depending on how they are used (and I imagine they are used in write-back mode mostly) effectively hide the disk operations from the OS.
I would assume that this kind of stuff should be performed in the drive's firmware anyway, if it's not already. If the communication protocol supports sending multiple commands to the drive (I think SCSI does that, not sure about IDE) then the drive could reorder the requests to make better use of the drive geometry.
First use of this technology... 1984 (Score:3, Informative)
A Fast File System for UNIX (1984)
Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, Robert S. Fabry
http://citeseer.nj.nec.com/mckusick84fast.
Current BSD's have this capability, but it is generally disabled, because modern disk drives *lie* about their physical seek boundaries.
Theoretically, you can work around this with SCSI disks by reading the physical geometry off of mode page 2, and then taking it into account when laying out data, to avoid seeks. Maxtor also has a vendor private command for getting this information from some of the more modern ATA drives.
The BSD FFS code can't handle this information without work, though, because the code is very simple, and supports only the idea of uniform length tracks, and does simple math, rather than a table lookup (but it's not that hard to change).
Practically, you could expect a significant speedup now that the ratio of spindle speed to seek speed makes seeks significant again on 10,000 RPM drives; for most drives over the last 5-6 years, though, the seek is in the noise, and it's not that big a win (stepper motors vs. voice coils was the big change that made it matter much less, the first time).
-- Terry
Re:First use of this technology... 1984 (Score:2)
1984? Hardly. Concurrent Computer had this technology available on its line of minicomputers since the late 70's. They called it "sweep" instead of elevator, but the concept was the same.
On a related theme, I seem to recall that Tivo wrote their own seek scheduler for their PVRs - you might want to check out their source to see if it's "elevator" or not.
The Name is Log-Structured Filesystems (Score:5, Informative)
The original work was done by Mendel Rosenblum, one of the founders of VMware and the most recent (2002) winner of the ACM SIGOPS Mark Weiser award.
The problem, as it turned out, was the cleaner. It put too much load on the disk. The original theory was that the cleaner would run overnight, but on a continuously loaded system there was never idle time to use to run it.
In 20/20 hindsight, the idea was clearly flawed. If you look at my list above, you'll see that you're getting rid of scrambled writes by giving up sequential reads. Since reads are cached, you're (on average) giving up one approximately sequential read to get one sequential write. But that's wrong, because occasionally the cache misses, so instead you give up 1.1 (or 1.001) sequential reads to earn one sequential write. Worse, you also have to pay overhead to the cleaner.
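To make the trade concrete, here's a toy log-structured store: every write appends at the log tail (the sequential write LFS buys), reads chase an index into the log (the scattered reads it pays with), and a cleaner must copy live data forward to reclaim dead space. The names and layout are illustrative only, nothing like the real LFS on-disk format.

```python
class LogStore:
    def __init__(self):
        self.log = []      # append-only log, one record per slot
        self.index = {}    # key -> offset of the latest version

    def write(self, key, value):
        """Every write is sequential: it always lands at the tail of the log."""
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def read(self, key):
        """Reads jump to wherever the latest version happens to live."""
        return self.log[self.index[key]][1]

    def clean(self):
        """The cleaner: copy only live records into a fresh log segment."""
        live = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.log, self.index = [], {}
        for k, v in live:
            self.write(k, v)

s = LogStore()
s.write("a", 1); s.write("b", 2); s.write("a", 3)   # the second "a" kills the first
print(len(s.log))            # 3 records in the log, one of them dead
s.clean()
print(len(s.log), s.read("a"))   # 2 live records remain; "a" reads back its latest value
```

The `clean()` call is the part that killed the real system: that copying is pure overhead, and on a continuously loaded machine there's never a quiet moment to pay it.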
I can argue strongly that the only reason LFS ever saw the light of day was that the benchmarks used to evaluate it wound up highlighting its strengths and hiding its weaknesses. I don't think that was intentional, but it's what happened.
The most recent LFS work was by Drew Roselli, in the late 90's. She identified a lot of the causes of slowdowns in the original system, and found ways to mitigate them. Even so, though, the system has never lived up to its promise.
BTW, don't confuse LFS with journaling filesystems such as ReiserFS, XFS, and ext3. LFS had some journaling aspects, but its focus was performance rather than crash-proofing. One can argue that LFS influenced journaling filesystems, but it's not the same.
Re:The Name is Log-Structured Filesystems (Score:1)
I don't understand this one. Surely if you have a lot of RAM and a write back cache, the writes can wait indefinitely for the head to be in a convenient location.
Re:The Name is Log-Structured Filesystems (Score:2)
You can't wait indefinitely, of course, for two reasons: first, there's a limit to the amount of cache, and second, you want to sync stuff to disk for reliability. But I would have been more accurate to say that the leftover writes cause more head motion than absolutely necessary.
Leave it to the disks (Score:3, Insightful)
So if you have a SCSI drive, just give everything to the drive and let it get on with the job. I believe SATA drives also have the theoretical capability to do this, but most of them don't actually provide it yet.
When you have given what you can to the drive, think about multiple drives: mirrors; RAID 4/5, with the overhead of parity generation; multiple filesystems vs. striping.
Get coffee..
WAFL from NetApp? (Score:2, Interesting)
What you want is Tag Queuing (Score:1)
SCSI drives have been doing it internally for a long time now. IDE drives are just starting to get it working. Link [extremetech.com]
Practical C/H/S with a bench ? (Score:1)
Accessing the disk through LBA at a low level with precise timing lets you discover when the heads seek. From that you can compute the actual C/H/S geometry and its mapping to LBA. Write the map out once to a file, along with the disk ID, then use the map with an ad-hoc driver that optimizes reads and writes accordingly.
Computing the map will be time-consuming, but it only needs to be done once.
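The boundary-detection step of that probing might look like this. The latency profile below is synthetic (real probing needs raw device access and sub-millisecond timing); the idea is that a spike in per-LBA read latency marks a head switch or track boundary, and the spacing of the spikes gives you sectors per track.

```python
# Synthetic per-LBA read latencies: cheap sequential reads, with a spike
# every 8 sectors where the (imaginary) drive switches tracks.
latencies = [0.1] * 8 + [0.9] + [0.1] * 7 + [0.9] + [0.1] * 8

# A read that takes several times longer than its predecessor marks a boundary.
boundaries = [i for i in range(1, len(latencies))
              if latencies[i] > 3 * latencies[i - 1]]

print(boundaries)   # inferred track starts; their spacing estimates sectors/track
```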