Forgot your password?
Linux Software

Minimum Seek Hard Disk Drivers for Unix? 58

Posted by Cliff
from the optimizing-tricks dept.
Jonathan Andrews asks: "I remember back in the old days reading about a filesystem/device driver that had alomost no seeks of the physical disk. It worked by scanning the heads of the disk from track 0 to the end then back again in a constant motion. Disk reads and writes where cached so that they got written to disk only when the heads where on that part of the platter. My question is simple, now that disks are IDE, have lots of heads and even worse differing Heads/Cylander/Sector translation scemes is this type of system even possible? Would you have to fight the disk cache on the drive? I seem to recall it giving real throughput advantages, if the cache was large enough to hold 'one sweep times worth of data' then the cache almost never blocked and disk writes/reads sustained at the max throughput all the time. Best of all it gets rid of that blased seeking, chewing, seeking noise!"
This discussion has been archived. No new comments can be posted.

Minimum Seek Hard Disk Drivers for Unix?

Comments Filter:
  • by ComputerSlicer23 (516509) on Monday March 03, 2003 @08:09PM (#5428577)
    A lot of mission critical applications block on writes. Take a look at a mail server, or a database server. They call fsync an a lot. You don't want that fsync call to return until after the data is written to the filesystem. So it's good for thruput on those systems to ensure the system queues writes that are being waited on by fsync get done really soon now. So seeking directly to where they need to go to write is important for performance.

    The other problem is that all applications block on reads. So when an application wants to read, it can't do anything else until the blocks are read off disk. So there is contention to get reads done quickly. As a general rule another read is going to happen very soon very near that, it might be earlier, it might be later. Linus proposed an interesting optimization to this at on the LKML suggesting that you create an mmap call that says, I'm interested in this page of this file, please start putting it here. Then go off and do other work, and when you access that page, if it isn't yet loaded from disk, the pointer access blocks until it is available. Then these types of reads are less pressing the reads which are blocking progress of a process.

    The reason most people changed the scheduler is because it is faster for a known workload that is important. So they got tweaked out. Buffering a full write cycle is easy, but the fsync means some writes are more important then others. Reads are almost always blocking a process. So seeking is important. Now throw mirrors into the mix. Where having two copies means you can have two independant heads searching the same information. Oh, but they have to sync everytime there is a write.

    The other thing is that prior to the Linux 2.5 kernel, Linux had spent all it's time assuming that a disk had no C/H/S, it was a big flat row of sectors. It explicitly ignored the C/H/S for filesystems and disk head algorithms. 2.5 is supposed to introduce the concept of a spindle, and spindles to have C/H/S, and then they can be optimized for.

    Now remember that most of write concerns will become unimportant as soon as battery backed up becomes a standard part of every computer. Then what will happen, is everyone will run journaling filesystems where the journal is stored on the battery backed up RAM (so there are no seeks for writes), then you only queue writes when there is pressure on the battery backed up ram, and try and do reads as fast as possible. When the battery backed up ram is nearly full, you start from the inside moving outside, writting every track as you go. Then you can use n-way mirriors to speed up reads.


    • you think computers made entirely of commodity components are going to start adding additional ram and batteries to the mix as a standard feature?

      hahaha. tell another one.
      • Do you really think anyone's ever going to need more thank 640kb of RAM? C'mon.. "get REAL"!
      • by ComputerSlicer23 (516509) on Monday March 03, 2003 @10:28PM (#5429681)
        No I don't believe battery backed up RAM will be the standard RAM in a machine. However, I won't be surprised to see cards come out that have that (any high end SCSI card has it now). I wouldn't be surprised at all to see servers released with 128MB of battery backed up RAM at all. They are what makes high end SAN's and Network Storage work now. It'll migrate to lower end systems.

        There was a story about exactly that type of card sold by a company on slashdot. Google for them, they aren't that hard to find.

        They will become a standard component in machines when the time is right. I'd pay $1,000 for a 512MB one every day of the week if it had a driver under linux that made it look like a block device that was reliable. No questions asked. Having a pair of those and putting a copy of Oracle's Redo logs on them would probably double the speed of my Oracle database, and I just paid $15K to have 3 guys come in and get me a 20% increase in speed. No questions asked, I'd pay for one. If you could make a pair of highly reliable versions of those, I bet I could sell $1 million dollars at 80% profit of them in less then 3 months as soon as word got out about what they can do for Database and filesystem performance. I don't have the personal capital to do so, or the technical skills to pull it off. I'm a programmer. Just as soon as I figured it out, somebody in Taiwan would put me out of business in a week, because that is the land of faster, better, cheaper for computer components.

        High end, permanent storage that has no seek time, will become a standard feature on highend servers just as soon as journalling filesystems become capable of putting them on seperate devices. Right now ext3 can do that. I'm not sure about the others. Right now the only other real use is for Oracle Database. Everyones current opinion is just throw more disks at it. It's cute, but someday they'll figure out that it's highly cost effective to do have filesystems that do lots of fsync's on a SRAM based filesystem. It's just another layer of caching, but the caching is permanent storage.

        Eventually, they'll become incredible cheap. Battery backed up ram or SRAM isn't that expensive, and in volumne it would come down. Its a lot more useful then those damn USB drives that hold 64MB of data. Those things sell in volume. I'm not sure they'll be built into the motherboard, but I'd be surprised if they aren't available for sale within 2 years as an expansion card.


        • There are some nice-looking battery-backed RAM cards here [], but I've never used them and don't know how much they cost. Available in capacities of 128M to 1G, with two redundant Li-ion batteries and a 64-bit PCI interface.
        • A lot of mission critical applications block on writes. Take a look at a mail server, or a database server...

          Your original comment concerned mission critical servers. Hopefully these servers are not built on the cheap with "home PC" quality components. Many commercial RAID boxes (Fujitsu/Clariion 4900, e.g.) use battery backed-up cache RAM to ensure data integrity whilst eliminating the sync delay for critical writes. The algorithm for performing the delayed writes efficiently is in the RAID box, not in the OS. The only thing the OS needs to be aware of is that the RAID subsystem is battery backed, so the OS can bypass the syncs.

          • Yes they are all IBM servers, redundant power supplies all the way, 2 way Xeon processors, but the SCSI subsystem is external cases of off the shelf SCSI LVD drives (mirrored on different controllers, on in different external cases, and redundant power supplied). We had a home made machine, that was more reliable, and caused a lot fewer headaches then this "enterprise class" server we bought from IBM. Just rebooting them causes a series of cascading BIOS errors, which cause the ServeRAID cards to come up, and believe they've lost their configuration, which means if it actually boots, you lose all your mirrors. Followed by the fact that on a top of the line IBM card, you can't get more then 30MB/sec to the platters while resync'ing the mirrors under Linux (I called IBM, they said sounds about right to us). They are nice, very fast machines, and it impresses people to see them. The BIOS, and the tempermental SCSI cards make them a royal pain in the ass every time you reboot. We don't do it often, but just what you want is an hour long problem right after an unexpected reboot. Yeah, makes my day.

            We have a pair of redundant ServeRAID 4 cards, that have 64MB of battery backed up RAM. Unfortuantly, I can't specify that control how much of that is used for what. A good RAM Disk that I could tune specifically for my environment would be much better then the caching that the ServeRAID cards do.

            Want to know why Oracle is the world's fastest database? Because it gives you an incredible amount of flexibility to tailor your solution to your problem. If I could dedicate 1GB of RAM to the problem, I'd easily triple the speed of my thru put. No problem.


      • Nothing new about it. My SunSparc 20 has 2MB of non-volatile RAM. Well, actually it doesn't any more beccause Linux doesn't do anything with it (yet). I don't even think Solaris on its own either, you need to use some Storage Management crappy thing.

        So I yanked it out and turned the battery off. But my point is, it's been around for ages, and if I could use it I would.

    • The other problem is that all applications block on reads.

      That depends on the operating system. Operating systems with proper asynchronous I/O support allow applications to issue non-blocking reads. I used to make extensive use of non-blocking reads and writes when I wrote applications for DEC operating systems.

      • That's true. I felt like giving up on my post before I got to covering that subject. Async I/O is in most (all?) Unices. The other way to handle that case is to use threads if kernel threads are supported on the O/S. Most people I know generally don't use Async I/O, because it's much harder to use, and it's harder to successfully organize your programs so it works out very well for you. Just one of those things.


  • Lag! (Score:3, Insightful)

    by ptaff (165113) on Monday March 03, 2003 @08:14PM (#5428617) Homepage
    With 100% of the disk cached in memory, you'd get a pretty expensive hard drive.

    With 50% of the disk cached in memory, half the time, you'd have to wait for an average half-cycle to get your data. And still an expensive hard drive.

    With 0% of the disk cached in memory, you'd still have to wait for an average half-cycle for all disk requests.

    Now with any amount of cache, you get an overpriced drive and/or poor performance.

    • Re:Lag! (Score:2, Informative)

      by Orthanc_duo (452395)
      Without Registers and L2 cache modern Ram wouldn't have a hope... Without a cache you disk access would be unberable.
  • maybe for write... (Score:3, Interesting)

    by 3-State Bit (225583) on Monday March 03, 2003 @08:22PM (#5428687)
    Disk reads and writes were cached so that they got written to disk only when the heads were on that part of the platter
    That would work fine for writes -- but we already have write-behind cache. We also have read-ahead cache, so that once you've sunch to the proper location, the first read will result in that whole general section's being read, in anticipation of future reads from that area -- if it turns out not to be necessary, it'll eventually be overwritten by future read-ahead caches.

    The problem with what you're proposing, of course, is that there's still the inital seek time to that location.

    Why would you defer your read until you got to where you were going "naturally", instead of doing so immediately? It would increase the total time until read.

    For example, suppose you are trying to read some data that's almost at the edge of the outer ring, but that you issue your request immediately after the read arm has hit the edge outward and started going inward, having already passed the data you need. At this point, a simple seek would be almost instantanious, since you could just move back to where you needed to be -- but under your "continual motion" scheme, you would need to wait all the way until the arm travelled to the inside of the platter, then all the way until it travelled back to where it needed to be again.

    Of course, your plan would work fine if you had a cache the size of the whole damn platter your reads were coming from -- then you could conintuously read in one swerving motion the whole platter, and write back to it only when necessary. This is not, however, what I think you meant.

    So take-home lesson: We already have more than adequate write caches (dangerously so -- sometimes power loss means that megabytes and megabytes of data that have been reported as written to disk are only waiting to be written to disk, and if you don't power up the hard-drive before the battery runs out protecting the cache, you risk corrupting your data.)

    As for "read-behind caches" (i.e. reads to data based on requests you're going to receive, not based on requests you've already received), isn't really feasable.

    Note: feel free to correct me, I'm no hard drive expert.
    • OK, 3-State Bit sez:

      For example, suppose you are trying to read some data that's almost at the edge of the outer ring, but that you issue your request immediately after the read arm has hit the edge outward and started going inward, having already passed the data you need. At this point, a simple seek would be almost instantanious, since you could just move back to where you needed to be -- but under your "continual motion" scheme, you would need to wait all the way until the arm travelled to the inside of the platter, then all the way until it travelled back to where it needed to be again.

      Which pretty much sums up why elevator seeking would be a huge waste for a single process. Meanwhile other people remember that Novell had (has?) it and that it did improve average seek times. The key is that elevator seeking maximizes overall I/O throughput when you have a whole lot of processes independently accessing different areas of the disk. That's why it was added to Novell, back in the beginning of PC time when Novel was the best dang file server on the planet but nothing more than a file/print server.

      I could also imagine that it could be useful for a database, where you have the log on a separate volume, either with no elevator seeking or with it effectively defeated by constant fseeks, and the data and indices on volumes with elevator seeking on. Remember data into the log is typically faster than writing it into the database precisely because it requires less seeking...
  • It's called elevator seeking, and Novell Netware 6 and later already does it. Or so my Novell class teacher said.
  • A process where the hard disk read-write head picks up data in the direction it is traveling across the disk, rather than in the order data is requested. In this way, disk I/O requests are organized logically according to disk head position as they arrive at the server for processing. This reduces back-and-forth movements of the disk head and minimizes head seek times.
    --Snipped from the Novell knowledge base []

    Elevator seeks reduce the average wait time for data to be returned and therefore increase performance. But as you already realized it has to understand the underlying geometry of the drive.

  • FYI ...

    Silberschatz and Galvin's "Operating System Concepts", 5th ed, covers this topic in detail in section 13.2 ("Disk Scheduling"). There, they cover FCFS, SSTF, SCAN, C-SCAN, and LOOK hard disk scheduling.

    It was my impression that most newer hard drives actually had this capability built into firmware, but an OS could manage it too, as S. & G. suggest. Don't know much more about it than that, but this may give you some inroads on where to search for more info, if you're interested.
    • i didn't realize silberschatz and etc covered anything in detail in that book.

      (after reading the first several chapters, i stopped reading though, cause i felt i was having negative knowledge flow)
  • had this neat technology where they sorted the write jobs by cylinder so the disk head made fewer tracks back and forth on the disk. Elevator writing or some such they called it. I used to wonder why ALL disk IO wasn't like that.

    Then I got old and cynical and stopped wondering...

    • well, most(all?) modern disk schedulers do use an elevator algorithm, but you wouldnt want you elevator algorithm to absolutely strict, because then if you had three write opperations going, a big, a small one located near the big one, and a small one far away from the big one, and they were all made at the same times, it could start on the big one, write some of the small one, write more of the big one, finish the small one, work one the big one, and all that time the other small block hasnt been written out to disk and is causing your mail server to block on an fsync. So elevator algorithms are use, they just arent absolute elevators, they make some account for order of writes.
      • In the same vien as your excellent comment, why is it that in MS Windows 95/98 there is a default for simultainous buffer commits that cause the disk heads to go crazy?

        When I turned that off I got a certain Clarion report to go from 45 minutes of thrashing to 5 SECONDS of purring...

        • In the same vien as your excellent comment, why is it that in MS Windows 95/98 there is a default for simultainous buffer commits that cause the disk heads to go crazy?

          When I turned that off I got a certain Clarion report to go from 45 minutes of thrashing to 5 SECONDS of purring...

          Can you provide some more info on this? I'm interested in what you did.
          • I've lost my source of this, but AIR under computer properties | file systems | troubleshooting - disable simultaneous buffer commits is an option in win 95/98. Look around for it.

            Check the box and reboot - see - under the default it tries to write multiple files at once, each fragment requiring a FAT update, without it it does one file, updates FAT, does another etc....

        • I hope you don't mean clariion. e.g. you're running Win95/98 on expensive EMC storage stuff...
    • I remember BSD 4.3 (Score:2, Informative)

      by lpq (583377)
      It wasn't something so mindless as an elevator that can only go one direction...They had a concept of "zones".

      The theory is you divide up the disk into cylinders or groups of cylinders (depending on your cylinder size). When you write files out to disk, the directory and the files for the directory were all aimed at the same zone or a closest adjacent/available zone.

      The idea was to use people's logical mapping as a guideline in physical layout on the disk. That way seeking is minimized within some small number of cylinders.

      You, maybe, have to hold off on writes to allow them to group to be efficient, or worst case writes involve much seeking if different processes are writing single blocks to different directories with each write, but reads go much faster...

      Maybe you would wait around after a read at the same area of the disk while the process that was just 'awakened' by the finishing of the disk I/O finishes it's timeslice. If it doesn't get scheduled next or isn't likely to within "X" time, or if one accumulates other read requests that are more than "Y" time old, then you services the other processes.

      Values of "X" and "Y" depend, on disk seek times/latency, timeslice length, process priority, number of disk-blocked processes...etc. They'd have to be dynamically altered based on load.

      But that's what...circa 1989 tech? Good thing we've come so much farther than those primitive algorithms. :-(

      Those who do not learn from history are doomed to repeat it.

      It's amazing how, even now, many "modern" security concepts are being "discovered" that were already known back in the 60-70's.

      Maybe it's the tendancy in the industry to discard experienced programmers for cheaper-meat so generational knowledge that can be passed on over a generation (~20-30 years?) is lost as the "generation" time is shorted to 10-15 years. Dunno. Maybe modern programmers don't have time to read books on what's already been done because they know that whatever has been done in the past, they can do better (i.e. Egoism ignorance)?

  • Wouldn't that cause extra wear on the drive?
    • Only if you're not actually accessing the drive much (in which case elevator seeking like this would likely be a bad idea). Otherwise, it would probably cause less wear and tear as you're not constantly changing the acceleration of the actuator and subjecting it to stress. It's more likely that your spindle will seize up long before the actuator dies, though.

      As for the spindle in the drive wearing out, it doesn't make much difference either way. The platters are spinning at all times anyway unless you specifically spin down the drive.
  • by deviator (92787) <bdp&amnesia,org> on Monday March 03, 2003 @10:52PM (#5429860) Homepage
    The technology the poster is referring to is called "Elevator Seeking" and was originally included in early versions of Novell Netware (and is still in current versions, as it really does improve access time.) Here's Novell's official definition. []
    • by Anonymous Coward
      "Originally?" You've got to be kidding. This goes back to the first disk drives, long, long before Novell.
    • I think I've seen this at work many years ago. I remember watching an array being operated by Novell 4.something. No matter what time of day or night (which dramatically varied the load) the lights on the array would pulse simultaneously about once per second. I surmised that the OS was caching writes in RAM and flushing the data to storage asynchronously according to it's scheduler.

      One thing (the only thing afaict) about Novell; I don't know of any other OS that could provide multiple filesystem mounts to hundreds of PCs from a 386 with 128MB of RAM. Novell did this routinely.
      • still does - albeit from a PIII or better with Netware 6.

        Netware 6 still kicks some serious ass - it's easy to get limited-user eval versions to mess around with... it's still the only server out there designed from the ground up specifically to be JUST A SERVER. :)

        Supports any type of client these days, too. Comes w/ Apache preinstalled, even... the list goes on.

        I still think it's fun to play with.
  • Geometry (Score:5, Informative)

    by mmontour (2208) <> on Monday March 03, 2003 @11:07PM (#5429947)
    My question is simple, now that disks are IDE, have lots of heads and even worse differing Heads/Cylander/Sector translation scemes is this type of system even possible?

    I wouldn't say that disks have lots of heads - 2 to 4 is probably typical, and there are models that only have 1 (e.g. Maxtor "Fireball 3" series).

    Don't pay any attention to the BIOS geometry that claims 16 heads - those numbers are pure fiction. In fact the whole concept of the C/H/S geometry is obsolete. It assumes that the number of sectors per track is constant, when in fact it varies. Outer tracks have a larger circumference, so they can hold more sectors. A drive might have 30 different zones, each with a different number of sectors per track.

    Tools like "zcav" (bundled with the "bonnie++" benchmark utility) will show this quite clearly, because the sustained transfer rate of a disk is proportional to the number of sectors per track. The rotation rate is constant, so more sectors-per-track means more sectors-per-second passing under the head.

    Drives tend not to expose the information you need to translate logical block addresses into physical locations. Apart from the zones, you also have the problem of re-mapped sectors (where a spare physical sector is substitued at the logical address of a failed one).

    p.s. "drivers/block/elevator.c" in the Linux kernel might be of interest.
  • You can increase your throughput by taking advantage of the drive heads' positions en route to do something else (seek-wise, rotational latency-wise too, apparently). It's called freeblock scheduling. An introduction is online [], and other papers are around; this one makes reference to a 300% speedup when there is no idle time on a drive.
  • Many of the BSDs do this in the background unless told not to, always reading/writing forward to the next logical sector in the queue, then skipping all the way back to the earliest sector in the queue when there are no more later sectors available before a sync barrier. If you don't turn around and operate backwards, you don't have to worry about the logical versus physical arrangement. I don't know what Linux does, but I'll guess it's got similar.

    By the way, do remember that drives don't actually have so many heads as you see in the bios. The values the BIOS reports rarely have anything to do with the real hardware anymore. This is just left over from the days when you had relatively low track limits, and so drives would lie, doubling or quadrupling heads and cutting track counts in half or quarter.

  • All the drives on the market today do internal translation from the C/H/S numbers they advertize to their own internal structure. So I think that implementing elevator sorting or other such algorithms at the OS level is a waste of time unless you know the mapping method (which isn't going to be the same on all drives). You can guess that increasing linear addresses correspond to higher cylinders but even if that were true, you can't tell where the boundaries are, since not all cylinders are created equal.

    Furthermore, drives have large caches today which depending on how they are used (and I imagine they are used in write-back mode mostly) effectively hide the disk operations from the OS.

    I would assume that this kind of stuff should be performed in the drive's firmware anyway, if it's not already. If the communication protocol supports sending multiple commands to the drive (I think SCSI does that, not sure about IDE) then the drive could reorder the requests to make better use of the drive geometry.

  • by tlambert (566799) on Tuesday March 04, 2003 @03:39AM (#5431140)
    First use of this technology... 1984:

    A Fast File System for UNIX (1984)
    Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, Robert S. Fabry tml ...otherwise known as BSD FFS.

    Current BSD's have this capability, but it is generally disabled, because modern disk drives *lie* about their physical seek boundaries.

    Theoretically, you can work around this with SCSI disks by reading the physical geometry off of mode page 2, and then taking it into account when laying out data, to avoid seeks. Maxtor also has a vendor private command for getting this information from some of the more modern ATA drives.

    The BSD FFS code can't handle this information without work, though, because the code is very simple, and supports only the idea of uniform length tracks, and does simple math, rather than a table lookup (but it's not that hard to change).

    Practically, you could expect a significant speedup, now that the relative spindle speed vs. seek speed makes seeks significant on 10,000 RPM drives; for most drives for the last 5-6 years, though, the seek is in the noise, and it's not that big a win (stepper moters vs. voice coils was the big change that made it matter much less, the first time).

    -- Terry

    • 1984? Hardly. Concurrent Computer had this technology available on its line of minicomputers since the late 70's. They called it "sweep" instead of elevator, but the concept was the same.

      On a related theme, I seem to recall that Tivo wrote their own seek scheduler for their PVRs - you might want to check out their source to see if it's "elevator" or not.

  • by one-egg (67570) <> on Tuesday March 04, 2003 @03:50AM (#5431172) Homepage
    Contrary to what other posters have said, the original questioner was not asking about disk scheduling algorithms such as SCAN (elevator) or C-SCAN. Rather, the system he was recalling was the Log-Structured Filesystem [] (LFS) from Berkeley. The original work was done in the early 90's. The basic ideas were as follows:
    1. Most disk activity is reads.
    2. If you have a lot of RAM, caches do a good job of taking care of the reads.
    3. The leftover writes drive the head crazy.
    4. It is therefore a Good Idea to do the writes whereever the head happens to be, and let the disk be scrambled.
    The LFS operated by creating a "log" in which all blocks were written sequentially. Reads required random seeks, but the cache was supposed to take care of that. Eventually the log filled, after which a cleaner (which ran in background) would recover the blocks discarded by deletes, and those blocks would be reused.

    The original work was done by Mendel Rosenblum, one of the founders of VMware and the most recent (2002) winner of the ACM SIGOPS Mark Weiser award.

    The problem, as it turned out, was the cleaner. It put too much load on the disk. The original theory was that the cleaner would run overnight, but on a continuously loaded system there was never idle time to use to run it.

    In 20/20 hindsight, the idea was clearly flawed. If you look at my list above, you'll see that you are getting rid of scrambled writes by giving up sequential reads. Since reads are cached, you're (on average) giving up 1 approximately sequential read to get 1 sequential write. But that's wrong because occasionally the cache misses, so instead you give uyp 1.1 (or 1.001) sequential reads to earn 1 sequential write. Worse, you also have to pay overhead to the cleaner.

    I can argue strongly that the only reason LFS ever saw the light of day was that the benchmarks used to evaluate it wound up highlighting its strengths and hiding its weaknesses. I don't think that was intentional, but it's what happened.

    The most recent LFS work was by Drew Roselli, in the late 90's. She identified a lot of the causes of slowdowns in the original system, and found ways to mitigate them. Even so, though, the system has never lived up to its promise.

    BTW, don't confuse LFS with journaling filesystems such as ReiserFS, XFS, and ext3. LFS had some journaling aspects, but its focus was performance rather than crash-proofing. One can argue that LFS influenced journaling filesystems, but it's not the same.

    • # The leftover writes drive the head crazy.

      I don't understand this one. Surely if you have a lot of RAM and a write back cache, the writes can wait indefinitely for the head to be in a convenient location.
      • I overstated my case.

        You can't wait indefinitely, of course, for two reasons: first, there's a limit to the amount of cache, and second, you want to sync stuff to disk for reliability. But I would have been more accurate to say that the leftover writes cause more head motion than absolutely necessary.

  • by AlecC (512609) <> on Tuesday March 04, 2003 @09:12AM (#5432043)
    Most Scsi disks do this sort of think internally if you give them lots of transfers to do. Furthermore, each drive knows about its own geometry and knows how to optimise for itself. This is significant, because the C/H/S trector tradeoffs are not what they used to be. Conventionally, a headswitch (changeing head on same cylinder)might be expected to be chapeaper than a short seek (changing cylinders on the same head. In fack, the skew between the head amounts to several tracks, so a head switch always includes a seversl-track seek, and it is porribly better to do another transfer that appears to be further away. And the drive knows what its *actual* rotational position is to optimise out rotational latency, not a theoretical position from a model which may not match the actual drive.

    So if you have a Scsi drive, just give everything to the drive and let it get on with the job. I believe SATA drive also have the theoretical capability to to this, but most of them don't actually provide it yet.

    When you have given what you can to the drive, think about multiple drives. Mirrors. Raid4/5, with the overhead of parity generation. Multiple filesystem vs striping.

    Get coffee..

  • WAFL from NetApp? (Score:2, Interesting)

    by reuel (166318)
    Perhaps you are thinking of the Write Anywhere File Layout (WAFL) from Network Appliance. The situation is different for a network storage box because of large write-through caches on all the clients: most of the traffic the box sees are writes. This situation combined with a stripped level 5 RAID array where small writes are really expensive leads you to the battery-backed RAM and WAFL file system described in a paper by Hitz, et al:
  • You can't really do this sort of thing at the filesystem or device driver level anymore because of the internal drive geometry translation.

    SCSI drive have been doing it internally for a long time now. IDE drives are just starting to get this working. Link []
  • Is there any utility to build a pratical C/H/S - LBA map ?
    Accessing the disk through LBA at low level with precise timing let you discover when heads seek. Thus you can compute actual C/H/S geometry and the mapping to LBA. Write once the map to a file along with disk id. Use the map with ad-hoc driver which optimize reads and writes accordingly.
    Compute this map will be time consuming but it need to be done only once.

"Atomic batteries to power, turbines to speed." -- Robin, The Boy Wonder