
News for nerds, stuff that matters

Improving Unix Mail Storage?

Posted by Cliff on Tue May 28, 2002 10:17 PM
from the we-can-improve-him-we-have-the-technology dept.
At first, there was mbox, then there was Maildir, and Bill begat Outlook and .mbx. CaraCalla wonders if there is a better way to store mail than the way we currently store it today. I admit, with the changes that email has undergone over the past 5 years (changes in what is being sent, not necessarily in how it is sent), it may be time to reinvent the mail format. Read on for CaraCalla's analysis of the current mail options, and his thoughts on where we may go in the future. If you were to design your own MUA, how would you design its mail storage?
CaraCalla asks: "Does anybody know a good, free solution for storing mail on unix hosts? The reason that I ask this question is my discontent with available techniques:
  • mbox: There are problems with locking, corruption, access-times, and bloat.
  • Maildir: Do you really want to clutter your system with millions of small files? That's waste of inodes, space (unless perhaps you use Linux/ReiserFS or SGi) and just try to open a Maildir with 1000+ mails and see how long it takes your favorite Mailprogram to only display the subjects.
  • Cyrus: Basically the same as Maildir with database features.
  • UW-Imap mbx: That's classical mbox with extensions allowing multiple access.
  • Evolution: Basically mbox with database features.
  • Windows clients: Typically some proprietary db-format. Pathetic.

But the thing that bugs me most is disk space. Typical inboxes are made of 5% to 10% of Text including Headers and HTML. The rest are BASE64- (or UU-) encoded pictures, word documents, zip archives and so on. The problem here is the encoding which wastes considerable amounts of space (at least one third).
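To put a rough number on the encoding overhead: base64 turns every 3 raw bytes into 4 characters, and MIME-style output adds a line break after every 76 characters. The sketch below measures the expansion factor with Python's standard `base64` module; the function name is made up for illustration, and real mailers often use CRLF line endings, which would add slightly more.

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return encoded_size / raw_size for MIME-style base64 (76-char lines)."""
    # encodebytes() emits 76 output characters per line, each followed by "\n".
    encoded = base64.encodebytes(raw)
    return len(encoded) / len(raw)


if __name__ == "__main__":
    payload = os.urandom(57 * 1000)  # stand-in for a binary attachment
    print(f"expansion factor: {base64_overhead(payload):.3f}")
```

The encoded copy comes out about 35% larger than the raw attachment, so roughly a quarter of the bytes an encoded attachment occupies on disk are pure encoding overhead.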

Some ideas about the ideal mail-storage:

  • One file per Mailbox-folder, allowing multiple folders per user. Should those files reside in one central location or in users Homedirs?
  • Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?
  • File format: gdbm, Sleepycat db? Something new?
  • Should the security model allow users to directly access their files, grep them, copy them around?
  • Shared folders, virtual domains?
  • Unicode support in folder names? Imap message-IDs, flags, useragent specific state-information?
  • How would MTAs deliver mail? How would clients access? File-locking (NFS)?
  • What about backwards-compatibility? Writing libmailstore (anyone)? adopting UW c-client?

Does my ideal mailstorage exist somewhere? Is somebody working on a project addressing this? Does anybody have some other hints? And please no mbox/Maildir flamewar!"

This discussion has been archived. No new comments can be posted.
Improving Unix Mail Storage? | 615 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • One folder to rule them all... (Score:5, Interesting)

    by Pig Hogger (10379) <pig,hogger&gmail,com> on Tuesday May 28 2002, @10:22PM (#3599830) Homepage Journal
    Stuff the mail in one folder, one to rule them all.

    But put multiple indexes (by sender, subject, date, whatever key-classes you want to assign messages) and the possibility to restrict the range displayed. With careful programming, you can manage many users who won't be able to read each other's mail, except as required.

    This way, you can arrange your mail as you please.

    No more message duplication. Send a memo to 250 people? Just send it once, but tag it as readable by the 250 sendees.

    Of course, this calls for an SQL database... :) :) :)
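A minimal sketch of the "one store, many indexes" idea, using SQLite from Python's standard library as a stand-in for whatever SQL database one would really pick. The table names, column names, and helper functions are all hypothetical; the point is only that the body is stored once and a second table records who may read it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS message (
    id      INTEGER PRIMARY KEY,
    sender  TEXT NOT NULL,
    subject TEXT NOT NULL,
    sent_at TEXT NOT NULL,   -- ISO 8601 timestamp
    body    BLOB NOT NULL    -- stored once, however many readers there are
);
CREATE TABLE IF NOT EXISTS reader (
    message_id INTEGER NOT NULL REFERENCES message(id),
    username   TEXT NOT NULL,
    PRIMARY KEY (message_id, username)
);
CREATE INDEX IF NOT EXISTS message_by_sender  ON message(sender);
CREATE INDEX IF NOT EXISTS message_by_subject ON message(subject);
CREATE INDEX IF NOT EXISTS message_by_date    ON message(sent_at);
CREATE INDEX IF NOT EXISTS reader_by_user     ON reader(username);
"""


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def deliver(db, sender, subject, sent_at, body, recipients):
    """Store one copy of the message and tag it as readable by each recipient."""
    with db:  # one transaction: either everything is stored or nothing is
        cur = db.execute(
            "INSERT INTO message (sender, subject, sent_at, body) VALUES (?, ?, ?, ?)",
            (sender, subject, sent_at, body),
        )
        message_id = cur.lastrowid
        db.executemany(
            "INSERT INTO reader (message_id, username) VALUES (?, ?)",
            [(message_id, name) for name in recipients],
        )
    return message_id


def inbox(db, username):
    """List (id, sender, subject) for the messages this user may read."""
    return db.execute(
        "SELECT m.id, m.sender, m.subject FROM message m "
        "JOIN reader r ON r.message_id = m.id "
        "WHERE r.username = ? ORDER BY m.sent_at",
        (username,),
    ).fetchall()
```

Restricting every query through the `reader` table is what keeps users out of each other's mail in this sketch; a real system would also need to enforce that at the server, not in the client.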

  • Improving the Unix filesystem in general by line-bundle (Score:1) Tuesday May 28 2002, @10:23PM
  • hmmm by Cyno (Score:1) Tuesday May 28 2002, @10:25PM
    • Re:hmmm by ewoods (Score:1) Tuesday May 28 2002, @10:28PM
      • Re:hmmm by Cyno (Score:1) Tuesday May 28 2002, @10:38PM
        • Re:hmmm by fejjie (Score:1) Wednesday May 29 2002, @02:32AM
          • Re:hmmm by fejjie (Score:1) Wednesday May 29 2002, @02:34AM
          • Re:hmmm by AdTropis (Score:2) Wednesday May 29 2002, @10:28AM
  • why use a 'file' at all? by nullchar (Score:1) Tuesday May 28 2002, @10:26PM
  • My fast, easy solution. by crazney (Score:2) Tuesday May 28 2002, @10:26PM
  • yEnc by Glytch (Score:2) Tuesday May 28 2002, @10:29PM
    • "Why yEnc is bad for Usenet" by Wyzard (Score:2) Tuesday May 28 2002, @11:18PM
    • Re:yEnc by fejjie (Score:2) Wednesday May 29 2002, @12:55AM
      • Re:yEnc by erc (Score:1) Wednesday May 29 2002, @11:12AM
        • Re:yEnc by fejjie (Score:1) Wednesday May 29 2002, @05:29PM
  • Database storage by AngusSF (Score:1) Tuesday May 28 2002, @10:29PM
  • I like MS Exchange (Score:5, Informative)

    by alen (225700) on Tuesday May 28 2002, @10:29PM (#3599875)
    A single database to hold all of the users' email. Single-instance storage ensures that only one copy of any attachment is in the database at once, no matter how many email messages it was sent in. APIs for backup let you back up the whole database or individual mailboxes. And depending on your backup solution you can restore mailboxes and individual emails. Anti-virus software integrates into the server side of the software. In Exchange 2000, if you accidentally delete a mailbox you can easily bring it back with all emails without restoring from tape. The only files to worry about on the user end are a personal address book and archived email. Unless you use POP3 or it's archived in personal folders, the email always stays on the server, preventing problems like accidentally downloading important emails you need at the office to a home PC. And it's stable. Not as stable as UNIX I admit, but it stays up for months without a reboot. And in my experience most problems are solved by a simple reboot. In 4 years of exposure to Exchange, the only non-admin-related problems I've seen were 1 database corruption where I needed to run a utility and wait 45 minutes for it to work again, and a corrupted MTA that needed a reboot to get it working right again.
  • MSFT will save us! by Boulder Geek (Score:1) Tuesday May 28 2002, @10:29PM
  • XML all gzipped up by global_diffusion (Score:1) Tuesday May 28 2002, @10:34PM
    • XML = unnecessary performance hit by acb (Score:2) Wednesday May 29 2002, @12:18AM
      • Re:XML = unnecessary performance hit (Score:5, Informative)

        by spectecjr (31235) on Wednesday May 29 2002, @03:37AM (#3600799) Homepage
        There should be a standard byte-compiled representation of XML (CXML), which has been flattened into an easily readable data structure. It would be portable, with byte orders indicated in flags (or would just use network byte order, i.e., big-endian), and with fixed-length element start/end headers, and could be used in lieu of XML for machine-machine communications. If a human wants to inspect the data, CXML could be trivially converted to and from XML.

        There is one. It's called ASN.1

        Simon
      • Re:XML = unnecessary performance hit by jbert (Score:2) Wednesday May 29 2002, @04:24AM
    • Re:XML all gzipped up by fejjie (Score:1) Wednesday May 29 2002, @01:02AM
    • Re:XML all gzipped up by ObviousGuy (Score:2) Tuesday May 28 2002, @10:53PM
  • by Dr. Awktagon (233360) on Tuesday May 28 2002, @10:35PM (#3599900) Homepage

    Something like Maildir .. if the FS is slow and can't handle that kind of application, then we need to improve our filesystems!

    Lots of applications need lightweight databases with indexes, locking, and atomic operations. Why not bake this into the filesystem, and it won't have to be just for email, it will have many uses.

    I was thinking about this the other day as I was working on a logging system for a large in-house email filtering system.. similar problem, except instead of storing emails, I'm storing small XML fragments describing the structure of each email and what was done to each. So far the easiest solution was large monolithic XML files, and an external index pointing in the large file (i.e., like mbox + a DB index). As it grows we'll probably have to move it to a "real" database.

    There is a need for something like sleepycat DB + ReiserFS on steroids..
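The "large monolithic file plus an external index pointing into it" layout described above can be sketched for a plain mbox as follows. This is an illustration only, not the poster's logging system: the function names are invented, and it assumes the usual mbox convention that a message starts at a "From " line following a blank line (or at the start of the file).

```python
def build_index(mbox_path):
    """Return a list of (offset, length) pairs, one per message in the mbox."""
    starts = []
    offset = 0
    with open(mbox_path, "rb") as f:
        prev_blank = True  # the start of the file counts as a boundary
        for line in f:
            if prev_blank and line.startswith(b"From "):
                starts.append(offset)
            prev_blank = line in (b"\n", b"\r\n")
            offset += len(line)
    ends = starts[1:] + [offset]
    return [(start, end - start) for start, end in zip(starts, ends)]


def read_message(mbox_path, entry):
    """Fetch one message by seeking straight to its recorded offset."""
    offset, length = entry
    with open(mbox_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

The index would normally be persisted next to the big file and rebuilt whenever the file's size or modification time no longer matches what the index recorded.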

  • The Reiser guys have some ideas. (Score:5, Informative)

    by SwellJoe (100612) on Tuesday May 28 2002, @10:37PM (#3599909) Homepage
    I've followed ReiserFS development for years now, shipping our first servers with it some two years ago (and every box we've shipped since then), and I believe they have the best long-term plan for this kind of thing. Hans has written some excellent white-papers on making small files extremely cheap.

    The eventual goal of Reiser is a filesystem that is indistinguishable from a powerful database (if a special purpose database). The plan is to make small files so cheap that every extension of a file, directory, etc. is just another file. Another interesting turn is that files would no longer be, necessarily, of the form '/big/long/path/to/some/file'...because the filesystem is a database, one could also access it by a category, so that one file read pulls in all of the data of that category (from any number of files). Directories become just one view of the data available, with any number of other views possible depending on the application.

    As was mentioned in the parent, this would lead to things like 250 email recipients and only one actual file. But of course, this leaves out the copy-on-write functionality needed to make this seamless.

    So I think the solution is probably to fix the filesystem--not to fix the email storage mechanism. A number of very smart people have 'fixed' email storage in the past, leading to all of the options we have today, none of which works extremely well on really large mailboxes. Yes, many are good enough, and many work fabulously for small to mid-sized applications. But the day will come when they do not work so well, due to the higher volume and growing average size of emails.

    A good place to start for information about these ideas (which are primarily a consolidation of the most interesting research in the field of filesystems and databases):

    http://www.namesys.com/whitepaper.html

    ReiserFS is good stuff. Give Hans' papers a read sometime.

    BTW-Don't gripe at me about ReiserFS instability, etc. I know better. As I mentioned I've been shipping servers with it for 2 years, and we've never had a single ReiserFS-caused corruption. Not one.

  • Parroting the masses... by Webratta (Score:1) Tuesday May 28 2002, @10:39PM
  • Portability by Pretzalzz (Score:2) Tuesday May 28 2002, @10:41PM
  • Something to keep in mind... (Score:5, Insightful)

    by cwinters (879) on Tuesday May 28 2002, @10:43PM (#3599941) Homepage

    /. punchingbag jwz has some strong [jwz.org] opinions [jwz.org] about using databases (etc.) for mail storage. I tend to agree: everything can read from and write to files, there are no versioning issues, they can be easily transported among different operating and file systems, they can be backed up easily. But it's another wheel to reinvent, so everyone hop to it at once and then lose interest in two or three weeks!

  • Quantum-like Storage (Score:3, Funny)

    by cybermage (112274) on Tuesday May 28 2002, @10:43PM (#3599945) Homepage Journal
    I've been joking for years about getting two shell accounts on opposite sides of the planet and setting each up with procmail to bounce all my mail between the two (always rewriting the header so as to avoid a loop.) I figure at any given time, my mail would be in both places and neither simultaneously.

    If I want to read some, I'd just chmod .procmailrc for a few seconds and change it back. Plenty of mail storage without chewing-up precious file-system quota.
  • Eudora by cpaluc (Score:1) Tuesday May 28 2002, @10:45PM
    • Re:Eudora by kentborg (Score:1) Wednesday May 29 2002, @01:04PM
  • Eudora mbox (Score:4, Informative)

    by 1u3hr (530656) on Tuesday May 28 2002, @10:47PM (#3599965)
    Eudora (Win and Mac) handles encoded attachments by decoding them and storing them in an attachments folder, replacing the encoded text in the message with a line like

    Attachment Converted: "C:\EUDORA\ATTACH\NEW YORK.pps"

    Click on that in Eudora and the attachment opens.


    This keeps the actual text in the mbox file lean. I've got almost a decade of correspondence that totals about 20 MB, if it included all the attachments it'd be much more.

    Also it allows you to edit messages after receipt, (this might trouble some people, but it just simplifies what I used to do by opening the mbx file in a text editor). I can select all the text, then paste it back in. This has the effect of removing all the HTML coding that is especially crufty from Word generated mail -- a 20k message reduces to 1k.
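The behaviour described above (decode each attachment into a folder, leave a one-line pointer in the message) can be approximated with Python's standard `email` package. This is a sketch of the general technique, not Eudora's actual code; `detach_attachments` is a made-up name, and it assumes ASCII file names and that any part carrying a file name counts as an attachment.

```python
import email
import os


def detach_attachments(raw_message: bytes, attach_dir: str) -> bytes:
    """Save every named attachment to attach_dir and return a slimmed message."""
    msg = email.message_from_bytes(raw_message)
    os.makedirs(attach_dir, exist_ok=True)
    for part in list(msg.walk()):
        filename = part.get_filename()
        if part.is_multipart() or not filename:
            continue  # containers and ordinary body text stay as they are
        # basename() keeps a hostile file name from escaping attach_dir
        safe_name = os.path.basename(filename) or "attachment"
        target = os.path.join(attach_dir, safe_name)
        with open(target, "wb") as out:
            out.write(part.get_payload(decode=True) or b"")
        # Turn the part into a short plain-text pointer to the saved file.
        for header in ("Content-Type", "Content-Transfer-Encoding",
                       "Content-Disposition"):
            del part[header]
        part["Content-Type"] = 'text/plain; charset="us-ascii"'
        part.set_payload(f'Attachment Converted: "{target}"\n')
    return msg.as_bytes()
```

As later comments in this thread point out, rewriting parts like this breaks PGP/MIME and S/MIME signature checks and ties the mailbox to the attachment directory, so it is a space-for-portability trade.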

  • Compression. by TellarHK (Score:2) Tuesday May 28 2002, @10:50PM
  • Well.... by BJH (Score:2) Tuesday May 28 2002, @10:51PM
  • About 1.4 seconds? by ryochiji (Score:1) Tuesday May 28 2002, @10:52PM
  • SQL by Mowog (Score:1) Tuesday May 28 2002, @10:53PM
  • reiserfs+Maildir by {X-Frog} (Score:1) Tuesday May 28 2002, @10:55PM
  • one file per message (Score:3, Insightful)

    by g4dget (579145) on Tuesday May 28 2002, @10:56PM (#3600018)
    One file per mail message is the right thing to do. That lets you use standard UNIX tools for manipulating mail and it gives you convenient locking semantics. And the hierarchical UNIX file system structure, together with links, matches mail semantics nearly perfectly.

    Of course, with traditional UNIX file systems, this is a bit slow. The thing to do is to fix the file system, not to kludge ever more complex mail formats on top of it. ReiserFS goes much of the way; we now also need some system calls to open and read multiple files with a single call.

    Until file systems catch up, one kludge is as good as another. UNIX mbox format is at least simple, so I stick with that.
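As an illustration of how links map onto mail semantics, filing one message under a second folder needs no copy at all, just a second name for the same inode. A small sketch; the function name and directory layout are assumptions, and hard links only work within a single filesystem.

```python
import os


def file_under(message_path: str, folder_dir: str) -> str:
    """Make the same message appear in another folder without copying it."""
    os.makedirs(folder_dir, exist_ok=True)
    link_path = os.path.join(folder_dir, os.path.basename(message_path))
    os.link(message_path, link_path)  # hard link: same inode, second name
    return link_path
```

Deleting the message from one folder is then just an unlink; the data disappears only when the last name is gone.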

  • Has anyone ever tried the XML approach? by jaaron (Score:1) Tuesday May 28 2002, @10:56PM
  • We need an XML standard to move mail around by astrashe (Score:2) Tuesday May 28 2002, @10:57PM
  • Maildirs (Score:4, Informative)

    by mrsam (12205) on Tuesday May 28 2002, @10:57PM (#3600026) Homepage
    Maildir : Do you really want to clutter your system with millions of small files? That's waste of inodes, space (unless perhaps you use Linux/ReiserFS or SGi) ...

    In case you haven't noticed, the default setting for the Linux ext[23] filesystems is to allocate one inode per 4096 or 8192 bytes of disk space. Which happens to be pretty much the size of an average E-mail message. So, in other words, you are unlikely to run out of inodes before you run out of disk space, since both are going to be used up pretty much at the same clip.

    It may come as a shocking surprise to some, but the average large filesystem is just littered with small files here, and small files there, all over the place. Here's my workstation -- a fairly large box with all sorts of crap loaded:

    Filesystem 1k-blocks ...
    /dev/sdb5 8159388 ...

    Filesystem Inodes ...
    /dev/sdb5 1036288 ...

    I'm using up almost exactly 8192 bytes per inode.
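For reference, the two figures that survive in the truncated output above are the filesystem totals (1k-blocks and inodes), so they give the ratio the filesystem was provisioned with; the "used" columns are elided and cannot be recomputed here. The arithmetic, with a made-up helper name:

```python
def bytes_per_inode(total_1k_blocks: int, total_inodes: int) -> float:
    """Disk space per inode, given df-style totals."""
    return total_1k_blocks * 1024 / total_inodes


# Totals quoted above for /dev/sdb5
print(round(bytes_per_inode(8159388, 1036288)))  # prints 8063
```

That is just under the 8192-byte ratio mentioned earlier in the comment.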

    and just try to open a Maildir with 1000+ mails and see how long it takes your favorite Mailprogram to only display the subjects.

    How about instantly? Most GUI E-mail clients cache mail headers, so they don't have to go and wait for the server to reply each time you click on the folder index window to re-sort, or scroll the folder index.

    ...

    Some ideas about the ideal mail-storage:
    * One file per Mailbox-folder, allowing multiple folders per user.


    Using one file per folder essentially forces you to use some form of locking each time folder access is necessary. Locking of any sort has been problematic for years whenever NFS (or pretty much any other network filesystem) is involved. A single failed circuit will now take out your entire network spool, as all clients are now spinning on lock requests to the unreachable server.
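For contrast, here is roughly how a Maildir-style delivery avoids locks altogether: write the message under a unique name in tmp/, then move it into new/ in one step, so readers never see a half-written file. This is a simplified sketch; the function name and the exact unique-name scheme are assumptions, and classic implementations use link() followed by unlink() rather than rename().

```python
import itertools
import os
import socket
import time

_sequence = itertools.count()  # distinguishes deliveries made by this process


def maildir_deliver(maildir: str, message: bytes) -> str:
    """Deliver one message into a Maildir without taking any lock."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    # "/" and ":" have special meaning in Maildir file names, so escape them.
    host = socket.gethostname().replace("/", "\\057").replace(":", "\\072")
    unique = f"{time.time_ns()}.{os.getpid()}_{next(_sequence)}.{host}"
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "xb") as f:  # "x" refuses to overwrite an existing file
        f.write(message)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before publishing
    os.rename(tmp_path, new_path)  # a single step within one filesystem
    return new_path
```

Because each delivery touches only its own file, a stalled client or server cannot block anyone else's delivery the way a stuck lock on a shared folder file can.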

    Compression: Should messages be broken into pieces and the MIME-attachments stored separately (thus searching of the text parts would still be possible without decompressing the whole file)?

    I thought you wanted to save everything in a single file per folder, and using multiple files for messages is supposed to waste inodes, remember?

    File format: gdbm, Sleepycat db? Something new?

    Ask an Exchange admin about the joys of a corrupted Exchange database. If mail is stored in simple, plain files, a single instance of corruption will affect at most one mailbox, instead of taking out the entire monolithic database.

    Unicode support in folder names? Imap message-IDs, flags, useragent specific state-information?

    IMAP already uses Unicode to encode folder names. Not sure what "useragent specific state-information" means...
  • mbx by zsmooth (Score:2) Tuesday May 28 2002, @11:01PM
  • Oh great (Score:3, Funny)

    by elefantstn (195873) on Tuesday May 28 2002, @11:03PM (#3600046)
    If you were to design your own MUA, how would you design its mail storage?


    Now I'll never get to sleep tonight.
  • BeOS by mlk (Score:1) Tuesday May 28 2002, @11:07PM
  • Compression is nice, but... by pHDNgell (Score:1) Tuesday May 28 2002, @11:10PM
  • Spam Assassin!!! by swimfastom (Score:1) Tuesday May 28 2002, @11:11PM
  • Maildir with thousands of emails = fast by Sosarian (Score:1) Tuesday May 28 2002, @11:12PM
  • The only real answer by quantaman (Score:1) Tuesday May 28 2002, @11:13PM
  • Oracle's Solution by hspatel (Score:1) Tuesday May 28 2002, @11:18PM
  • Maildir access time by SealBeater (Score:1) Tuesday May 28 2002, @11:26PM
  • Citadel uses Berkeley DB by IGnatius T Foobar (Score:2) Tuesday May 28 2002, @11:36PM
  • don't forget (Score:3, Funny)

    by Hollins (83264) on Tuesday May 28 2002, @11:40PM (#3600175) Homepage
    Your newfangled system better have DRM built in. I don't have any data, but it must be obvious that artists are losing billions in revenues every week due to mp3s being sent as attachments. This criminal behavior must be stopped or the practice of free expression will come to a screeching halt.
  • We have to have the remote hammer to pop out of the monitor to whack the end user. This is a must for any admin that works with more than 300 people. Hammer trigger from e-mail, pager, SMS, or telephone number.

    Power mains must be connected to the user's chair, see above for trigger.

    The MUA must forward all p0rn to the admin account. Likewise with credit card info.

    The MUA must know when the user is about to do something to tick off the admin, like sending a "me too" to everyone in the office, or replying to a confidential e-mail to the whole office, and prevent the user from reproducing. X-rays are fine for impromptu sterilizations. The side effect of losing all your body hair is no problem, as it alerts others to a stupid co-worker.

    The MUA must alert the admin when a coworker he has got the hots for changes her home phone number. Just to be fair, if the admin is female, the reverse applies.

    The MUA must analyze the admin's e-mail and throw a bucket of cold water if (s)he attempts to send a really stupid e-mail.

    Also, the MUA must be able to launch nuclear missiles at spammers automatically. After that, it should refer the e-mail to the admin to see if a stronger response is warranted. Better yet, the MUA should employ a time machine to go back and choke the spamming creep when the spammer is still a baby, then use X-rays on the parents as above.

    The MUA should have a hypnotic effect on the object of the Admin's desire and cause that person to perform disgusting oral acts on the Admin's body each time a new e-mail arrives. (HOORAY FOR KELZ!)

    For the PHB, he should (by the same hypnotic effect) do a "Full Monty" when the big cheese walks in. Twice.

    The MUA should be able to cause back dated confirmation messages from HR approving a 51 week paid vacation upon pressing a special key combination, unless it's the PHB pressing the keys, then it should cause an e-mail to be sent to HR from the PHB's account turning in notice.

    Sorry, if you had a day like mine, you'd need a laugh about now...

  • Maildir? by Dionysus (Score:2) Tuesday May 28 2002, @11:41PM
    • Re:Maildir? by Metatron (Score:1) Wednesday May 29 2002, @07:01AM
  • Databases Ptewey. (Score:3, Insightful)

    by Jason Pollock (45537) on Tuesday May 28 2002, @11:46PM (#3600196) Homepage

    The problem with using database formats is that you can't access them with vi. How many times has your mail client crashed attempting to read an email, but you still _need_ to get access to it? If it's in a database (proprietary or not), you're up the creek. If it's stored in a flat file, you at least have the option of using vi/emacs/grep to find and read the email, and then excise it.

    This has happened to me in Netscape, Kmail, Outlook, Evolution, Eudora, etc. Every single one has had problems at one point or another. The best programs are the ones that are _truly_ open, and let you get at the mail from other directions.

    Don't doubt the power of the text utilities in Unix. :)

    Jason Pollock

  • Perfect file format by kinthalas (Score:1) Tuesday May 28 2002, @11:46PM
  • Unencoded Attachments. by sPaKr (Score:1) Tuesday May 28 2002, @11:52PM
  • The problem is with pack-rats... by doomdog (Score:1) Tuesday May 28 2002, @11:54PM
  • Why not improve mail, period? by mveloso (Score:1) Tuesday May 28 2002, @11:55PM
  • S.O. by mirko (Score:1) Wednesday May 29 2002, @12:05AM
    • Re:S.O. by NotZed (Score:1) Wednesday May 29 2002, @01:20AM
  • The trouble with e-mail by os2fan (Score:1) Wednesday May 29 2002, @12:10AM
  • Experiences with Mail Storage by Da_man (Score:1) Wednesday May 29 2002, @12:25AM
  • How about usenet? by thogard (Score:2) Wednesday May 29 2002, @12:26AM
  • I vote for plain mbox by Trevin (Score:1) Wednesday May 29 2002, @12:26AM
  • Qmail by babyruth (Score:1) Wednesday May 29 2002, @12:28AM
  • Don't speculate. Profile. (Score:5, Insightful)

    by Doktor Memory (237313) on Wednesday May 29 2002, @12:37AM (#3600378) Journal
    Maildir: Do you really want to clutter your system with millions of small files? That's waste of inodes, space (unless perhaps you use Linux/ReiserFS or SGi)
    Psssst. It's not 1978 any more. Inodes are cheap. So is disk space. Stop spreading FUD.
    and just try to open a Maildir with 1000+ mails and see how long it takes your favorite Mailprogram to only display the subjects.
    Quite right. Just try it. You might be a bit surprised by the results. [courier-mta.org]
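In the spirit of measuring rather than guessing, here is a small harness that builds a throwaway Maildir and times how long it takes to list every subject, using Python's standard `mailbox` module. The function names and message sizes are arbitrary; the numbers will depend heavily on the filesystem and on whether the files are already cached, and this version parses whole messages, so a client that reads only headers or caches them should do better.

```python
import mailbox
import time


def make_maildir(path, count):
    """Create a Maildir at path (which must not exist yet) with `count` messages."""
    box = mailbox.Maildir(path, create=True)
    for i in range(count):
        msg = mailbox.MaildirMessage()
        msg["From"] = "sender@example.com"
        msg["Subject"] = f"message {i}"
        msg.set_payload("x" * 2000)  # roughly a short text-only mail
        box.add(msg)
    box.close()


def list_subjects(path):
    """Open every message in the Maildir and collect its Subject header."""
    box = mailbox.Maildir(path, create=False)
    return [msg["Subject"] for msg in box]


def time_listing(path):
    """Return (number of messages, seconds taken to list their subjects)."""
    start = time.perf_counter()
    subjects = list_subjects(path)
    return len(subjects), time.perf_counter() - start
```

Calling `make_maildir(path, 1000)` on a fresh path and then `time_listing(path)` gives one data point for the 1000+ message case raised in the original question.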
  • by fejjie (192392) on Wednesday May 29 2002, @12:44AM (#3600389) Homepage
    A couple things:

    1. Evolution is NOT "Basically mbox with database features". It can use Maildir or MH as the backend (and you can write your own plugin to extend this if you like).

    2. Evolution's body indexing and summary files are extremely fast and efficient, about the best you'll get. I hear MySQL has text indexing capabilities that are extremely fast, but I'm not sure if they are faster than Evolution's indexer or not. Might be interesting to check this out.

    3.

    > But the thing that bugs me most is disk space. Typical inboxes are
    > made of 5% to 10% of Text including Headers and HTML. The rest are
    > BASE64- (or UU-) encoded pictures, word documents, zip archives and so
    > on. The problem here is the encoding which wastes considerable amounts
    > of space (at least one third).

    It's theoretically possible, if you wrote your own Evolution storage plugin, to change the Content-Transfer-Encoding header value of binary attachments to "binary" (and text attachments to "8bit") before writing the message out to disk (or wherever) thus magically making it so that you no longer save the encoded text of the attachments but rather in-line binary data content. (Yes, it's as easy as setting an enum value in the CamelMimePart structure).

    However, you have to be aware of the consequences of this. Most importantly, you will not be able to validate any of your PGP/MIME or S/MIME signed messages, because according to the RFCs for these types the signed MIME parts MUST be treated as opaque (meaning that you may not modify them in any way).

    Now on to your ideas...

    > One file per Mailbox-folder, allowing multiple folders per user.
    > Should those files reside in one central location or in users
    > Homedirs?

    How is this different from mbox? (btw, CVS Evolution can handle mbox files and directory trees in external locations, i.e., not within the ~/evolution directory).

    > Compression: Should messages be broken into pieces and the
    > MIME-attachments stored separately (thus searching of the text parts
    > would still be possible without decompressing the whole file)?

    If you break apart the MIME parts, you run into the same problem I described above about not being able to verify signatures.

    However... if you took a normal mbox and gzipped it, you would certainly save space (at the expense of speed). I've been thinking about writing a CamelMimeFilterGzip class for gzip compressing/decompressing streams which would allow Evolution to read and write to gzipped mbox files for example.

    Once the class is written (which should be fairly simple), allowing Evolution to read gzipped mboxes should be as simple as doing:

    camel_stream_filter_add (MboxStream, GzipFilter);

    ...before feeding 'MboxStream' to the MIME parser.
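The same idea outside of Camel, as a rough Python sketch: decompress the mbox as a stream and hand each message to the MIME parser as it goes by. This is not Evolution code; the function names are invented, and it assumes that body lines beginning with "From " have already been escaped (as ">From "), which is the usual mbox convention.

```python
import email
import gzip
import shutil


def compress_mbox(src_path, dst_path):
    """Write a gzip-compressed copy of an existing mbox file."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def iter_gzipped_mbox(path):
    """Yield parsed messages from a gzipped mbox, reading it line by line."""
    current = None  # lines of the message being collected; None before the first
    with gzip.open(path, "rb") as stream:
        for line in stream:
            if line.startswith(b"From "):  # envelope line starts a new message
                if current is not None:
                    yield email.message_from_bytes(b"".join(current))
                current = []
            elif current is not None:
                current.append(line)
    if current is not None:
        yield email.message_from_bytes(b"".join(current))
```

Reading works as a pure stream, but appending or deleting means recompressing, which is where the speed cost mentioned above comes from.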

    > File format: gdbm, Sleepycat db? Something new?

    Please not Sleepycat. If you are so sure that a generic database backend will be better than what Evolution's got, at least have the sense to use MySQL or PostgreSQL.

    I'm personally against using a generic database as a storage and here's why:

    1. The average user does not have an SQL database installed on their desktop systems, and so this is a completely ridiculous dependency for them. If you think library dependencies are bad, just wait till you have to go installing, configuring, and maintaining a multi-user database running on your system. This may be fine for a company solution, but not the average end-user.

    2. I'm not too familiar with MySQL or PostgreSQL, but I recall there being problems with mailers that use SQL database backends that tried to store the content of the messages as part of the table (due to them making the size of the table too small or whatever). If you can set the size to be "infinite", then I guess that's not a problem.

    If your plan is just to have the database index the folder and actually store the contents as separate files, then you've instantly gained nothing over Maildir except that now you have a hefty database that you have to maintain and very little to no speed improvements (especially if you have a well designed/implemented summary index like Evolution does).

    The only improvements you might gain here is body indexing? As I said earlier, MySQL supposedly has a REALLY good text indexer and so it might be a little faster than Evolution's. I'm really not sure on the comparison here.

    > Should the security model allow users to directly access their
    > files, grep them, copy them around?

    Is there a reason NOT to? I don't see one. It's their mail.

    > Shared folders, virtual domains?

    This doesn't really have anything to do with folder formats and everything to do with features of the client itself.

    (Evolution can do this).

    > Unicode support in folder names? Imap message-IDs, flags, useragent
    > specific state-information?

    Unicode support in folder names I'd say is a pretty important feature. I'm not sure what you mean by "Imap message-IDs". Do you mean UIDs? Evolution, for example, has a UID assigned for each message whether it be in an mbox folder, Maildir folder, MH folder, or IMAP folder. So this isn't necessarily dependent on folder format (though it could be if you used a database backend for example, you might want a UID in the table).

    I don't feel that UIDs are a must though, but I would suggest them. They are definitely useful especially for folders that can be accessed by multiple clients at once.

    Flags are good. I'd go so far as to say a MUST have.

    As far as user-agent specific state-information, it'd be nice to not need it. But if the client needs to keep its own info, it'd be nice to be able to map the info to UIDs and keep its own state file somewhere else (not necessarily alongside the mail storage).

    For example, IMAP doesn't have any means for the client to store state information on it, but that's perfectly fine. If a client chooses to have its own state, then it can save it locally.

    It would be nice if the storage could handle user-defined flags/tags though. This would allow the client to extend the native features of the format (Flag-for-Followup, message colouring, etc).

    > How would MTAs deliver mail? How would clients access? File-locking
    > (NFS)?

    This is one reason to just stick with what's available :-)

    File locking is a MUST have (or a scheme to make it not needed, such as Maildir).

    --
    You know, I have one simple request...and that is to have messages with freakin' laser beams attached to their headers. Now evidently my MIME specification informs me that that can't be done. Uh, can you remind me what I pay you people for? Honestly, throw me a bone here. What do we have?
  • DB by jukal (Score:2) Wednesday May 29 2002, @12:45AM
  • just... by ComaVN (Score:1) Wednesday May 29 2002, @01:02AM
  • Usenet-style, with overview database. (Score:4, Interesting)

    by strredwolf (532) on Wednesday May 29 2002, @01:03AM (#3600443) Homepage Journal
    Plain and simple. Switch from mail to Usenet. Maildir-like structure, but with a .overview (XOVER) file to help out with indexing.

    Storage is another problem, though... but Usenet messages can be sidetracked a bit with the encoding.
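A sketch of what such an overview file might look like for a folder that stores one message per numbered file, news-spool style. The eight tab-separated fields follow the usual XOVER order (number, Subject, From, Date, Message-ID, References, byte count, line count); the ".overview" file name comes from the comment above, while the function names and the numbered-file layout are assumptions.

```python
import email
import os


def overview_line(number, raw):
    """Build one tab-separated overview record for a raw message."""
    msg = email.message_from_bytes(raw)

    def field(name):
        # Collapse tabs and line breaks so they cannot break the record format.
        return " ".join(str(msg.get(name, "")).split())

    body = raw.split(b"\n\n", 1)[1] if b"\n\n" in raw else b""
    return "\t".join([
        str(number), field("Subject"), field("From"), field("Date"),
        field("Message-ID"), field("References"),
        str(len(raw)), str(body.count(b"\n")),
    ])


def write_overview(folder):
    """Index every numerically named message file in folder into .overview."""
    entries = sorted((int(name), name) for name in os.listdir(folder)
                     if name.isdigit())
    with open(os.path.join(folder, ".overview"), "w", encoding="utf-8") as out:
        for number, name in entries:
            with open(os.path.join(folder, name), "rb") as f:
                out.write(overview_line(number, f.read()) + "\n")
    return len(entries)
```

A client can then show a subject listing by reading one small file instead of opening every message.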
  • What's Right with Maildirs by Ekman (Score:1) Wednesday May 29 2002, @01:30AM
  • Exchange, Notes, GroupWise, etc... by deviator (Score:1) Wednesday May 29 2002, @01:36AM
  • Problems & Ideas (Score:4, Informative)

    by Twylite (234238) <twylite.crypt@co@za> on Wednesday May 29 2002, @01:53AM (#3600587) Homepage

    Oh dear, another file format debate. I'm glad there was a library suggestion though ... that allows us to change our mind when we do it wrong the first time ;)

    First, you need to consider the possibility of moving the mailbox. To a different computer, or a different platform. This means it must be easy to access in any environment, and the tools must be portable.

    This doesn't completely rule out a database solution (like MySQL), but it certainly makes it less-than-ideal.

    Second, having used many mailers which separate out attachments ... Please Don't Do It! You can't easily move your mailbox, because there are a host of associated attachment files. There is ALWAYS a synchronisation issue between attachments and messages, so you end up scanning and cleaning out the attachment folder every so often to prevent dead files from accumulating.

    Compression is nifty, but isn't really important. Disk space is seldom a concern these days, and the really big stuff (binaries) is often already compressed or doesn't compress well.

    The real issue with most mailbox formats is how you deal with the problem of removing dead space from the mailbox. Some programs just leave it there until you hit "compact", which is wasteful and confuses users. Others rewrite the entire mailbox every time, which causes the software to "hang" for a while on shutdown.

    The best suggestion I can come up with off the top of my head is this: One file per mailbox folder, and that file is its own filesystem. The "root node" contains a group of summaries (from, to, subject, date, etc) and node links. Other nodes are chained to contain the message and attachments.

    Handling attachments: attachments are separated out and stored as binary in the mailbox. This conserves space but keeps the attachment with the message.

    Compacting: is avoided. When a mail is deleted, it is merely flagged in the root node (index). So each mailbox has its own deleted items folder, so to speak. When the deleted items folder is emptied, the index is rewritten and nodes freed - every freed node not at the end of the file is overwritten with a node from the end of the file (and appropriate reindexing done), so the file is automatically compacted.

    Ideally the file needs some sort of transaction logging area to ensure its integrity at all times.
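
    A rough sketch of that compaction step, assuming fixed-size nodes and modelling the file as a Python list of slots. The function name `compact` and the `moves` map (old slot to new slot, used for the reindexing) are illustrative, not part of any existing mail format:

```python
def compact(nodes, live):
    """Fill holes left by freed nodes with live nodes taken from the end.

    nodes: list of node payloads, one entry per fixed-size slot in the file.
    live:  set of slot numbers still referenced by the index.
    Returns (compacted_nodes, moves) where moves maps old slot -> new slot.
    """
    nodes = list(nodes)
    live = set(live)
    moves = {}
    hole = 0
    tail = len(nodes) - 1
    while True:
        # Advance to the first free slot from the front.
        while hole <= tail and hole in live:
            hole += 1
        # Retreat to the last live slot from the back.
        while tail >= hole and tail not in live:
            tail -= 1
        if hole >= tail:
            break
        # Move the tail node into the hole and remember the relocation.
        nodes[hole] = nodes[tail]
        moves[tail] = hole
        live.discard(tail)
        live.add(hole)
    # All live nodes now sit at the front; the rest can be truncated.
    return nodes[:len(live)], moves


compacted, moves = compact(["a", "b", "c", "d", "e"], {0, 2, 4})
```

    Only the nodes that fill a hole are rewritten, so the cost is proportional to the amount of deleted data rather than to the size of the mailbox.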

    Shared access to files is best handled through a library or a service. File locking is notoriously prone to bugs and security issues, and avoiding multiple implementations in different mail clients would be beneficial.

  • /var/spool/mail/martin is my friend (Score:3, Informative)

    by martinflack (107386) on Wednesday May 29 2002, @02:00AM (#3600602)
    For Pete's sake, leave mail alone. If I can't fix it in less than 20 minutes with grep and perl, I don't want to know about it.

    Divide mail into 20-30 logical "folders" (files), use procmail to help sort/scan/unspam, do IMAP to get to it from Win machines, archive mail out of your working files once it gets a year old, and you're all set. Strive to keep your inbox empty (you need a proper "action" orientation with your mail folders to accommodate this). No big deal.
  • Databases and attachments by RPG Advocate (Score:1) Wednesday May 29 2002, @02:31AM
  • Use the filesystem as an interface. by imeller (Score:1) Wednesday May 29 2002, @02:53AM
  • Mail is a protocol! by Kynde (Score:2) Wednesday May 29 2002, @02:54AM
  • Use a real database by znerd (Score:1) Wednesday May 29 2002, @03:03AM
    • 1 reply beneath your current threshold.
  • Disk is cheap by oddityfds (Score:1) Wednesday May 29 2002, @03:10AM
  • by NotZed (19455) on Wednesday May 29 2002, @03:12AM (#3600752) Homepage
    I lost the plot half way through this, but here's some food for thought anyway. Now I should get back to work ...

    Z

    I think that this is looking for the solution to a problem that doesn't really exist in the first place. Although I guess it depends somewhat on what you define as 'Unix mail'.

    I'm a developer on Evolution, and primarily on Camel, evolution's email library. I'm not sure i'd rave about it (although I think Camel is a mostly beautiful piece of code ;), but it works reasonably well, and we've had a chance to try and deal with users with lots of email.

    What IS 'Unix mail'?

    I would define Unix mail as mail (rfc822 format) downloaded and stored locally on a per-user basis. IMAP, Exchange, and other remote protocols are very different beasts.

    Why are DBMS's not suitable for 'Unix mail'?

    Once you have a remote server you have to do things differently than if you have local access. Using a DBMS, and having a trained administrator to manage it are practical considerations, as are the benefits you might get from this configuration. These solutions don't really make sense for standalone users. They shouldn't need to install and manage databases, complex backup procedures, and so forth, just to read their email.

    i.e. rdbms's are:
    hard to setup
    hard to maintain
    another major point of failure

    If however, I was to design a multi-user groupware server, then a DBMS would come into serious consideration - at the backend at least. It allows you to do things like easily consolidate authentication outside of the operating system (the idea of having a 'shell account' to access mail is somewhat outdated), it allows you to save space by storing common data, like attachments and email content in a single place, and redirecting it to multiple recipients (which is a common practice within organisations). It may be practical to use a mixture, a RDBMS to store textual parts or indices to data stored in a more conventional filesystem.

    But even with a RDBMS backend, I would personally probably still stick to IMAP to serve it to actual clients. The IMAP protocol is a bit heavy, but not really that bad, and it serves email, I dont think there's really any need to reinvent the wheel here.

    So ...

    If you define unix mail as I have, and separate it from a *mail server*, then you rule out full blown RDBMS's, and are left with:

    single file database
    multiple file database

    I'm not even going to mention XML because I think it is the single most stupid idea anyone's come up with. It is completely unsuitable for this purpose.

    And well, there's really no reason not to use MIME to store the messages. MIME already does everything you can possibly do with email (since, uh, it is how the email *will* be sent), any client will already have to deal with it, and mime decoding is for the most part really quite simple and fast anyway. Translating the mime format into some other storage format really doesn't make sense.

    single file databases

    mbox

    Mbox is a single file database. It's just that everyone that uses it generally writes their own access code. This is where problems with 'locking' come about, either because the underlying filesystem doesn't support it properly (e.g. some nfs implementations), or because everyone's clients don't use the same locking mechanism. This is really just an implementation issue anyway. There would be nothing to stop someone writing a common 'mbox.db' library that stored everything in completely compatible mbox files, which took all the work out of it, and then you'd have an mbox DBMS ...

    mbox scales ok, without any caching of header information it handles in the order of 2K messages in an interactive timescale, and quite a lot more if you dont mind some short delays (i.e. in the order of the time it takes mozilla to start up).

    Appending and reading is quick, and reliable - assuming the filesystem works, which is a pretty safe assumption to make. This is assuming the mailbox is first summarised at first opening, otherwise looking up messages can be slow, because you have to scan the whole file first.
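
    The "summarise at first opening" step can be sketched as a single pass that records where each message starts. This is a minimal illustration, assuming the common convention that a message begins at a "From " line that is either the first line or follows a blank line; `scan_mbox` is a made-up name and this is not Camel's code:

```python
import io


def scan_mbox(stream):
    """Return the byte offset of each message in an mbox-style stream."""
    offsets = []
    pos = 0
    prev_blank = True  # the start of the file counts as a boundary
    for line in stream:
        if prev_blank and line.startswith(b"From "):
            offsets.append(pos)
        prev_blank = line in (b"\n", b"\r\n")
        pos += len(line)
    return offsets


sample = (
    b"From alice Tue May 28 22:17:00 2002\nSubject: one\n\nbody\n\n"
    b"From bob Wed May 29 01:03:00 2002\nSubject: two\n\nbody\n"
)
index = scan_mbox(io.BytesIO(sample))
```

    With the offsets in hand, reading message N is a seek plus a bounded read, which is why lookups are only slow before the first scan.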

    The only operation that is slow is expunging messages, and at worst case isn't really any slower than copying a whole file across to another file.

    The only other issue is agreement on the 'standard' for what constitutes an mbox file. For example, Solaris uses and honours the 'Content-Length' header, and thus it does not translate any lines beginning with "From " into the conventional ">From ". Some mail clients translate "(>*)From " into ">\1From " (using sed syntax) and vice versa, others do not. There is no standard, just some conventions, some of which aren't easy to determine either.
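
    For illustration, the reversible quoting convention mentioned above (add one ">" on write, strip one on read) looks roughly like this in Python. The function names are made up for the example, and the functions are meant to be applied to a message body only, not to the "From " separator line itself:

```python
import re

# A line that starts with zero or more ">" followed by "From ".
_QUOTE = re.compile(rb"^(>*From )", re.MULTILINE)
_UNQUOTE = re.compile(rb"^>(>*From )", re.MULTILINE)


def quote_from(body):
    """Prefix one '>' to every line matching ^>*From (used when writing)."""
    return _QUOTE.sub(rb">\1", body)


def unquote_from(body):
    """Strip one '>' from every line matching ^>+From (used when reading)."""
    return _UNQUOTE.sub(rb"\1", body)


original = b"From here on\n>From a quoted reply\nFrom"
stored = quote_from(original)
```

    Clients that only quote bare "From " lines cannot undo the quoting without also mangling lines that genuinely began with ">From ", which is one reason the variants do not interoperate cleanly.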

    Because you need to keep the whole index in memory at once, this can become expensive, but you could use a secondary database as an index into the real file. But eventually you hit a point where the cost of expunging does get too expensive. You could just archive the mail regularly, or use a format like maildir instead.

    gdbm/db/etc

    db files wrap the single file in a common api that handles all of the locking issues and access issues for you. Some have different features, e.g. querying capability, logging and transactions, etc.

    We've never tried to use db for this purpose, more just because we didn't think it was worth it. All you really get with a minimal implementation is the ability to store and retrieve a blob of data using a single key. Writing is fairly slow because the database has to manage more details for you (locking, allocating blocks, unlocking, etc). You could use multiple db files as indices to perform multiple-key searches, but they are quite slow at creating them (we tried using db for the content indices and it was way too slow).

    i.e. even if you store the data in a db file, which gives you a slight benefit of inbuilt referential integrity, you still need to provide additional indices to actually be able to use it in any useful way. Evolution suffers this problem with the addressbook which stores vCards in db records.

    Most db libraries (all?) also dont provide any mechanism to stream data. You either get the whole lot into memory, or you get none of it. So for large messages you're limited by memory (well, evolution is anyway, but it doesn't have to be). Yes, memory is cheap, but it is still a consideration, and it would certainly rule out a simple database in a multi-user environment.

    db files are also slower than native files, especially for large objects. You're mapping an arbitrarily sized chunk of data to some 'database blocks', which are then stored in an arbitrarily sized 'database file' which the operating system is then mapping to its 'filesystem blocks'.

    multifile solutions

    Well I guess this comes down to mh and maildir. mh isn't really suitable for anything, because of its just plain bad design and lack of defined semantics. There's no way to guarantee anything about its operation.

    maildir - i like. It moves the scourge of trying to implement a reliable, scalable, multiple access database almost entirely into the operating system layer. Operating systems already do this very well - they manage hundreds of thousands of files randomly written across your disks, without skipping a beat.

    No operation requires more than a single message size of data, and the operating system already indexes the message, via its filename. Sure, ext2 doesn't do such a swell job with long directories, but that can be addressed (and the same problem can be addressed on just about any platform). For 'free' you get concurrent multiple-reader, multiple-writer database access, without any of the considerable problems you have to solve to implement it otherwise.

    The maildir 'protocol' is simple, reliable, and it works.
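
    A simplified sketch of maildir delivery, to show why no locking is needed: the message is written under tmp/ and only appears in new/ through an atomic rename. The unique-name scheme here (time, pid, per-process counter, hostname) is a cut-down assumption, not the full naming convention, and `deliver` is an illustrative name:

```python
import itertools
import os
import socket
import tempfile
import time

_counter = itertools.count()


def deliver(maildir, message_bytes):
    """Write one message into tmp/, then rename it into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    name = f"{int(time.time())}.P{os.getpid()}Q{next(_counter)}.{host}"
    tmp_path = os.path.join(maildir, "tmp", name)
    # "x" mode fails if the name already exists, so a clash is never silent.
    with open(tmp_path, "xb") as f:
        f.write(message_bytes)
        f.flush()
        os.fsync(f.fileno())
    new_path = os.path.join(maildir, "new", name)
    # rename() within one filesystem is atomic: readers see all or nothing.
    os.rename(tmp_path, new_path)
    return new_path


demo_dir = tempfile.mkdtemp()
delivered = deliver(demo_dir, b"Subject: hi\n\nhello\n")
```

    A reader scanning new/ never sees a half-written message, and two deliveries never touch the same file.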

    Again, it can easily be augmented by a client with additional indices, but for things like delivery agents who dont care about existing email, they dont need to suffer that overhead at all.

    Some other comments specific to the question:

    Compression. Personally I dont see the point. But a maildir-like structure would fit well with compression. Flat files would be the worst (e.g. mbox), and block-file formats (like db files) would also work well with compression. The good thing about email is it is 'write once', you don't edit or change the messages in the mailbox.

    External attachments. I guess its possible, but again, it isn't really worth it in most cases. Parsing MIME is *fast*. It is much faster than parsing xml, and besides, people rarely look at an email more than once or twice. There isn't much use going off and storing the attachment in a high-performance reading format if it isn't going to be accessed often, and it just places a greater burden on your server.

    base64, etc. Well, its entirely possible simply to store the messages as 'binary' format. Assuming the boundary markers are checked properly, Camel can work with binary encoded mail messages, and probably at least some other mail clients can too. There are some problems with some of the extremely broken openpgp/pgp/mime specs which suddenly say that mail transports aren't allowed to alter the *transport* encodings of some parts, but well, these specs are just braindead, and can be worked around.
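
    To put a number on what storing parts as binary would save, here is a quick measurement using Python's standard base64 module. Random bytes stand in for an attachment; the size is arbitrary:

```python
import base64
import os

raw = os.urandom(300_000)            # stand-in for a binary attachment
encoded = base64.encodebytes(raw)    # MIME-style output, wrapped at 76 chars
overhead = len(encoded) / len(raw) - 1
print(f"raw={len(raw)} encoded={len(encoded)} overhead={overhead:.1%}")
```

    Base64 maps every 3 input bytes to 4 output characters, and the line breaks add a little more, so the encoded form is a bit over a third larger than the binary it carries.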

    Security model. Well, talking about Unix mail, not server mail, the filesystem is adequate.

    Shared folders - is not an issue for unix mail.

    Unicode. Well you can write unicode filenames to most unix filesystems, even if 'ls' doesn't show it right.

    MTA. Nothing could be simpler or safer than maildir as a delivery format. The mta doesn't have to care about any client-side indices, the mua will simply update them when it incorporates the new messages, etc.

    Writing libmailstore? Mate, its called Camel, and its already written. Camel already does mbox, maildir, mh, it can read spool files directly (it doesn't create a summary file or build any indexes), it can talk imap and pop, and it has partial support for nntp. If someone gave me a decent RDBMS table schema and a carton of pale, I could probably write a MySQL backend in a couple of days, well, assuming the MySQL api is mt-safe.

    Finally, some comments on evolution.

    Evolution isn't reinventing any wheel. We use standard mbox format (if such a thing really exists anyway). We use standard maildir format, etc. Yes we may optionally create body indices, and we do usually create on-disk binary/compressed 'summaries' of the data, but these are really just on-disk caches of in-memory data structures, rather than anything to do with the mail storage format.

    We put mail in another location, but everyone else has done that too, elm:Mail, pine:mail (or is it the other way around?), netscape:ns_mail, etc. At least we now offer the option to read most of this 'in place'.

    The main problems evolution has with scalability are:

    indexing.

    Indexing is quite costly. The original index code was written somewhat like a database, it handled all internal data structures, used blocks of data, etc. It was slow, it scaled poorly. Definitely some of the algorithm choices and the implementation weren't that hot, but it shows that such a solution isn't as simple as at first thought. Using libdb was impossibly slow (like several orders of magnitude slower).

    The new stuff is a lot better, but can still use a lot of resources while indexing, and copies the whole file (well 2 files) across when performing expunges, but they are only performed occasionally, and the indices are smaller than the original indices, so in practice it scales much much better.

    the summaries

    The summaries are indices of a sort anyway. They are an in-memory tree of a subset of the information on each message. Enough information to display a list of messages, and perform vfoldering operations. Even though we do some tricks, like sharing common strings, the summary can get very large.

    But, its a tradeoff I thought was worth it, rather than using on-disk summaries. The api's are much easier to use, and the problem gets pushed to the user - if they want to have folders with 100K messages, they should expect it to use a bit of memory. The on-disk size of the summaries is very small too, although I guess it could be made even smaller if we consolidated common strings.

    per-message memory use

    Currently, a lot of data gets copied around in memory. Every time you read a message, at least 1 whole copy of the (decoded) message is in memory at a given time (yes, including attachments). For IMAP this can get even worse (2-3 copies of a given attachment at a given time), because it doesn't stream enough. Most of this could use a disk-backing without changing any api's though, and well, i'm rewriting IMAP.

    Wrapping up ...

    And yeah, we're talking 100K messages here, not 1400. My 500Mhz celeron laptop has about 35K messages stored over about 10 mbox files, and it starts up in under 10 seconds, and that includes all of the bonobo/activation overhead (which is very significant). Yeah it uses a bit of memory, but memory is cheap on a personal workstation.

    In short. The current mailbox formats we have suffice for "Unix mail". Add some archiving abilities to your mail client (even RDBMS backed mail clients need archiving), and you'll never have to delete a message again, and still get work done and still use mbox.

    If you want to talk about writing a server - well who cares, you can do whatever you want, because everyone has to go through your interface anyway (you DO NOT want clients accessing data under you, thats what DBMS's are all about in the first place ... and you dont want 1-tier applications), so it doesn't matter what format you use under the belt - you can choose the format which best suits what you're trying to do.

    It seems some people think 1-tier applications (client code talking directly to a database) are the way to go for multi-user environments. They're not, they don't scale and are impossible to maintain. Nobody writes any real software like that anymore, unless you're writing dodgy vb toy apps.

  • Thoughts I had regarding distributed mailboxes by riflemann (Score:1) Wednesday May 29 2002, @03:13AM
  • Use standards by dybdahl (Score:2) Wednesday May 29 2002, @03:14AM
    • 1 reply beneath your current threshold.
  • Back up your critique with some numbers please! by Jack Hughes (Score:2) Wednesday May 29 2002, @03:16AM
  • mbox.funkified by yem (Score:2) Wednesday May 29 2002, @03:32AM
  • Some thoughts (Score:3, Interesting)

    by ChrisJones (23624) <chris@bla c k -sun.co.uk> on Wednesday May 29 2002, @03:33AM (#3600790) Homepage Journal
    There seem to be two discussions going on in the comments today, one about mail storage for an MUA and one for storing mail on servers.
    As far as the client end is concerned, from the point of view of writing an MUA, having an SQL backend is a complete godsend because you have to write virtually no IO code, you can put all the logic in the queries. However, there are some tricks you need to use to keep up the speed, most importantly to use two tables, one for metadata and one for the mails themselves. This keeps the speed up by keeping the metadata table small (maybe on a better RDBMS than MySQL this wouldn't make a difference, but I found that >10,000 mails all in a single table in MySQL got quite slow until I moved the metadata into a separate table).
    The obvious downside of using a DB for client end storage is that you have to have a central DB server, or one on each client and you need to admin one more set of authentication/permission details, plus you can't move the mail very easily to other MUAs. IMO a much better solution would be to keep the use of SQL/RDBMS, but move the DB into the filesystem so you can just have a bunch of files with metadata stored in the fs. Need to make an mbox? "cat ~/mail/* >>/tmp/my_new_mbox".
    From the server point of view, many people have been mentioning Exchange/Domino etc. Personally I can't stand Exchange, I've had to admin it on several occasions and it's generally done everything it can to stop me from having an easy life (just thought I'd air my prejudice against Exchange in the spirit of fairness and honesty ;) I've never used OpenMail/Domino/Notes/whatever, but I guess they do roughly the same thing, which is a pretty good idea. However, these things all have the distinct disadvantage that they use proprietary protocols and aren't particularly cheap. There's always IMAP, which many people really like, but I feel is too complex a protocol (compare with the infant levels of complexity in POP3).
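
    A minimal sketch of that two-table split. SQLite is used here purely as a self-contained stand-in for MySQL, and the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meta (
    id      INTEGER PRIMARY KEY,
    folder  TEXT NOT NULL,
    sender  TEXT,
    subject TEXT,
    date    TEXT
);
CREATE TABLE body (
    id  INTEGER PRIMARY KEY REFERENCES meta(id),
    raw BLOB NOT NULL
);
CREATE INDEX meta_folder_date ON meta(folder, date);
""")


def store(folder, sender, subject, date, raw):
    """Insert the small metadata row and the large raw message separately."""
    cur = conn.execute(
        "INSERT INTO meta (folder, sender, subject, date) VALUES (?, ?, ?, ?)",
        (folder, sender, subject, date),
    )
    conn.execute("INSERT INTO body (id, raw) VALUES (?, ?)",
                 (cur.lastrowid, raw))
    return cur.lastrowid


def list_folder(folder):
    """Folder listings touch only the small metadata table."""
    return conn.execute(
        "SELECT id, sender, subject FROM meta WHERE folder = ? ORDER BY date",
        (folder,),
    ).fetchall()


mail_id = store("inbox", "a@example.org", "hi", "2002-05-29T03:33",
                b"Subject: hi\n\nhello\n")
```

    Listing a folder never has to read past the large message blobs, which is the effect described above.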

    With a colleague of mine, I'm working on a set of POP3 extensions that give some IMAP like features, but is really designed to keep multiple mail clients in sync with each other by way of a transaction log. There are still some limitations, but I think I know what they are and how to fix them (e.g. not enough metadata can be associated with each mail yet). It adds about 6 or 7 commands to POP3 and currently lacks any decent client support, but I have written a fairly usable library and patch to gnu-pop3d for it. I've just submitted it as my University final year project, so I'll try and get the protocol description documentation online soon. In the mean time, if you're interested, it's on SourceForge [sf.net]
  • Didn't we solve this with NNTP? (Score:3, Insightful)

    by speedenator (83485) on Wednesday May 29 2002, @03:43AM (#3600808)
    So NNTP solved this IMHO a rather elegant way...

    You have directories corresponding to newsgroups or mail folders or whatnot. i.e. alt.swedish.chef.bork.bork.bork is really alt/swedish/chef/bork/bork/bork

    Articles are numeric, i.e. \d+ for Perl types. The raw message is stored in each file.

    In each directory, there's a file called .overview, which is just the summary information for all the files.

    Thus, you can have zillions of small files, and happily grep and copy them to your heart's content. But you never do a 'ls' on a huge directory, you always just look through the .overview file. Or grep through it, if you like.
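
    As a rough illustration of the idea (not the actual news-server overview format, whose exact field list is not reproduced here), rebuilding a tab-separated .overview for a directory of numerically named message files could look like this. The chosen headers and the name `build_overview` are assumptions:

```python
import os
from email import message_from_binary_file

FIELDS = ("Subject", "From", "Date", "Message-ID")


def build_overview(folder):
    """Write folder/.overview with one tab-separated line per article."""
    rows = []
    for name in os.listdir(folder):
        if not name.isdigit():
            continue  # skip .overview itself and anything non-numeric
        with open(os.path.join(folder, name), "rb") as f:
            msg = message_from_binary_file(f)
        # Collapse tabs, newlines and folded whitespace inside header values.
        values = [" ".join(str(msg.get(h, "")).split()) for h in FIELDS]
        rows.append((int(name), "\t".join([name] + values)))
    rows.sort()
    lines = [line for _, line in rows]
    with open(os.path.join(folder, ".overview"), "w", encoding="utf-8") as out:
        out.writelines(line + "\n" for line in lines)
    return lines
```

    A full rebuild like this does list the directory once; in normal operation the delivery side would append one line per new article, and readers would only ever open .overview.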

    So, in that sense, it's very much the best of both worlds. And, on the same box, you can specify rules on who can access the folders, so one file can be read by multiple people. Ooh.

    GNUS, an Emacs based mail/news reader, uses a variant of this called nnml, which rocks.

    Of course, when you get down to it, JWZ arguments aside, databases start to really look like what you want, especially on a corporate level when you're tossing the same piece of mail around to tons of different folks.

    -e
  • TheBat! by the_danielsan (Score:1) Wednesday May 29 2002, @04:28AM
  • by ellem (147712) <ellem52@NOsPAm.gmail.com> on Wednesday May 29 2002, @04:46AM (#3600902) Homepage Journal
    Are you trying to tell me that a 5MB empty mailbox is asking too much? A text message that says "Hi!" costing 1.2MB is somehow wasteful?

    Lotus Notes Uber Alles!
  • A alternate proposal by Anonymous Coward (Score:2) Wednesday May 29 2002, @05:29AM
  • What's wrong with your MUA? by jpn-sdot (Score:1) Wednesday May 29 2002, @05:32AM
  • How many users would LIKE to use email by MarkedMan (Score:1) Wednesday May 29 2002, @06:01AM
  • Client issue by gidds (Score:1) Wednesday May 29 2002, @06:08AM
  • the plan 9 approach (Score:5, Interesting)

    by rpeppe (198035) on Wednesday May 29 2002, @06:23AM (#3601060)
    as a basis for an approach i like what plan 9 [bell-labs.com] does. the mail is made available to clients as a filesystem (provided by a user level program). each mail message gets its own directory; each mime attachment gets its own subdirectory within that message (and recursively, as MIME is recursive).

    here's a little transcript:

    % cd /mail/fs/mbox
    % lc
    Directories:
    1 113 128 142 157 171 186 20 214 229 243 258 272 287 300 315 33 344 359 373 388 401 416 430 445 46 474 56 70 85
    [...]
    % cd 318
    % lc
    Files:
    bcc date filename info messageid rawbody sender type body digest from inreplyto mimeheader rawheader subject unixheader cc disposition header lines raw replyto to

    Directories:
    1 2 3
    % head raw
    Return-Path:
    Received: from punt-1.mail.demon.net by mailstore for rog@vitanuova.com
    id 1021665470:10:17045:138; Fri, 17 May 2002 19:57:50 GMT
    Received: from psuvax1.cse.psu.edu ([130.203.4.6]) by punt-1.mail.demon.net
    id aa1016828; 17 May 2002 19:57 GMT
    Received: from psuvax1.cse.psu.edu (psuvax1.cse.psu.edu [130.203.6.6])
    by mail.cse.psu.edu (CSE Mail Server) with ESMTP
    id 27DA4199BE; Fri, 17 May 2002 15:57:13 -0400 (EDT)
    Delivered-To: 9fans@cse.psu.edu
    Received: from acl.lanl.gov (plan9.acl.lanl.gov [128.165.147.177])
    % head body
    This is a multi-part message in MIME format.
    --upas-mbyuptynpdsmbjuyeermihdgur
    Content-Disposition: inline
    Content-Type: text/plain; charset="US-ASCII"
    Content-Transfer-Encoding: 7bit

    Hi,

    If you seek excitement and thrills you need to look no further than
    Plan9 -- it gives you everything and then some, but in a good way (or
    % cd 2
    % lc
    Files:
    bcc date filename info messageid rawbody sender type
    body digest from inreplyto mimeheader rawheader subject unixheader
    cc disposition header lines raw replyto to
    % cat mimeheader
    Content-Type: image/jpeg
    Content-Disposition: attachment; filename=iostats.jpg
    Content-Transfer-Encoding: base64
    % page body
    reading through graphics...
    %
    "raw" contains the raw data that makes up the message. "body" contains the data after the encoding formats have been applied (hence in that case /mail/fs/mbox/318/2/body is a jpeg file, viewable directly by any usual jpeg viewer).

    the beauty of this scheme is that it hides the underlying storage scheme from the mail clients. if i wish to change things so that the underlying storage format is many files [currently it uses a traditional mbox format], none of the mail client programs have to change.
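
    The same kind of view can be approximated outside Plan 9. The sketch below uses Python's standard email package to walk a message recursively and yield a path per MIME part, numbered like the subdirectories in the transcript. It only mimics the layout; it is not how the Plan 9 mail filesystem is implemented, and `walk_parts` is an invented name:

```python
from email import message_from_bytes


def walk_parts(msg, prefix=""):
    """Yield (path, content_type, decoded_bytes) for each leaf MIME part.

    Sub-parts are numbered from 1 and nested as 1/, 2/, 2/1/ and so on.
    """
    if msg.is_multipart():
        for i, part in enumerate(msg.get_payload(), start=1):
            yield from walk_parts(part, f"{prefix}{i}/")
    else:
        # decode=True undoes base64 / quoted-printable transfer encoding.
        yield (prefix + "body", msg.get_content_type(),
               msg.get_payload(decode=True))


raw = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="XX"\n\n'
    b"--XX\nContent-Type: text/plain\n\nHi,\n"
    b"--XX\nContent-Type: application/octet-stream\n"
    b"Content-Transfer-Encoding: base64\n\naGVsbG8=\n--XX--\n"
)
parts = list(walk_parts(message_from_bytes(raw)))
```

    Each tuple corresponds to one "body" file in the tree, already decoded, so an attachment comes out as the original binary.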

    plus i can use grep, diff, shell scripts, etc directly on the messages in my mailbox. procmail eat your heart out.

    • Re:the plan 9 approach (Score:4, Insightful)

      by glv (31611) on Wednesday May 29 2002, @07:06AM (#3601151) Homepage
      You alluded to this, but I know slashdot, and it's worth being explicit about it to avoid all the flames:

      This is not how mail is actually stored on disk in Plan 9. The "real" mail storage is just mbox files. What rpeppe has described is the view that the mail storage system provides to clients.

      I agree it's very sweet, but the question is primarily dealing with the actual storage format.

    • Re:the plan 9 approach by Pegasus (Score:1) Wednesday May 29 2002, @08:54AM
    • Re:the plan 9 approach by Prometheu (Score:1) Wednesday May 29 2002, @10:49AM
  • Use news! by Inode Jones (Score:1) Wednesday May 29 2002, @06:38AM
  • I had some plans, but i need to learn Java/Perl by eatmeat (Score:1) Wednesday May 29 2002, @06:50AM
  • SQL by Julian Morrison (Score:1) Wednesday May 29 2002, @07:10AM
  • One word. by 42forty-two42 (Score:1) Wednesday May 29 2002, @07:27AM
  • Take a look at Mercury by Havokmon (Score:2) Wednesday May 29 2002, @07:27AM
  • I've tought about this, and.. by Fweeky (Score:2) Wednesday May 29 2002, @07:50AM
  • cyrus? by jochen (Score:1) Wednesday May 29 2002, @07:56AM
  • Learn from non-Unix models, too by mwood (Score:2) Wednesday May 29 2002, @08:50AM
  • Oracle Mail by nixkuroi (Score:1) Wednesday May 29 2002, @09:26AM
  • Know your system administrator by gypsyx (Score:1) Wednesday May 29 2002, @09:27AM
  • Maildirs are not slow by DrProton (Score:1) Wednesday May 29 2002, @10:08AM
  • Maildir for 1000+ messages? by Jobe_br (Score:2) Wednesday May 29 2002, @10:14AM
  • I don't think it matters by nsayer (Score:1) Wednesday May 29 2002, @11:53AM
  • Time based file systems for mail storage.. by SpamapS (Score:1) Wednesday May 29 2002, @12:21PM
  • Cyrus by Lumber Cartel Czar (Score:1) Wednesday May 29 2002, @12:52PM
  • by mellon (7048) on Wednesday May 29 2002, @01:01PM (#3603223) Homepage
    I don't think it makes sense to store email in dbm files. It's too sketchy - what happens when the dbm file gets corrupted? The nice thing about flat files is that if something goes wrong, you can fix it with vi.

    I think the right solution to the problem is to key off the message ID, which is supposed to be unique. Then define a mail folder as simply a list of message IDs. Messages can appear in more than one folder, but hopefully not in no folders.

    To make this efficient, I'd hash the message ID, and use a hierarchy of directories, because Unix doesn't do well with large flat directories. The hierarchy could auto-extend, so that as one subdirectory fills up, you do a sub-hash and split it into more directories.
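
    A minimal sketch of the hashed layout, with a fixed depth rather than the auto-extending split described above. SHA-1 and two hex characters per level are arbitrary choices for the example, and `message_path` is a made-up name:

```python
import hashlib
import os


def message_path(root, message_id, depth=2):
    """Map a Message-ID to root/ab/cd/<full digest> (for depth=2)."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    # Two hex characters per level gives at most 256 entries per directory.
    levels = [digest[2 * i:2 * i + 2] for i in range(depth)]
    return os.path.join(root, *levels, digest)


path = message_path("store", "<3600443@example.org>")
```

    A folder is then just a text file listing Message-IDs, and resolving an entry to its storage location needs no directory scan.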

    The problem of tiny files is a real one. The solution is probably to make the bottom of a hash a file rather than a directory, and store more than one message in each such file. You don't have to store a lot of messages in these files to win - even ten messages would produce a big win, and would be pretty efficient.

    The format of the individual files should probably be indexed sequential access - that is, a TOC at the front, and then the contents as plain text, nothing fancy. The TOC should be in ASCII, not binary, and you should be able to rebuild the TOC by looking at the file.

    Babyl used to use a control character as a delimiter, which worked pretty nicely - much better than using "^From ". Ever seen >From in an email message? That's because Unix mail uses "^From " as an inter-message delimiter, so it has to quote it, and it does so stupidly. So use ^_ as a delimiter, and if ^_ appears in the email message, just double it. Take a doubled ^_ out when reading a message.
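
    A sketch of the doubling scheme. One detail is an assumption added here: each message is terminated by ^_ followed by a newline rather than a bare ^_, because with a bare delimiter a message that itself begins with ^_ would be ambiguous after doubling. The names `pack` and `unpack` are illustrative:

```python
US = b"\x1f"        # ^_ (ASCII unit separator)
END = US + b"\n"    # terminator: ^_ then newline (assumption, see above)


def pack(messages):
    """Double every ^_ inside a message and append the terminator."""
    return b"".join(m.replace(US, US + US) + END for m in messages)


def unpack(data):
    """Split packed data back into messages, undoing the doubling."""
    messages, current, i = [], bytearray(), 0
    while i < len(data):
        if data[i:i + 1] == US:
            nxt = data[i + 1:i + 2]
            if nxt == US:        # doubled ^_ is a literal ^_
                current += US
                i += 2
                continue
            if nxt == b"\n":     # ^_ plus newline ends the message
                messages.append(bytes(current))
                current = bytearray()
                i += 2
                continue
            raise ValueError(f"stray ^_ at offset {i}")
        current.append(data[i])
        i += 1
    if current:
        raise ValueError("trailing data without terminator")
    return messages


samples = [b"plain\n", b"\x1fstarts", b"ends\x1f", b"mid\x1f\nline", b""]
packed = pack(samples)
```

    Unlike ">From " quoting, this round-trips exactly: unpack(pack(x)) returns x for any message contents.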

    As for compression, I don't think it's worth doing at first. Disk space is cheap. Yes, my email folder is pretty huge, but it's really not a major problem. Making the storage system extra-complicated by uncompressing MIME is something to add on after you've got something more basic that works - you don't have to solve every problem all at once.

    As for folder scan performance, you can make a cache, and have the mail program scan the cache from time to time when it's idle to clean up errors. This is much better than trying to come up with a format that's optimized toward folders - if you try to optimize toward folders, you wind up creating all kinds of problems, IMHO.
  • Throwing out the baby with the bathwater. by Anonymous Coward (Score:1) Wednesday May 29 2002, @01:22PM
  • AMS : The Andrew Messages System by mpb (Score:1) Wednesday May 29 2002, @01:37PM
  • Maildir and 1000+ Mails by PCGod (Score:2) Wednesday May 29 2002, @02:00PM
  • One file per e-mail part is the only way to go by rici (Score:1) Wednesday May 29 2002, @04:54PM
  • Evolution == mbox, mh, or Maildir by RonVNX (Score:1) Wednesday May 29 2002, @04:55PM
  • abstraction.... by perlchild (Score:1) Wednesday May 29 2002, @06:17PM
  • Re:All in one solution by ryochiji (Score:1) Tuesday May 28 2002, @10:59PM
  • Re:begat by pixel.jonah (Score:1) Tuesday May 28 2002, @11:51PM
  • Re:Want to save space? by Jarvo (Score:1) Wednesday May 29 2002, @12:27AM
  • Re:Want to save space? by GravySkin (Score:1) Wednesday May 29 2002, @07:27AM
    • 1 reply beneath your current threshold.
  • Re:/., come for the intellectual discussions by hazen_vs (Score:1) Wednesday May 29 2002, @01:42PM
  • Re:and now for some politeness. by Inthewire (Score:1) Thursday June 06 2002, @06:50PM