Unix Operating Systems Software

Organizing Large Volumes of Email?

Posted by Cliff
from the well-the-mbox-format-does-suck-for-searching dept.
Trixter asks: "Like most nerds, I receive a large volume of email that I archive in several files and directories in a filesystem. This is inefficient, especially when it comes to searching for an old or obscure bit of information. I can imagine several better ways to organize email for archival and lookup, but has anyone already done this? I want to try avoiding reinventing the wheel for the tenth time this year. By 'better ways', I'm talking about all solutions--from the Perl monger 'one 10-line script will do the trick' perl script to parse up a long mbox-format file into little bits for intelligent grepping, to maybe an elegant 'mbox-format file to SQL database' loader/translator script and a series of SQL statements to support searches. Please, help me organize my gigabytes-long, decade-long email archive!"

  • But some email programs out there do a good job of archiving email.

    For example, in Lotus Notes, you can create an archive database of your email and then perform full-text searches on it. The solution is very easy and extremely non-technical. I'm sure there are other email programs that allow archiving and similar features.

    1. Print.
    2. File.
    3. Delete.
    You weren't looking for something elegant, were you? :)
  • by kevin42 (161303) on Wednesday September 06, 2000 @07:15AM (#800688) Homepage
    I've got the same problem, and I've handled it by sorting my email into different mbox files based on content and/or who it's to/from. Then I used a bit of Perl to take messages older than a cut-off date and archive them into a directory with an identical layout (i.e. the same folder configuration).

    so I have something like this:

    Pending
    Misc
    Friends
    Pre-1998\Pending
    Pre-1998\Misc
    Pre-1998\Friends

    Then I use hypermail to create an html archive of everything nightly, and put it into a password protected directory on my webserver. Then I use a regular web based search engine for searching.

    Right now I'm playing around with doing all my mail via a web interface (using aeromail->imap) so I can access it securely (SSL) anywhere. It's working pretty well; I just need to figure out a good way to notify me when I get a new message (I'm thinking of an ICQ bot that sends me a message or something...)

    I hope that helps. I'd be happy to work with anyone who wants to create a better shrink-wrapped system for managing large amounts of old email. To me it's important that, however it's stored, it is very portable, since I've changed email clients a lot over the years.
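
The cut-off-date archiving kevin42 describes could be sketched roughly as follows. This is not his Perl script (which is not shown in the thread); it is a minimal Python illustration using the standard `mailbox` module, and the function name `archive_before` and its arguments are hypothetical. It assumes each message carries a parseable Date header and leaves any message without one where it is.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from pathlib import Path


def archive_before(mbox_path, archive_path, cutoff):
    """Move messages dated before `cutoff` from one mbox file into another.

    `archive_before` is a hypothetical helper name; `cutoff` must be a
    timezone-aware datetime. Returns the number of messages moved.
    """
    # Create the archive directory (e.g. "Pre-1998") if it is missing.
    Path(archive_path).parent.mkdir(parents=True, exist_ok=True)
    live = mailbox.mbox(str(mbox_path), create=False)
    old = mailbox.mbox(str(archive_path))
    moved = 0
    live.lock()
    try:
        for key in list(live.keys()):
            msg = live[key]
            try:
                sent = parsedate_to_datetime(str(msg.get("Date", "")))
            except (TypeError, ValueError):
                continue  # missing or unparseable date: leave it in place
            if sent.tzinfo is None:
                # Assumption: treat dates without a zone as UTC.
                sent = sent.replace(tzinfo=timezone.utc)
            if sent < cutoff:
                old.add(msg)
                live.remove(key)
                moved += 1
        old.flush()   # write the archive first, then rewrite the live file
        live.flush()
    finally:
        live.unlock()
        live.close()
        old.close()
    return moved
```

Calling `archive_before("Friends", "Pre-1998/Friends", datetime(1998, 1, 1, tzinfo=timezone.utc))` once per folder would reproduce the parallel layout listed above. Because both sides stay plain mbox files, the result remains readable by any client that understands mbox.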

  • I use procmail (http://www.procmail.org) to sort incoming mail into Maildir (http://qmail.org) directories, instead of mbox files. I sort by subject, sender, etc. This lets me use grep, etc. and get a better fix of where the message I'm interested in is located.

    I also use Mutt (http://www.mutt.org), and since it knows about the Maildir format, mail is pre-sorted before I even see it!

    After a few years, however, even this approach runs out of steam... but it's still more automatic than deliberately saving mail into different folders as you read it, and there's virtually no scripting/coding required.

    b.g.
  • One, create an intranet (yes, I hate that word too, but everyone knows what I'm talking about) webserver. Use something like MHonarc [mhonarc.org] to archive the mail, and then install a search engine on the webserver.

    Two, and I like this way better, write a couple of tools to handle this (perhaps based on MHonarc, heh) which will stick your mail in a database (Interbase, Postgresql, Mysql, whatever) and then do yet another web (or otherwise) application which will let you read them out and search them, interactively.

    I personally hold onto my mail by renaming my mailbox (For instance, on the 240sx.org mailing list) to 240sx-preYYYYMMDD, then creating a new 240sx mailbox. I use Mozilla on win32 for email (despite the amazing crashiness - It feels like I'm using Corel software or something) and it has a fairly decent search facility.

  • by human bean (222811) on Wednesday September 06, 2000 @07:43AM (#800691)
    As mentioned elsewhere, ASCII text files are the best way of dealing with long term storage of email.

    The other necessary item is the fastest file content search utility you can lay hands on.

    Indices, tables of contents, folders, categories, catalogs, and directories will always have misfiles after a certain period of time, and besides, who wants to categorise all that mail anyway? Dump it into a few (time-based?) directories and simply search for things when you need them.

  • Lotus Notes is closer to a proprietary database solution than an email client. It's got its own entry [iarchitect.com] in the Interface Hall of Shame [iarchitect.com]. Some of its developers admit the email functionality is limited and also think there should be formal training before being allowed to use Notes. [iarchitect.com]

    Sorry, seeing "very easy and extremely non-technical" associated with Lotus Notes kind of set me off...

  • ...if you have e-mail from mostly the same people, I find my way to be quite useable. Once I read a message, I dump it into a folder named after the person I got it from... Generalized mail gets thrown in various other folders such as "Online shopping", "Passwords", "From my ISP", etc... everything else gets thrown in "Misc".

    Your library of e-mail sounds complicated enough to warrant dumping into a nice PostgreSQL database or something and writing a frontend to search the thing... You could use Perl... I'd probably play with PHP on my web server. That's about the best idea I can come up with.
  • .
    Final thought: don't keep everything. Mailing lists are usually deleted on read. Any mailing list worth reading (and worth reading two years from now) is being archived on the web somewhere. I see no need to create a local mirror. The hardest part of being an archivist is knowing what to throw away.

    Yeah Matt, but you also keep archives of old BBSes around as well. :) Where's Panther's Byte on the Web?

    Actually, I'd second the ASCII method, but add that if you don't want to use mail(1) (because you want to set up and access filters || triggers || subfolders while browsing html messages or threaded subject lists), the mbox format has been around for long enough that I feel confident in leaving my email in that format.

    There are plenty of nice GUI interfaces to mbox (I use KMail from KDE2). Just make sure when you choose your mail software that it won't permanently delete *anything* (as most do by default). If you trash something, it should go into an "I probably won't need to see this again" file.

    I finally gave up about a year ago, and started trashing Spam. Until then, I had every single piece of email I'd gotten via the net. Unfortunately, much of my early to mid 80's stuff is on AppleDOS 3.3 disks. Post PC/MS-DOS, I have everything in one mega /pub directory which (after deleting pr0n) barely makes it onto one DDS3 DAT tape.

    And one :~( 40 meg RLE drive that I can't access.

    --
    Evan

  • I use Xemacs with the Gnus newsreader and the nnml back end (which stores each message in its own file) for my mail, with each message area in its own directory. It works fairly well; performance on a dual 466 MHz Celeron with 128MB RAM is good. It takes about 1.5 min to open, or to switch from sort by date to threaded, sorted by date, in my largest message area, which has over 15000 unread messages totaling over 82MB, and a modest number of read and saved messages.

    Right now, it hides deleted messages and deletes them after 30 days (handy now and then); this also saves me the hassle of emptying a trash can.

    If I used scoring, I could do a lot of neat things, such as deleting unread messages that I am not likely to find interesting after a certain amount of time, or having me go straight to likely interesting messages when I open a message area.

    I use BBDB for an address book (new homepage: http://www.waider.ie/hacks/emacs/bbdb/ ), which can be synced with a Palm Pilot (just started to work): http://home.rochester.rr.com/tsdeweese/SyncBBDB.html

    Over all, I'm happy, I just need to learn a bit of emacs lisp.
  • I do the same, and use ht://Dig to go through mail...
  • Use an indexing and searching tool such as Harvest [freshmeat.net]; look at the several tools which index files and choose what looks best.

    You first could run your archive into something like a Hypermail-style archiver [freshmeat.net] if you prefer HTML (in addition to whatever indexes the archiver creates).

  • by InitZero (14837) on Wednesday September 06, 2000 @07:29AM (#800698) Homepage

    I've got everything I've ever written on a computer -- email and else -- since around 1981.

    It ain't always pretty but I'll tell you the secret to my success. Plain text. ASCII. If you want to be able to read what you've written now ten years from now, keep it ASCII.

    Thanks to a basic format, I was able to convert my TRS-80 Model I tapes to Model 4 disks and my Model 4 disks to the 20-meg drive in my first Tandy 1000. From there it has been easy. My new hard drive is always huge compared to the last one, so my old data usually takes up a third of the new disk. No big deal.

    I read all my email with 'mail', thus protecting myself from viruses and funky email formats (Eudora, Outlook, CCMail, etc.). At the end of the month, my mailbox is dated, rotated and gzipped. The header information (Date, From and Subject) is added to a master index file along with the filename where the message can be found.

    I've got a few ugly scripts that will search by keyword so I can find old stuff.
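
A master index of the kind InitZero describes might look something like the sketch below. His own scripts are not shown, so everything here is an assumption: the tab-separated index format, the function names `index_mbox` and `search_index`, and the choice to index the mailbox before it is gzipped. It uses only Python's standard `mailbox` and `csv` modules.

```python
import csv
import mailbox
from pathlib import Path


def index_mbox(mbox_path, index_path):
    """Append one row per message (file, Date, From, Subject) to a TSV index.

    Hypothetical helper: run it on the month's mailbox before compressing it.
    Returns the number of messages indexed.
    """
    box = mailbox.mbox(str(mbox_path), create=False)
    count = 0
    with open(index_path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for msg in box:
            writer.writerow([
                Path(mbox_path).name,          # where the message lives
                str(msg.get("Date", "")),
                str(msg.get("From", "")),
                str(msg.get("Subject", "")),
            ])
            count += 1
    box.close()
    return count


def search_index(index_path, keyword):
    """Return index rows whose From or Subject contains `keyword`.

    The match is a case-insensitive substring test on the header text only.
    """
    needle = keyword.lower()
    with open(index_path, newline="", encoding="utf-8") as fh:
        return [
            row for row in csv.reader(fh, delimiter="\t")
            if needle in row[2].lower() or needle in row[3].lower()
        ]
```

Each hit names the archive file to open, so only that one file needs to be decompressed and searched. The index itself stays plain text, which keeps with the argument for ASCII made above.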

    Yes, I'm living in the stone age. Those of you able to read your email going back to the early 1980s feel free to throw stones.

    I think putting the stuff in a database would be a bad idea. When you change platforms, there will be maintenance. When you change databases, there will be maintenance. With plain ASCII text, you know you'll always be able to read it and you never have to upgrade. (Okay, by using gzip, there may be some extra effort on my part. A few years ago, I moved everything from compress to gzip. That's not the same thing as going from Oracle to Sybase, however.)

    Final thought: don't keep everything. Mailing lists are usually deleted on read. Any mailing list worth reading (and worth reading two years from now) is being archived on the web somewhere. I see no need to create a local mirror. The hardest part of being an archivist is knowing what to throw away.

    InitZero

  • Good point, and it keeps with The Unix Philosophy (sounds like a Greek god, cue shaft of light through cloud). I have done that with email I consider precious (email from my girlfriend, mostly), and I'll probably start a system like yours when I get a little time.

    But, (and there's always a big but) kevin42 [slashdot.org] had a good point [slashdot.org] about email access when away from 'the big box'. I just moved home from the dorms, and I'm using my father's home machine for net access, so instead of Eudora or Opera for my email, I'm using Hotmail. [shiver]

    I would really like to be able to put my recent email in a web searchable format that I can access from anywhere. Maybe even include PGP/GPG options, if access is via SSL. Basically, I want to be able to get to my email from other machines; whether that means SSL, sniffable web access, or carrier pigeon, I don't really care. I just want to read and send email.

    Louis Wu

    "Where do you want to go ...

  • I use IMHO [lysator.liu.se] to do pretty much what you describe: One box running IMAP that I either access from a normal client, or via the web mail interface. It works great for me.
    Plus, I can keep an eye on the security of my mail, and be (reasonably) sure that no one is snooping around.
    --
  • Oh, and I use Xemacs itself to filter my mail. I used to use procmail with VM, but with Gnus, this is easier.
  • One minor point: Eudora keeps its mail in standard ASCII mbox format.
  • While it's probably not ready for prime-time yet, won't Helixcode's [helixcode.com] Evolution [helixcode.com] eventually do all these things? I know it won't be the perfect solution for everyone, but its way of storing queries as "folders" and other features should eventually make it quite powerful.
  • Thanks for your suggestion, but this is what I'm already doing and it's not cutting the mustard any more. Searching, for example, takes ages:

    find ~/Mail -type f -exec grep -li "$WORD" {} \;

    There's not much you can do to speed this up, sadly, and the number of files/directories/folders/mailspools is becoming unmanageable.

    As for saving *everything*, I don't. I just happen to receive 50+ emails every day that have information in them I need to file.

    PS: Everything you've written back to 1981 including (I'm assuming) Super Scripsit files? Kick ass!

  • I think putting the stuff in a database would be a bad idea.

    Although once you have all your mail in a database it is a simple step to write a script to write it back to plain text again. Or even to any other format you want.
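
Both halves of this argument, loading an mbox file into a database and writing it back out to plain text, can be sketched in a few lines. The example below is an illustration only: it uses SQLite from Python's standard library rather than any of the databases named in this thread, and the table layout and the names `load_mbox`, `search` and `export_mbox` are made up for the purpose. It also assumes the mail is ASCII or UTF-8; bytes in other encodings would be replaced on load, so the round trip is not guaranteed to be byte-exact.

```python
import mailbox
import sqlite3


def load_mbox(db, mbox_path):
    """Copy every message in an mbox file into a `mail` table; return the count."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "id INTEGER PRIMARY KEY, folder TEXT, sender TEXT, "
        "subject TEXT, sent TEXT, raw TEXT)"
    )
    box = mailbox.mbox(mbox_path, create=False)
    rows = [
        (
            mbox_path,
            str(m.get("From", "")),
            str(m.get("Subject", "")),
            str(m.get("Date", "")),
            # Assumption: decode as UTF-8, replacing undecodable bytes.
            m.as_bytes().decode("utf-8", errors="replace"),
        )
        for m in box
    ]
    box.close()
    db.executemany(
        "INSERT INTO mail (folder, sender, subject, sent, raw) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    db.commit()
    return len(rows)


def search(db, word):
    """Return (id, sender, subject) for messages whose text contains `word`.

    Uses SQL LIKE, so `%` and `_` inside `word` act as wildcards.
    """
    cur = db.execute(
        "SELECT id, sender, subject FROM mail WHERE raw LIKE ? ORDER BY id",
        (f"%{word}%",),
    )
    return cur.fetchall()


def export_mbox(db, out_path):
    """Write every stored message back to a plain mbox file; return the count."""
    out = mailbox.mbox(out_path)
    count = 0
    for (raw,) in db.execute("SELECT raw FROM mail ORDER BY id"):
        out.add(raw.encode("utf-8"))
        count += 1
    out.close()
    return count
```

A `LIKE '%word%'` query still scans every row, so this mainly buys structured queries on sender, subject and date; a real full-text index would be a separate step. The export function is the "simple step" back to plain text: the resulting file can be read with any mbox-aware client.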

  • Sounds like a solution to your problem would be IMAP + a web frontend so that you can read/post from anywhere. IMAP lets you read and organize your email from anywhere, as the mail is kept on the server. There's an IMAP web frontend written in PHP that you could implement fairly easily, if you had nice (read: compiler) access to a box that was always on the 'net...
  • OTG software (www.otg.com) has a product called email Xtender that can do LOTS of the cool stuff you are looking for: automatically copy every email and attachment into an enterprise message center; harness a powerful full-text search engine to give complete access to your messages and attachments; reduce email server stress and bottlenecks by automatically classifying and migrating messages and attachments; and so on. It works with MS Exchange, sendmail, Notes, and a few others. These folks also make some real quality products for document imaging and storage management. Give them a call; they will be happy to come demo it for you.
  • I find MH works well. Plain old text in separate files. I use exmh as my wrapper. Use procmail to sort incoming messages. Use ifile to automatically sort the rest (it watches you move messages & will start doing it for you).

    Then I use glimpse to index everything. I'll admit I'd like a better search engine, but it works well enough and exmh has a nice interface.
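
The reason an index beats running grep over every file is that the expensive pass over the mail happens once, and each later query is a dictionary lookup. The toy inverted index below illustrates the idea in Python; it is not how glimpse works internally, and the names `build_index` and `lookup` are hypothetical. It assumes one message per file, as with MH or Maildir, and keeps the whole index in memory.

```python
import re
from collections import defaultdict
from pathlib import Path

# Words are runs of ASCII letters and digits; everything else is a separator.
WORD = re.compile(r"[A-Za-z0-9]+")


def build_index(root):
    """Map each lowercased word to the set of files under `root` containing it."""
    index = defaultdict(set)
    for path in Path(root).rglob("*"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            for word in set(WORD.findall(text.lower())):
                index[word].add(str(path))
    return index


def lookup(index, *words):
    """Return the sorted paths of files that contain every one of `words`."""
    sets = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []
```

After `index = build_index("Mail")`, a call such as `lookup(index, "backup", "tapes")` returns the message files containing both words without touching the disk again. A practical version would persist the index and update it as new mail arrives, which is the job the existing indexing tools mentioned in this thread already do.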

    I had a script that would go through a folder & refile into new subfolders based on year and month too.

    All I need to make it perfect is a text wrapper for MH that can navigate subfolders so I can have reasonable speed & usability over a remote link.
  • There's a program called 'rgrep' that will do recursive searches. That should work much better instead of anything having to do with 'find'.

    Also, something like 'find . -type f | xargs grep foo' will usually run faster than trying to get find to do the dirty work.
