Communications Data Storage

Best Way To Archive Emails For Later Searching? 385

Posted by timothy
from the shrink-wrap-them-and-use-sandwich-bags dept.
An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g., Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"
This discussion has been archived. No new comments can be posted.

  • IMAP (Score:5, Informative)

    by klingens (147173) on Monday September 06 2010, @11:44AM (#33488884)

    An IMAP server (dovecot, cyrus, courier) of your choice for Linux. If you don't have a Linux server you can always run it inside a small VM.

  • Not Much (Score:2, Informative)

    by maxume (22995) on Monday September 06 2010, @11:51AM (#33488950)

    It isn't particularly platform independent (because no one is paying much attention to Windows), but Not Much offers threads and full text search:

    http://notmuchmail.org/ [notmuchmail.org]

  • Gmail? (Score:5, Informative)

    by spiffydudex (1458363) on Monday September 06 2010, @11:52AM (#33488956)

    While not open source, Gmail has a good search engine that isn't sluggish. Plus it has roughly 7.5 gigs of space to store data. Use IMAP to push all of your emails to the server and then use that Gmail account for archive email only.
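
A rough sketch of the "push everything over IMAP" step, assuming the old mail has already been exported to an mbox file. The function name `upload_mbox` and the folder name `Archive` are made up for illustration; `conn` is assumed to be an already logged-in `imaplib.IMAP4_SSL` connection (or anything with a compatible `append` method).

```python
import mailbox


def upload_mbox(conn, mbox_path, folder="Archive"):
    """APPEND every message in a local mbox file to an IMAP folder.

    Returns the number of messages uploaded. `conn` is assumed to be a
    logged-in imaplib connection; the folder is assumed to exist already.
    """
    box = mailbox.mbox(mbox_path, create=False)
    uploaded = 0
    try:
        for key in box.iterkeys():
            raw = box.get_bytes(key)  # full message as bytes, headers included
            # flags=None and date_time=None let the server pick its defaults;
            # the message's own Date header travels inside `raw` untouched.
            status, _ = conn.append(folder, None, None, raw)
            if status != "OK":
                raise RuntimeError(f"server refused message {key!r}: {status}")
            uploaded += 1
    finally:
        box.close()
    return uploaded


# Hypothetical usage (host and credentials are placeholders):
#   import imaplib
#   conn = imaplib.IMAP4_SSL("imap.example.com")
#   conn.login("user@example.com", "app-password")
#   upload_mbox(conn, "old-mail.mbox", "Archive")
```

Uploading one message per round trip is simple but slow; for hundreds of thousands of messages, expect this to run for hours and plan to resume after a dropped connection.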

  • Re:IMAP (Score:2, Informative)

    by hedwards (940851) on Monday September 06 2010, @11:53AM (#33488964)
    Yeah, IMAP is the way to go. Personally, I use IMAP on my email account and mailstorehome [mailstore.com] to do the actual download and backup. The OP will probably end up having to set up a personal server to get the program to download the older mail, but that can be done easily enough via a virtual machine.
  • Maildir (Score:5, Informative)

    by alexhs (877055) on Monday September 06 2010, @11:56AM (#33489006) Homepage Journal

    Maildir [wikipedia.org].

    And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread

  • Good IMAP Server (Score:5, Informative)

    by caffeinejolt (584827) on Monday September 06 2010, @11:58AM (#33489022)
    If this is really important to you, and you want it all to work across multiple workstations/OSes, your best bet will be to store it all in IMAP [wikipedia.org]. If you have the means and motivation to run this yourself, I would recommend Dovecot [dovecot.org]. If you don't have the means and motivation, then you can use a service like Gmail to run your IMAP although you give up certain freedoms in doing so. For example, I use Dovecot coupled with Maildir++ [wikipedia.org] as the physical storage format - as a result I can (if I wanted to) change to any email client I wish very quickly, use different email clients at the same time, etc.
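
For anyone going the Maildir route with old mbox files lying around, here is a minimal sketch of the conversion using Python's standard `mailbox` module. The function name `mbox_to_maildir` is hypothetical, and this assumes a plain Maildir rather than the Maildir++ folder layout Dovecot can also use.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(str(mbox_path), create=False)
    dest = mailbox.Maildir(str(maildir_path), create=True)
    copied = 0
    try:
        for message in src:
            # Wrapping in MaildirMessage carries mbox status flags
            # (read, replied, ...) over to their Maildir equivalents.
            dest.add(mailbox.MaildirMessage(message))
            copied += 1
        dest.flush()
    finally:
        src.close()
        dest.close()
    return copied
```

The source mbox is only read, never modified, so it is safe to rerun into a fresh destination if something goes wrong partway through.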
  • citadel (Score:4, Informative)

    by samjam (256347) on Monday September 06 2010, @11:59AM (#33489034) Homepage Journal

    citadel at www.citadel.org is a full pop3/imap server with full-text indexing.

    Thunderbird can use server-side searches to find messages, and I find that works pretty well.

  • by Anonymous Coward on Monday September 06 2010, @12:11PM (#33489138)
    To help spare you the precious keystrokes it would take to Google this yourself, you can go straight to “Google Apps for Businesses [google.com]” and sign-up. Now did you really have to Ask Slashdot?
  • Re:Maildir (Score:4, Informative)

    by El_Muerte_TDS (592157) <elmuerte@@@drunksnipers...com> on Monday September 06 2010, @12:12PM (#33489152) Homepage

    mairix is a useful addition to a maildir setup: http://www.rpcurnow.force9.co.uk/mairix/ [force9.co.uk]

  • by twitter (104583) * on Monday September 06 2010, @12:24PM (#33489252) Homepage Journal
    Kmail has an excellent .pst converter that will pull out your old Outlook mail. Once you have it in Kmail, you can drag and drop it into any of the supported formats, mbox, mdir etc. If you have already established filters, you can let them sort things out. If not you can use a manual search for to, from, mail list, subject, etc. From there you can run your imap. I carry everything around on my laptop and use kmail instead of using imap. With full drive encryption and xscreensaver, I don't have any worry about losing private information and know that my ISPs have better collections of my email anyway, despite what they say about size limits. I could use Gmail's imap instead of my own but prefer to suck my gmail out with kmail's imap support. Until US networks get more reasonable, I want my mail with me instead of on my own server and I would not advise anyone to leave their mail on someone else's server without having a copy yourself. Because your question is all about search, I have to plug Kmail again. With proper organization of your mail into subfolders for friends, family, lists, companies and projects, mail searches are quick, even on modest hardware like my ancient PIII laptop. Searching everything takes a little longer, but it is not such a burden. Evolution may do as well but something about Gnome turns me off. The only downside is that the 3.5 branch does not seem to be able to search through encrypted mail but I imagine there's some gpg-agent fix for that I'm not aware of.
  • Re:Not Much (Score:4, Informative)

    by koiransuklaa (1502579) on Monday September 06 2010, @12:29PM (#33489296)

    +1

    Notmuch can manage absolutely insane amounts of email without any artificial 'archiving'. Of course, if you are looking for a program that does something other than tagging and searching (like sending, composing or receiving email), you need to look elsewhere.

  • Re:IMAP (Score:3, Informative)

    by wealthychef (584778) on Monday September 06 2010, @12:36PM (#33489360)
    Well, on OS X the searching problem *should* be solved by Spotlight, as it indexes "all files on your hard drive" (not) into constant-time searches automagically. The trouble with Spotlight is that Apple does not search all folders and I do not know of a way to enable it to search all folders. If you import it into Mail.app, you do get the indexed behavior, and my situation is similar to yours, and I do exactly that. But all those billions of old messages, I keep in an archive that I never look at.
    Anyhow, look into Spotlight on OS X. Ooops, you said "open sourced," right? Damn. I don't know, then.
  • Re:Maildir (Score:2, Informative)

    by Anonymous Coward on Monday September 06 2010, @12:43PM (#33489416)

    Maildir [wikipedia.org].

    And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread

    With the proviso that you probably want to break up your archives in something akin to the following format:
    . 2009
    . . Q1
    . . . Sent
    . . . Received
    . . Q2
    . . . Sent
    . . . Received
    [...]
    . 2010
    . . Q2
    . . . Sent
    . . . Received

    Lots and lots of messages in a directory can cause problems with many file systems. If you have more than, say, ~8K messages in a folder, I'd recommend breaking it up. This is what I do at work (CY/Qx/Sent-Received), which also allows me to move entire quarters into PST files when I hit my quota, while keeping a bunch of stuff online for access via web mail. Instead of quarters you can also do "halves" (H1 for Jan-Jun, H2 for Jul-Dec) and then Sent/Received within that.

    For my personal account, I generally haven't broken 4K messages per year yet, and so simply have (CalendarYear/Sent-Received). I currently have everything going back to about 1997 or so.

    Except for sorting mailing list traffic, I don't use folders besides Sent & Received: new messages go into my Inbox or a list folder. If I want to keep the message, it always goes into Received regardless of where it originally arrived. I try to clear out the mailing list folders daily so there's no build-up (select-all, delete). If I want a message that I got, I know it's going to be in one of the Received folders, so I don't have to go digging in a dozen different places trying to figure out some sort of classification system.

    Work is Exchange/Outlook; personal is IMAP.
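
The year/quarter/Sent-Received scheme above is easy to compute from a message date. A small sketch, with the caveat that the function name `archive_folder`, the `split` parameter, and the `/` separator are all assumptions for illustration (a Maildir++ server would typically use `.` between folder levels):

```python
from datetime import date


def archive_folder(message_date: date, sent: bool, split: str = "quarter") -> str:
    """Return a folder path such as '2009/Q1/Sent' for a message date."""
    leaf = "Sent" if sent else "Received"
    if split == "year":
        # Low-volume accounts: one Sent and one Received folder per year.
        return f"{message_date.year}/{leaf}"
    if split == "quarter":
        period = f"Q{(message_date.month - 1) // 3 + 1}"
    elif split == "half":
        period = "H1" if message_date.month <= 6 else "H2"
    else:
        raise ValueError(f"unknown split: {split!r}")
    return f"{message_date.year}/{period}/{leaf}"
```

For example, `archive_folder(date(2009, 2, 14), sent=True)` gives `2009/Q1/Sent`, and the same date with `split="half"` gives `2009/H1/Sent`.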

  • Echo chamber... (Score:5, Informative)

    by MrNemesis (587188) on Monday September 06 2010, @01:00PM (#33489522) Homepage Journal

    ...has me doing a "me too!" to everyone telling you to use IMAP + maildir; I use dovecot myself, complete with self-signed SSL cert (curse you firefox!).

    El_Muerte_TDS [slashdot.org] has just pointed me towards mairix [debian.org], a dedicated maildir + friends indexing system which I've just tried out, and seems to be ideal for my use - fast email search has always been a good thing for me, but I've rarely found a nice lightweight indexing solution that was catered only to mail; "desktop" search engines tend to take the opinion that if I want one thing indexed then I automatically want everything indexed, and also insist on running around the clock. Much nicer for my needs to just have one little lightweight indexing program that only runs when I want it to.

    Best thing about mairix IMHO is the way it creates a virtual maildir on the fly using symlinks, so not only is it easily viewable on the command line, it's also automatically compatible with all of those IMAP + maildir clients out there... which, last time I looked, was all of them. Useful hack for KMail users here [netmanagers.com.ar].

    Disclaimer: my IMAP server has all its databases on an SSD, so even full text searches from the client are pretty speedy (seriously - the lack of access times on small chunks of random data cuts down search times by at least an order of magnitude), but obviously mairix has the advantage of being able to scale to multiple users with >X GB mailboxes much easier than spending a fortune on fast storage.

  • Re:IMAP (Score:4, Informative)

    by 19thNervousBreakdown (768619) <davec-slashdot AT lepertheory DOT net> on Monday September 06 2010, @01:20PM (#33489688) Homepage

    Seconding this. I've been using Dovecot with Maildir on EXT3 for the last few years--my mailbox is about 25k messages, which I keep all in a single folder and use IMAP tags to organize into different virtual folders, much like Gmail's system but without the privacy concerns.

    Dovecot's supplementary indexes make everything extremely fast (tags, dates, etc.), and anything they don't catch Thunderbird does; I can search my entire mailbox for a single word in less than a second. I lose my Thunderbird indexes whenever I move to a new computer, but that's just a matter of leaving the client up for a few hours.

  • Re:Maildir (Score:3, Informative)

    by Sancho (17056) * on Monday September 06 2010, @01:24PM (#33489714) Homepage

    mairix is good, but it has some warts and it is not under development anymore. Among other things, it can run out of memory, has problems with parsing certain multipart messages, and can't search for an IP address (or any other string with dot-separated tokens.)

    It's about the best I've found, but I wish someone would pick up development and fix some of the issues. As time goes on, bit-rot is going to set in and mairix will get less and less useful.

  • by zekele2 (1556449) on Monday September 06 2010, @01:36PM (#33489858)
    This is why I pay Google for handling my email. I use Google Apps Premier Edition with my own domain. At $50 per user per year, it's cheaper than paying for Office/Outlook, there's 25 GB of space per user, and NO advertising. Using my own domain means there is no lock-in: I can use IMAP and switch to another provider any time.
  • Re:MySQL (Score:2, Informative)

    by Anonymous Coward on Monday September 06 2010, @01:38PM (#33489882)

    I've been using hMailServer for a few years now, and it's free. It's an ISP solution, but has some great IMAP facilities like shared storage. I have over 120GB of IMAP data and it's not twitching. It has a MySQL Lite backend, and capabilities for web, pop, imap, AD integration, etc. Would recommend it.

  • by AndGodSed (968378) on Monday September 06 2010, @01:47PM (#33489956) Homepage Journal

    ++ the above, or Evolution - it also imports PST's and from there you can move it to Thunderbird for Windows. If you want uber searchability you could then upload the whole shebang to a gmail account that you sync offline via gears.

    I personally would balk at having all that stuff online with google but hey that would be the best searchable option I know. You can also sync with your Gmail account via imap protocol if gears and the web interface is not for you. Problem with that is that you will lose the great search capability with Gmail.

    Then again Thunderbird has some really cool search addons that might just take care of your needs altogether, plus it is platform agnostic - you can have it on BSD, Linux, Windows or Mac.

    HTH!

  • Re:mbox + grep (Score:2, Informative)

    by Anonymous Coward on Monday September 06 2010, @01:47PM (#33489958)

    I second that, and I also use mutt for this because it's so damn fast.

    1. cd Mail/old/
    2. grep -c pattern *
    3. mutt -f candidate-file
    4. use 'l' commands with patterns on mail fields, e.g. subject, from/to, body
    5. view limited message set in thread-sorted mode
    6. tag messages of interest
    7. save tagged messages to a small mbox, or attach them to a newly composed message and send

    it takes mutt about 3 seconds to load a 280 MB archive file with 16k messages on my machine, and less than a second to limit the display by recipient or about 3 seconds again to limit by keywords in the message body. I used to make an mbox per quarter year, but then I started merging them into one per year, as well as going back and purging some of the largest attachments. (Mutt also makes this easy: sorting by message size, then selectively saving and/or deleting attachments. There's usually 10-50 messages in a huge archive which are dramatically larger than the rest, and dealing with these makes the archive much more manageable.)
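
The "find the handful of huge messages" step can also be done outside mutt. A minimal sketch with Python's standard `mailbox` and `email` modules; the function name `largest_messages` is made up, and this only reports sizes, it does not strip attachments.

```python
import email
import mailbox


def largest_messages(mbox_path, top=10):
    """Return (size_in_bytes, subject) pairs for the biggest messages, largest first."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        sizes = []
        for key in box.iterkeys():
            raw = box.get_bytes(key)  # whole message, attachments included
            subject = email.message_from_bytes(raw).get("Subject") or ""
            sizes.append((len(raw), subject))
    finally:
        box.close()
    # Sort on size only, so equal-sized messages never compare subjects.
    sizes.sort(key=lambda pair: pair[0], reverse=True)
    return sizes[:top]
```

On a multi-hundred-megabyte archive this reads every message once, so it will be slower than mutt's size sort, but the output is easy to feed into a script.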

  • Re:IMAP (Score:3, Informative)

    by Graff (532189) on Monday September 06 2010, @01:54PM (#33490044)

    No way to archive Entourage data with Spotlight.

    Actually, there is:
    Using Spotlight to search Entourage. [mvps.org]

  • Re:IMAP (Score:3, Informative)

    by flyingfsck (986395) on Monday September 06 2010, @03:53PM (#33491312)
    Totally. All my email since about 1998 is in a Citadel mail server. It uses the BerkeleyDB, which can handle 256 Terabytes of mail. That should be enough for any semi-sane person...
  • by wkcole (644783) on Monday September 06 2010, @03:54PM (#33491326)

    As anyone who actually uses IMAP can tell you, it bogs down quickly on large mailboxes, violating the poster's requirement about b)

    Not true. Not absolutely false, either. IMAP is an access protocol, not a storage or indexing mechanism, and there is nothing inherent in IMAP that dooms it to be slow in handling large mailboxes. Different combinations of client and server, configurations, and mailbox content and usage can make huge differences in performance. Tens of thousands of messages in a single IMAP folder on a memory-lean server that uses Maildir storage on a UFS or ext2 filesystem with atimes enabled is going to suck horribly, especially with a client that doesn't cache heavily or maintain its own indices. Make that an mbox, and it will work great until you start trying to change it every couple of seconds.

  • Re:Domino (Score:3, Informative)

    by Belial6 (794905) on Monday September 06 2010, @06:02PM (#33492518)
    Seriously, what is wrong with or for that matter, the Notes client [photobucket.com] web client? [photobucket.com]

    I call you out troll. I also call you out on your made up problem of not knowing if something is read or not. Unread marks replicate between servers.
  • Re:Gmail? (Score:1, Informative)

    by Anonymous Coward on Tuesday September 07 2010, @04:49AM (#33496086)

    I used Google Email Uploader. It's perfect! It recreates Outlook subfolders as tags and brings over all contacts at the same time. It took ~2h for 4000+ emails split between Outlook and Thunderbird. At the end it gave a warning for only 2 email attachments (out of 4000+).
    Now I'm happy to have everything at the same place in the cloud.

  • Re:RETARD MODERATION (Score:1, Informative)

    by Anonymous Coward on Tuesday September 07 2010, @05:24AM (#33496204)

    You are wrong.

    POP3 is a transport protocol.
    IMAP is a transport protocol.

    You need to learn these things before you post.

  • Re:IMAP (Score:2, Informative)

    by AlterEager (1803124) on Tuesday September 07 2010, @09:07AM (#33497146)

    Afaik Cyrus does not support server-side mail searches

    Oh yes it does. Check out squatter.

  • Re:IMAP (Score:3, Informative)

    by vanyel (28049) * on Tuesday September 07 2010, @07:56PM (#33504124) Journal

    I am using dovecot and thunderbird, and have about 60 "live" folders, some with 10's of thousands of messages (a couple with 150+k messages). It is a constant battle with thunderbird, which often goes away for long periods of time, even when not doing anything one would expect to be dealing with the larger folders.

    I'm working on some scripts to archive messages into 30-90-180 day archive folders to keep the live folders down to a manageable size, but it would be nice to find something that already exists...
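
One possible reading of the 30-90-180 day idea is "file each message under the largest age threshold it has passed". A sketch of just that decision, assuming hypothetical folder names like `archive-30d`; actually moving the messages (over IMAP or inside a Maildir) is left out.

```python
from datetime import datetime, timezone


def archive_bucket(message_time, now, thresholds=(30, 90, 180)):
    """Return the archive folder for a message, or None to leave it live.

    A message lands in the folder for the largest threshold (in days)
    that its age has reached.
    """
    age_days = (now - message_time).days
    bucket = None
    for limit in sorted(thresholds):
        if age_days >= limit:
            bucket = f"archive-{limit}d"
    return bucket


# Example: a message from 100 days before `now` maps to "archive-90d".
now = datetime(2010, 9, 7, tzinfo=timezone.utc)
old = datetime(2010, 5, 30, tzinfo=timezone.utc)
folder = archive_bucket(old, now)
```

Running this on a schedule would gradually migrate messages from the live folders through the 30, 90 and 180 day archives, keeping the folders Thunderbird has to open small.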

  • Re:RETARD MODERATION (Score:3, Informative)

    by insertwackynamehere (891357) on Wednesday September 08 2010, @01:46AM (#33505654) Journal

    Do you know what IMAP is? He's going to need some central storage, but the mail access is platform independent. Yes, if he wants his IMAP server to be his own then he'll have to pick one OS to serve from, but every non-shit mail application has IMAP support, from desktop to mobile, and it hands down gives him what he wants if he takes the time to organize and set it all up.
