Best Way To Archive Emails For Later Searching?
An anonymous reader writes "I have kept every email I have ever sent or received since 1990, with the exception of junk mail (though I kept a lot of that as well). I have migrated my emails faithfully from Unix mail, to Eudora, to Outlook, to Thunderbird and Entourage, though I have left much of the older stuff in Outlook PST files. To make my life easier I would now like to merge all the emails back into a single searchable archive, just because I can. But there are a few problems: a) Moving them between email systems is SLOW; while the data is only a few GB, it is hundreds of thousands of emails and all of the email systems I have tried take forever to process the data. b) Some email systems (e.g. Outlook) become very sluggish when their database goes over a certain size. c) I don't want to leave them in a proprietary database, as within a few years the format becomes unsupported by the current generation of the software. d) I would like to be able to search the full text, keep the attachments, view HTML emails correctly and follow email chains. e) Because I use multiple operating systems, I would prefer platform independence. f) Since I hope to maintain and add emails for the foreseeable future, I would like to use some form of open standard. So, what would you recommend?"
IMAP (Score:5, Informative)
An IMAP server (dovecot, cyrus, courier) of your choice for Linux. If you don't have a Linux server you can always run it inside a small VM.
Not Much (Score:2, Informative)
It isn't particularly platform independent (because no one is paying much attention to Windows), but Not Much offers threads and full text search:
http://notmuchmail.org/ [notmuchmail.org]
Gmail? (Score:5, Informative)
While not open source, Gmail has a good search engine that isn't sluggish. Plus it has roughly 7.5 gigs of space to store data. Use IMAP to push all of your emails to the server and then use that Gmail account for archive email only.
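For the "push everything over IMAP" step, here is a minimal sketch in Python, assuming your old mail has already been exported to mbox files. The function name upload_mbox and the folder name "Archive" are made up for illustration; the connection is expected to be a logged-in imaplib connection. Treat it as a starting point, not a tested migration tool.

```python
import imaplib
import mailbox
import time
from datetime import timezone
from email.utils import parsedate_to_datetime


def upload_mbox(conn, mbox_path, folder="Archive"):
    """Append every message in an mbox file to an IMAP folder.

    `conn` is an already logged-in imaplib connection (hypothetical helper,
    not part of any mail client). Returns the number of messages uploaded.
    """
    count = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # Preserve the original date so the archive sorts correctly;
            # fall back to "now" when the Date header is missing or broken.
            try:
                when = parsedate_to_datetime(msg["Date"])
                if when.tzinfo is None:
                    when = when.replace(tzinfo=timezone.utc)
                internal = imaplib.Time2Internaldate(when)
            except (TypeError, ValueError):
                internal = imaplib.Time2Internaldate(time.time())
            conn.append(folder, None, internal, msg.as_bytes())
            count += 1
    finally:
        box.close()
    return count


# Usage sketch (account details are placeholders):
#   conn = imaplib.IMAP4_SSL("imap.gmail.com")
#   conn.login("you@example.com", "app-password")
#   upload_mbox(conn, "Mail/old/1996.mbox", "Archive")
#   conn.logout()
```

Uploading one message per APPEND is slow for hundreds of thousands of messages, so expect this to run for a long time; it is at least restartable per mbox file.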
Re:IMAP (Score:2, Informative)
Maildir (Score:5, Informative)
Maildir [wikipedia.org].
And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread
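If the old mail is sitting in mbox files, getting it into a Maildir can be sketched with Python's standard mailbox module. The function name mbox_to_maildir and the paths are illustrative assumptions; PST files would first need exporting to mbox by some other tool.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)
    dest = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for msg in src:
            # Converting to MaildirMessage maps mbox status flags
            # (read, replied, ...) onto their Maildir equivalents.
            dest.add(mailbox.MaildirMessage(msg))
            copied += 1
    finally:
        src.close()
        dest.close()
    return copied
```

Each message ends up as its own file, which is what lets an IMAP server or an indexer work on the archive without rewriting one huge file.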
Good IMAP Server (Score:5, Informative)
citadel (Score:4, Informative)
citadel at www.citadel.org is a full pop3/imap server with full-text indexing.
Thunderbird can use server-side searches to find messages, and I find that works pretty well.
Making things easier. (Score:1, Informative)
Re:Maildir (Score:4, Informative)
mairix is a useful addition to a maildir setup: http://www.rpcurnow.force9.co.uk/mairix/ [force9.co.uk]
Kmail for Outlook stuff and Search. (Score:2, Informative)
Re:Not Much (Score:4, Informative)
+1
Notmuch can manage absolutely insane amounts of email without any artificial 'archiving'. Of course, if you are looking for a program that does something other than tagging and searching (like sending, composing or receiving email), you need to look elsewhere.
Re:IMAP (Score:3, Informative)
Anyhow, look into Spotlight on OS X. Ooops, you said "open sourced," right? Damn. I don't know, then.
Re:Maildir (Score:2, Informative)
Maildir [wikipedia.org].
And if you have an e-mail client that doesn't support it, use an IMAP server to feed your client. /thread
With the proviso that you probably want to break up your archives in something akin to the following format:
. 2009
. . Q1
. . . Sent
. . . Received
. . Q2
. . . Sent
. . . Received
[...]
. 2010
. . Q2
. . . Sent
. . . Received
Lots and lots of messages in a directory can cause problems with many file systems. If you have more than ~8K or so messages in a folder, I'd recommend breaking it up. This is what I do at work (CY/Qx/Sent-Received), which also allows me to move entire quarters into PST files when I hit my quota while keeping a bunch of stuff online for access via web mail. Instead of quarters you can also do "halves", H1 (Jan-Jun) and H2 (Jul-Dec), with Sent/Received within each.
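The naming scheme above is easy to compute from a message date. A small sketch in Python; the function name archive_folder and the "/" separator are assumptions for illustration (IMAP servers differ in their hierarchy delimiter).

```python
from datetime import date


def archive_folder(sent_on, outgoing, per="quarter"):
    """Return a folder path such as '2009/Q1/Sent' for a message date."""
    if per == "quarter":
        period = "Q%d" % ((sent_on.month - 1) // 3 + 1)
    elif per == "half":
        period = "H1" if sent_on.month <= 6 else "H2"
    elif per == "year":
        period = None  # plain CalendarYear/Sent-Received layout
    else:
        raise ValueError("per must be 'quarter', 'half' or 'year'")
    leaf = "Sent" if outgoing else "Received"
    parts = [str(sent_on.year)]
    if period:
        parts.append(period)
    parts.append(leaf)
    return "/".join(parts)
```

For example, archive_folder(date(2009, 2, 14), outgoing=True) gives "2009/Q1/Sent".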
For my personal account, I generally haven't broken 4K messages per year yet, and so simply have (CalendarYear/Sent-Received). I currently have everything going back to about 1997 or so.
Except for sorting mailing list traffic, I don't use folders besides Sent & Received: new messages go into my Inbox or a list folder. If I want to keep the message, it always goes into Received regardless of where it originally arrived. I try to clear out the mailing list folders daily so there's no build-up (select-all, delete). If I want a message that I got, I know it's going to be in one of the Received folders, and so I don't have to go digging in a dozen different places trying to figure out some sort of classification system.
Work is Exchange/Outlook; personal is IMAP.
Echo chamber... (Score:5, Informative)
...has me doing a "me too!" to everyone telling you to use IMAP + maildir; I use dovecot myself, complete with self-signed SSL cert (curse you firefox!).
El_Muerte_TDS [slashdot.org] has just pointed me towards mairix [debian.org], a dedicated maildir + friends indexing system which I've just tried out, and seems to be ideal for my use - fast email search has always been a good thing for me, but I've rarely found a nice lightweight indexing solution that was catered only to mail; "desktop" search engines tend to take the opinion that if I want one thing indexed then I automatically want everything indexed, and also insist on running around the clock. Much nicer for my needs to just have one little lightweight indexing program that only runs when I want it to.
Best thing about mairix IMHO is the way it creates a virtual maildir on the fly using symlinks, so not only is it easily viewable on the command line, it's also automatically compatible with all of those IMAP + maildir clients out there... which, last time I looked, was all of them. Useful hack for KMail users here [netmanagers.com.ar].
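To show the symlink trick in isolation, here is a rough Python sketch. It is not mairix's code and has no index (it just scans headers); the function name build_virtual_maildir and the matches callback are invented for illustration. It assumes a filesystem that supports symlinks.

```python
import os
from email import policy
from email.parser import BytesParser


def build_virtual_maildir(source, dest, matches):
    """Symlink every message in Maildir `source` that satisfies
    matches(headers) into dest/cur. Returns the number of matching messages."""
    for sub in ("cur", "new", "tmp"):
        os.makedirs(os.path.join(dest, sub), exist_ok=True)
    found = 0
    for sub in ("cur", "new"):
        folder = os.path.join(source, sub)
        if not os.path.isdir(folder):
            continue
        for entry in os.scandir(folder):
            if not entry.is_file():
                continue
            with open(entry.path, "rb") as fh:
                headers = BytesParser(policy=policy.default).parse(
                    fh, headersonly=True)
            if not matches(headers):
                continue
            # Files under cur/ normally carry a ":2,<flags>" suffix.
            name = entry.name if ":2," in entry.name else entry.name + ":2,"
            target = os.path.join(dest, "cur", name)
            if not os.path.lexists(target):
                os.symlink(os.path.abspath(entry.path), target)
            found += 1
    return found
```

Because the result folder holds only links, it costs almost no disk space and can be thrown away and rebuilt for every search.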
Disclaimer: my IMAP server has all its databases on an SSD, so even full text searches from the client are pretty speedy (seriously - the lack of access times on small chunks of random data cuts down search times by at least an order of magnitude), but obviously mairix has the advantage of being able to scale to multiple users with >X GB mailboxes much easier than spending a fortune on fast storage.
Re:IMAP (Score:4, Informative)
Seconding this. I've been using Dovecot with Maildir on EXT3 for the last few years--my mailbox is about 25k messages, which I keep all in a single folder and use IMAP tags to organize into different virtual folders, much like Gmail's system but without the privacy concerns.
Dovecot's supplementary indexes make everything extremely fast (tags, dates, etc.), and anything it doesn't catch, Thunderbird does; I can search my entire mailbox for a single word in less than a second. I lose my Thunderbird indexes whenever I move to a new computer, but that's just a matter of leaving the client up for a few hours.
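The tag-based "virtual folder" idea can be sketched with Python's imaplib. This assumes the server allows custom keywords in the selected folder (many do, but check yours) and that conn is a logged-in connection with a folder already selected; the function names are made up for illustration.

```python
def tag_messages(conn, criteria, keyword):
    """Add an IMAP keyword to every message in the selected folder that
    matches `criteria` (a list of SEARCH terms). Returns how many were tagged."""
    typ, data = conn.search(None, *criteria)
    if typ != "OK":
        raise RuntimeError("SEARCH failed: %r" % (data,))
    ids = data[0].split()
    for num in ids:
        conn.store(num, "+FLAGS", "(%s)" % keyword)
    return len(ids)


def messages_with_tag(conn, keyword):
    """Return the message numbers carrying `keyword`: a virtual folder."""
    typ, data = conn.search(None, "KEYWORD", keyword)
    if typ != "OK":
        raise RuntimeError("SEARCH failed: %r" % (data,))
    return data[0].split()
```

Since the keyword lives on the server, every client that shows IMAP keywords sees the same grouping without moving any message.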
Re:Maildir (Score:3, Informative)
mairix is good, but it has some warts and it is not under development anymore. Among other things, it can run out of memory, has problems with parsing certain multipart messages, and can't search for an IP address (or any other string with dot-separated tokens.)
It's about the best I've found, but I wish someone would pick up development and fix some of the issues. As time goes on, bit-rot is going to set in and mairix will get less and less useful.
Re:An Advertiser's Fantasy ... (Score:2, Informative)
Re:MySQL (Score:2, Informative)
I've been using hMailServer for a few years now, and it's free. It's an ISP solution, but has some great IMAP facilities like shared storage. I have over 120GB of IMAP data and it's not twitching. It has a MySQL Lite backend, and capabilities for web, pop, imap, AD integration, etc etc. Would recommend it.
Re:Kmail for Outlook stuff and Search. (Score:3, Informative)
++ the above, or Evolution - it also imports PSTs and from there you can move it to Thunderbird for Windows. If you want uber searchability you could then upload the whole shebang to a gmail account that you sync offline via gears.
I personally would balk at having all that stuff online with google but hey that would be the best searchable option I know. You can also sync with your Gmail account via the imap protocol if gears and the web interface are not for you. Problem with that is that you will lose the great search capability of Gmail.
Then again Thunderbird has some really cool search addons that might just take care of your needs altogether, plus it is platform agnostic - you can have it on BSD, Linux, Windows or Mac.
HTH!
Re:mbox + grep (Score:2, Informative)
I second that, and I also use mutt for this because it's so damn fast.
1. cd Mail/old/
2. grep -c pattern *
3. mutt -f candidate-file
4. use 'l' commands with patterns on mail fields, e.g. subject, from/to, body
5. view limited message set in thread-sorted mode
6. tag messages of interest
7. save tagged messages to a small mbox, or attach them to a newly composed message and send
It takes mutt about 3 seconds to load a 280 MB archive file with 16k messages on my machine, and less than a second to limit the display by recipient or about 3 seconds again to limit by keywords in the message body. I used to make an mbox per quarter year, but then I started merging them into one per year, as well as going back and purging some of the largest attachments. (Mutt also makes this easy: sorting by message size, then selectively saving and/or deleting attachments. There's usually 10-50 messages in a huge archive which are dramatically larger than the rest, and dealing with these makes the archive much more manageable.)
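For anyone who would rather script the same workflow, here is a rough Python equivalent using the standard mailbox module. It is not what mutt does internally; the function names (limit, largest, save_messages) are invented for illustration, and limit only looks at headers, not bodies.

```python
import mailbox
import re


def limit(mbox_path, pattern, headers=("Subject", "From", "To")):
    """Return the messages whose listed headers match a regex,
    roughly like limiting the index view in mutt."""
    rx = re.compile(pattern, re.IGNORECASE)
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg for msg in box
                if any(rx.search(str(msg.get(h, ""))) for h in headers)]
    finally:
        box.close()


def largest(mbox_path, n=10):
    """Return (size_in_bytes, subject) for the n biggest messages."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        sized = [(len(box.get_bytes(key)), str(box[key].get("Subject", "")))
                 for key in box.keys()]
    finally:
        box.close()
    return sorted(sized, key=lambda item: item[0], reverse=True)[:n]


def save_messages(messages, out_path):
    """Write the selected messages to a new, smaller mbox file."""
    out = mailbox.mbox(out_path, create=True)
    try:
        for msg in messages:
            out.add(msg)
        out.flush()
    finally:
        out.close()
```

Note that this parses every message on each call, so on a 280 MB archive it will likely be slower than mutt.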
Re:IMAP (Score:3, Informative)
No way to archive Entourage data with Spotlight.
Actually, there is:
Using Spotlight to search Entourage. [mvps.org]
Re:IMAP (Score:3, Informative)
Re:No, not IMAP, it bogs down (Score:3, Informative)
As anyone who actually uses IMAP can tell you, it bogs down quickly on large mailboxes, violating the poster's requirement about b)
Not true. Not absolutely false, either. IMAP is an access protocol, not a storage or indexing mechanism, and there is nothing inherent in IMAP that dooms it to be slow in handling large mailboxes. Different combinations of client and server, configurations, and mailbox content and usage can make huge differences in performance. Tens of thousands of messages in a single IMAP folder on a memory-lean server that uses Maildir storage on a UFS or ext2 filesystem with atimes enabled is going to suck horribly, especially with a client that doesn't cache heavily or maintain its own indices. Make that an mbox, and it will work great until you start trying to change it every couple of seconds.
Re:Domino (Score:3, Informative)
I call you out troll. I also call you out on your made up problem of not knowing if something is read or not. Unread marks replicate between servers.
Re:Gmail? (Score:1, Informative)
I used Google Email Uploader. It's perfect! It creates tags in place of Outlook subfolders and brings over all contacts at the same time. It took about 2 hours for 4000+ emails split between Outlook and Thunderbird. At the end it gave a warning for only 2 email attachments (out of 4000+).
Now I'm happy to have everything at the same place in the cloud.
Re:RETARD MODERATION (Score:1, Informative)
You are wrong.
POP3 is a transport protocol.
IMAP is a transport protocol.
You need to learn these things before you post.
Re:IMAP (Score:2, Informative)
Oh yes it does. Check out squatter.
Re:IMAP (Score:3, Informative)
I am using dovecot and thunderbird, and have about 60 "live" folders, some with tens of thousands of messages (a couple with 150k+ messages). It is a constant battle with thunderbird, which often goes away for long periods of time, even when not doing anything one would expect to involve the larger folders.
I'm working on some scripts to archive messages into 30-90-180 day archive folders to keep the live folders down to a manageable size, but it would be nice to find something that already exists...
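The core of such a script is deciding which archive folder a message belongs in. A minimal sketch in Python, under one guessed reading of the scheme (a message moves to the oldest bucket whose age threshold it has passed); the folder names and the function name archive_bucket are made up.

```python
from datetime import datetime, timezone

BUCKETS = (30, 90, 180)  # age thresholds in days (assumed meaning)


def archive_bucket(msg_date, now, buckets=BUCKETS):
    """Return None if the message stays in the live folder, otherwise the
    name of the archive folder for its age, e.g. 'Archive-90d'."""
    age_days = (now - msg_date).days
    chosen = None
    for days in sorted(buckets):
        if age_days >= days:
            chosen = days
    return None if chosen is None else "Archive-%dd" % chosen
```

The actual move would still have to be done over IMAP or directly on the Maildir, with Thunderbird closed or at least told to resync afterwards.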
Re:RETARD MODERATION (Score:3, Informative)
Do you know what IMAP is? He's going to need some central storage, but the mail access is platform independent. Yeah, if he wants the IMAP server to be his own then he'll have to pick one OS to serve from, but every non-shit mail application from desktop to mobile has IMAP support, and it hands down gives him what he wants if he takes the time to organize and set it all up.