Data Storage

Ask Slashdot: Open Source For Bill and Document Management? 187

Posted by timothy
from the seasonally-appropriate dept.
Rinisari writes "Since striking out on my own nearly a decade ago, I've been collecting bills and important documents in a briefcase and small filing box. Since buying a house more than a year ago, the amount of paper that I receive and need to keep has increased to deluge amounts and is overflowing what space I want to dedicate. I would like to scan everything, and only retain the papers for things that don't require the original copies. I'd archive the scans in my heavily backed up NAS. What free and/or open source software is out there that can handle this task of document management? Being able to scan to PDF and associate a date and series of labels to a document would be great, as well as some other metadata such as bill amount. My target OS is OS X, but Linux and Windows would be OK."
This discussion has been archived. No new comments can be posted.

  • by roman_mir (125474) on Sunday April 07, 2013 @02:45PM (#43385563) Homepage Journal

    Send them to a dedicated Gmail account. You'll be able to find all of your documents (you can label them, whatever), and they provide an online office suite of some sort. And if you forget what you have there, you can always just go to Google search and push the "I'm Feeling Lucky" button.

    • by Anonymous Coward on Sunday April 07, 2013 @02:56PM (#43385609)

      Providing quick and easy access to the government (and who knows who else) to all of your important documents.

      • Re: (Score:3, Interesting)

        by roman_mir (125474)

        Absolutely, no question about it. Some documents are not that important, but the important ones shouldn't go there.

      • by DeBaas (470886)

        to overcome this we worked for a while on Tagnlock [github.com]

        It is supposed to do the following:
        - allow you to easily label/tag documents
        - GUI way of defining tags, whether they are free-form or not, compulsory etc.
        - encrypt and then store via method of choice (email included)

        I am afraid that it is currently abandonware...

        You can also:
        Find some way to tag and use duplicity [nongnu.org] to make encrypted incremental backups to a cloud service. That's what I do now. I simply use duplicity to duplicate (and encrypt) to the same drive and

    • by Rinisari (521266)

      I'm concerned with privacy of backing up to Gmail, even if its labeling is completely what I'm looking for. I suppose I could encrypt everything I send and base its subject on something I can read and label, but that's a lot of rigmarole for something that I really would rather keep locally or on my own backed-up network.

    • Re: (Score:3, Insightful)

      by fustakrakich (1673220)

      Google is pretty fickle with its applications. We'll never know how long gmail will remain online, until they decide to shut it down.

      Oh, like the other replies said, 'privacy'... You will have none if it is online in any form.

      • by Genda (560240)

        Not necessarily so... Google (or any cloud storage resource) is an awesome place to store encrypted and compressed documents. You just want to make certain that you back everything off every once in a while so if Google (or other resource) decides to pull the plug, you won't find yourself trying to slurp 5 GB of data down in a week through a limited resource being crushed by a hundred million other users doing the same.

  • by mkro (644055) on Sunday April 07, 2013 @02:47PM (#43385569)
    I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.
    • by AvitarX (172628)

      Hasn't kde finally gotten their shot together for functioning tags?

      • They have. If you wanted to do something like this and make it semantic and awesome you could create a new Nepomuk Ontology for your document and have some associated specific metadata - then you can semantically link it with people, bank accounts, etc, give it tags and more. Search and relationships are enumerated with nepomuk-indexer. On top of that you can either use a file manager like Dolphin and just use it for search, or build a little interface over the top to perform custom queries on the data.
    • by tomtomtom (580791) on Sunday April 07, 2013 @07:11PM (#43387145)

      I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.

      gscan2pdf is OK, but if you want to do this seriously then you're probably going to want a reasonably fast sheet-fed scanner (I got a Fujitsu ScanSnap S1500, which is supported by SANE and can scan at 18-20 pages/36-40 sides per minute) with a button so that you can go through a whole stack of paper quickly with minimal keyboard/mouse interaction to slow you down. This led me to setting up scanbuttond (which just gained official support for the ScanSnap but there was a patch floating around somewhere for a while before that) with a custom script.

      Make sure you OCR your documents to make them searchable then run an indexer (I like recoll [recoll.org] but KDE and GNOME both have their own desktop search solutions as well). I've found the best OCR engine on Linux seems to be tesseract [google.com], but there are a couple of others you can try. The process took me a while to get right and is a bit painful - the script which scanbuttond runs runs scanadf to scan to a string of image files per side and puts them in a processing directory. I then have another batch-processing script I run once I'm done with a pile of papers while I go and get a cup of tea which runs unpaper then tesseract on them, then hocr2pdf to convert each page individually into a searchable PDF file then finally pdftk to concatenate all the pages together into a scanned document. I split the two parts of the process out because the OCR bit can take some time and this way I can get maximum throughput on the scanner itself without needing to wait for the rest to catch up. If I could be bothered then I could make the scanning script run my de-batching script once only and have it pick up new files as they are dropped in the directory but it's not that much of an effort really.
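The "pick up new files as they are dropped in the directory" idea at the end of the paragraph above could be sketched roughly as follows. This is only a polling loop under stated assumptions, not the poster's actual script: the function names `new_files` and `watch`, the 5-second interval, and the `handle` callback (which would stand in for the unpaper/tesseract/hocr2pdf steps) are all hypothetical.

```python
import os
import time


def new_files(directory, seen):
    """Return paths in `directory` whose names are not yet in `seen`.

    Names are returned in sorted order and added to `seen`, so each file
    is reported exactly once across repeated calls.
    """
    fresh = sorted(name for name in os.listdir(directory) if name not in seen)
    seen.update(fresh)
    return [os.path.join(directory, name) for name in fresh]


def watch(directory, handle, interval=5.0, rounds=None):
    """Poll `directory` and call handle(path) once for every new file.

    rounds=None polls forever; an integer stops after that many polls,
    which is mainly useful for testing.
    """
    seen = set()
    done = 0
    while rounds is None or done < rounds:
        for path in new_files(directory, seen):
            handle(path)  # hypothetical hook for the OCR batch step
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

One caveat with plain polling: a file may still be mid-write when it is first seen, so a real script would probably want the scanner side to write to a temporary name and rename the file into the processing directory once it is complete.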

      I then sort my PDFs into a hierarchical directory structure once they've been OCRd (and at this point they get indexed as well for searching).

      If you're on Windows/Mac then the software that comes with the ScanSnap will pretty much do all this for you; although it's better to scan with OCR disabled then use Acrobat to batch-OCR the PDFs later for the same reason. Add a decent desktop search solution like an old version of Copernic (or possibly Windows Search) and all is good.

      • Hey, man, upload that script somewhere and somebody might add the watching logic to it!

        How do you keep your ScanSnap from feeding multiple pages simultaneously? Mine does it frequently, so it mostly collects dust.

    • gscan2pdf can do OCR and embed the results as annotations within the PDF. Perhaps that would help with searchability. It is far from perfect, but it works well enough with a lot of my documents, especially bills, as they are not handwritten. Best results are on scans set to line art/b&w rather than greyscale or colour.

  • OpenKM (Score:3, Informative)

    by Anonymous Coward on Sunday April 07, 2013 @02:50PM (#43385585)

    OpenKM (http://www.openkm.com/en/) is what I use to manage my documents, its tagging and document preview features are what I appreciate most. It runs as a web-service, FYI.

  • muddle headed post (Score:2, Interesting)

    by Anonymous Coward

    by definition, "important" = keep original (I mean seriously, are u that short of basement space ??)
    Electronics are ephemeral; you can, today, read stuff on papyrus, as long as you know the language. Do you really want to trust stuff that is important to ephemeral electronics?
    (I mean, how many times has /. gone over this - is this the editors' idea of a yearly question?)

    tagging is an inherently stupid idea; it may be the best that you can do with current technology, but a Google-like full text search is much

    • Basements aren't as common as you think they are. I've always lived in Southern California, and I've never lived in a house with a basement. At most, there's been a crawl space under the house, but that's not exactly a good place to store things. And, I suspect there's a typo in TFS that the editors didn't catch: it says, "I would like to scan everything, and only retain the papers for things that don't require the original copies." and I think it should read, "...that do require..." because as written,
      • by Rinisari (521266)

        You are correct. I meant to keep only the things I need originals of: birth certificate, car titles, etc.

        As for physical space, I have better things than documents to store in my available basement space: wine, beer, computers, etc.

        • I don't even have the originals of my birth certificate, discharge papers or DD 214, and haven't in decades. However, my father registered my birth certificate at the Hall of Records, and I did the same with my discharge papers and DD 214 after I got out of the Navy so I don't have to worry. In fact, in Los Angeles, where they're registered, any veteran can get two copies of his service papers for free, any time they're needed, so why keep the originals? And, once when I was down there to request copies,
    • Electronics are ephemeral; you can, today, read stuff on papyrus, as long as you know the language. Do you really want to trust stuff that is important to ephemeral electronics?

      This is just completely backwards. Electronic documents are the least likely to get lost or destroyed. I have no receipts or papers from 25 years ago. But I have all my email from those days. With e-docs, you can make multiple copies, store copies off-site, etc. Every email I have ever sent, every non-spam email I have ever received, all the source code I have ever written, over 10,000 family photos, copies of my marriage license, deeds, insurance forms, etc. etc. will ALL fit on a single XD card small

      • I think the parent poster was referring to a theoretical future digital dark age. Societal collapse, EMP, etc. Honestly though, if any of the aforementioned were to occur, you have bigger problems to worry about. So ya, I wouldn't sweat it.

    • by Genda (560240)

      M-Disk now finally allows you to make good archival high-density storage (DVD). Combine this with a good document management tool (like DocMoto on Mac) and you can pretty much be assured of managing all your paper-to-electronic needs elegantly. Additionally, a lot of HEAVY DUTY Document Management Applications (e.g. Docfinity) have sophisticated Business Process Management tools included to control process flow for those documents. One cool feature for these tools is the ability to parse metatags from files n

      • One small detail to add... M-Disk is the best there is *if* you need or care about DVD-ROM compatibility, but for roughly the same price per disc, you can get non-LTH BD-R discs with roughly 5-6x the capacity. M-Disk is basically a non-LTH BD-R disc with the track geometry of a DVD-ROM. Either way, non-LTH BD-R and M-Disk are the way to go if you want long-term passive archivability (ie, the ability to write a disc, throw it in a box, forget about it for 25 years, and still be able to read it. While there a

    • Tags work just fine. Organizing files in folders is just one-dimensional tagging. Lots of people in lots of different areas use tags (Firefox bookmarks, Gmail labels, pictures/music). Tagging can also be automated (You can think of google search just as a really complex, automatic, learning tagging system). Of course, by all means keep the originals for important documents, but I'll not stand someone bad-mouthing tagging as a concept.
    • Most of what was written on papyrus was lost. Anybody interested in BCE history is going to have a list of documents they wish had been preserved. Many ancient documents were preserved by being copied and re-copied, which works just as well for electronic documents.

      Paper documents (coming up to the present) take volume, have weight, are hard to copy in bulk, and can be destroyed by many different things (fire, flood, rodents, as examples). Electronic documents are trivial to copy in large quantities a

  • This again? (Score:5, Funny)

    by turkeyfeathers (843622) on Sunday April 07, 2013 @02:53PM (#43385599)
    Similar questions to yours appear here regularly. The consensus is that it's best just to throw the bills and documents out and spend more time watching porn.
    • by AmiMoJo (196126) *

      In all seriousness most of them can just be thrown in a box. They are automatically sorted by date and chances are 99% of them will never be looked at again. That one time you need one you can just sift back through the pile, and you still saved a huge amount of time and money compared to having a proper electronic document system.

  • Try Alfresco (Score:2, Interesting)

    by Anonymous Coward

    You can try Alfresco DMS.
    It requires a web server, so it might be too much for a single user.

  • by Idimmu Xul (204345) on Sunday April 07, 2013 @02:56PM (#43385615) Homepage Journal

    http://www.icyblaze.com/idocument/ [icyblaze.com]

    iDocument for the Mac is like iTunes but for documents. It lets you import documents (pretty much any type), tag them, and store them in virtual or real folders. It sounds like it's exactly what you're after.

    • by Rinisari (521266)

      Thanks for this. It's pretty damned close to what I want, the sole exception being that it's not open source and not cross platform. I might go in on it anyway if I can't find something better.

  • My Workflow (Score:5, Interesting)

    by Orphaze (243436) on Sunday April 07, 2013 @02:57PM (#43385625) Homepage

    1) Receive document.
    2) Scan with Fujitsu ScanSnap S1500 in about 10 seconds. $380 on sale, but so far it has been worth so much more than cheap all-in-one scanners that it's not even funny. Seriously, don't even bother going paperless unless you get a real document scanner.
    3) Save PDF to simple software RAID-1 mirror of two 2TB drives. (Takes about 5 seconds to set up from Disk Management in Windows.) This should protect against sudden drive failure taking everything.
    4) Backup nightly to external drive swapped off-site every other month. This should protect from accidental deletions, fires, etc. Bonus points if backup drive is ioSafe fire proof variety.
    5) Throw away original. Only exception is official documents like titles, marriage certificate, etc.. Yes, I even throw away W2s and the like. My taxes are 100 percent digital nowadays.
    6) Check and test restore from those backups on a semi-regular basis, and you're done!
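For step 6, one way to check a restored backup is to compare checksums of the live files against the restored copy. The sketch below is an illustration only, using SHA-256 from the Python standard library; the function names `sha256_of`, `manifest` and `compare` are made up for this example and are not part of any backup tool mentioned in the thread.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    result = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            result[os.path.relpath(full, root)] = sha256_of(full)
    return result


def compare(source_root, backup_root):
    """Return (missing, changed) relative paths.

    missing: files present in the source but absent from the backup.
    changed: files present in both whose contents differ.
    """
    src, bak = manifest(source_root), manifest(backup_root)
    missing = sorted(p for p in src if p not in bak)
    changed = sorted(p for p in src if p in bak and src[p] != bak[p])
    return missing, changed
```

Running `compare` against a test restore from the off-site drive would flag both files that never made it into the backup and files that no longer match. Files changed since the last backup will also show up as "changed", so the result still needs a human look.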

    • I liked it up until you have Windows managing a RAID. Get a RAID NAS running Linux. It seems odd to RAID up a couple of drives just to let Windows mess them up. I suggest a Synology DS212. If you are really serious, build a ZFS rig with snapshots.
    • by rastos1 (601318)

      I've been thinking about the same problem as the TFA recently. There is more to the task than simply scan and store.

      What about assigning tags to the documents? Fast previews? Searching based on time/time range/tags/fulltext? Grouping related documents? Annotations? OCR?

      I'm considering writing my own solution, but if there is something useful out there, I'd like to have a look.

      • by rastos1 (601318)
        One more thing I forgot: electronic signatures.
        • by sribe (304414)

          One more thing I forgot: electronic signatures.

          What about them? For scanning and archiving, they're irrelevant.

          • by rastos1 (601318)
            First, I want to be able to verify years later that the scan was not modified. Second: there are countries that do recognize electronically signed documents as legal documents (if signed with a certificate issued by a state-run CA). I did not actually check with a lawyer whether this fulfills the requirements, but ... why not have the option?
            • by sribe (304414)

              First, I want to be able to verify years later that the scan was not modified. Second: there are countries that do recognize electronically signed documents as legal documents (if signed with a certificate issued by a state-run CA). I did not actually check with a lawyer whether this fulfills the requirements, but ... why not have the option?

              For your own verification, OK. But no, no state-run authority is going to give any weight whatsoever to an image from your own archive that you signed yourself.

              • by rastos1 (601318)
                I can get a certificate from a state-run CA. That means I can authenticate myself to a state-run service. Why not have a state-run service that produces a signature for a document (or encrypted document or document digest) that I upload? That would be a great service. It would even take some workload off the notaries (which would certainly make them very "happy") ... I should patent that.
                • by sribe (304414)

                  Why not have a state-run service that produces a signature for a document (or encrypted document or document digest) that I upload?

                  Because it says absolutely nothing about the authenticity of the document which you provide.

                  You're talking about scanned documents--documents from other sources which you allegedly scan, allegedly without modifying them, before signing them. The only authentication anybody else would be interested in would be authentication by the document producer, not by you, because you could perform any amount of modification/forgery before signing the document.

                  This is all very different from documents which you produce

        • by Rinisari (521266)

          That's actually a good feature I'd not considered. As a document is added to the system, sign it using PGP and store the signature. That way, I have reasonable certainty that the document has not been modified since initial ingestion, or at least a warning that it may have been compromised if the signature doesn't check out.

          • by Bert64 (520050)

            If you store the signature in the same place, then anyone in a position to modify the document can simply generate a new signature too.

            • by tlhIngan (30335)

              If you store the signature in the same place, then anyone in a position to modify the document can simply generate a new signature too.

              Not without the key they can't.

              You're confusing a signature with a hash. The latter can be easily regenerated. However, a signature cannot as it requires signing the hash. Change the hash, and the signature cannot be verified anymore.

              It's why people are interested in breaking the hash algorithms - break SHA2-224/256/384/512 and anything signed with those algorithms is vulner
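The hash-versus-signature distinction above can be illustrated with a small sketch. Note the substitution: this uses HMAC-SHA256 from the Python standard library as a stand-in for a real PGP signature, purely because it needs no third-party packages. HMAC is symmetric (the same secret both creates and checks the tag), whereas a PGP signature is checked with a public key, so this only demonstrates the "you need the key to regenerate it" point. The function names are hypothetical.

```python
import hashlib
import hmac


def plain_hash(data):
    """Unkeyed SHA-256 digest: anyone who can edit the file can recompute it."""
    return hashlib.sha256(data).hexdigest()


def keyed_tag(key, data):
    """HMAC-SHA256 tag: recomputing it requires knowing `key`."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(key, data, tag):
    """Check a tag in constant time against a freshly computed one."""
    return hmac.compare_digest(keyed_tag(key, data), tag)
```

If only `plain_hash` output is stored next to the document, someone who modifies the scan can simply store a fresh digest alongside it. With a keyed tag (or a real signature), a modified scan fails `verify` unless the attacker also has the key, which is the reason to keep the key somewhere other than the archive itself.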

      • by Rinisari (521266)

        OP here.

        These features you list are examples of what I desire in a package that manages documents. I'm not as concerned with OCR, but that'd be a nice feature to have for the lengthier letters and such.

  • You don't need a CMS (Score:5, Interesting)

    by Anonymous Coward on Sunday April 07, 2013 @02:58PM (#43385629)

    So, I've been doing this pretty consistently for the past few years and sent this advice to some relatives asking basically the same question. (That's also why it's a little dumbed down.)

    I haven't found a case where any sort of CMS makes more sense than the file system. This is after doing this for about 10 years, and I've got records going back to '01.

    I'm using a Fujitsu ScanSnap and a Fellowes Powershred, and running Mac OS X. OS X has decent indexing, a good file system manager (really can't beat column view) and the Preview app will let you reassemble PDFs, which is occasionally very handy.

    1. The enemy is copies. I strongly recommend "scan and shred", or you'll wind up scanning the same thing over and over.

    1.1. Don't bother with any scanner that doesn't do double-sided scans.

    1.2. Use a shredder. You can take things out of a trash can.

    1.3. The scanner should come with OCR software. Choose "Searchable PDFs".

    2. Do scanning in small batches.

    2.1. Create two folders, "Scanned" and "Unfiled".

    2.2. The scanned files go immediately into Scanned, and the paper immediately goes into the shredder.

    2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.

    3. If it takes any work to scan it, just shove it in a filing cabinet, or, better yet, just shred it.

    3.1. If you're having to use a flatbed, it's too complicated to scan and you should file or shred it.

    3.2. You can often get manuals and pamphlets and stuff online by googling part of the text or the product name.

    4. Don't scan anything you can get electronically.

    4.1. Most companies would much rather let you download bills and statements and such.

    4.2. Most of them will also delete those statements after a few months, so get in the habit of immediately downloading the statement.

    5. It's *very* helpful to put a date on everything. I generally do YYMMDD, trying to guess from dates I find in the document.

    5.1. If it's a document covering a period of time like a bill for the month of November, I use the ending date.

    5.2. For tax documents I'll put TT-YYMMDD, where TT is the tax year, since the actual transactions occur that year, but filing and IRS stuff happens the year after.

    6. I've found that even with full text search, you still need folders.

    6.1. They just don't need to be extremely complicated; usually two levels seems to be fine. I'll put prior years into separate folders, too.

    6.2. Your system will evolve as you work; just get it in there, and then be mindful of what you are commonly looking for.

    6.3. Keep books and reference manuals in a folder that doesn't get indexed. (Spotlight has an option for this.) They tend to create a lot of spurious hits.

    7. Keep your inbox clean: if an email wants you to download a statement, get it right away and put it in Unfiled.

    7.1. Likewise, keep your desktop clean, scan and shred stuff as soon as it comes in.

    7.2. Have a periodic to-do item to tidy your files, don't spend more than half an hour (tops!) at any given time.

    • by sribe (304414)

      2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.

      god, no! Give it a sensible name and put it where it belongs to begin with; don't deal with the same document multiple times.

    • by Rinisari (521266)

      Thanks for this. This is definitely a workflow I need to model.

    • by overlordofmu (1422163) <overlordofmu@gmail.com> on Sunday April 07, 2013 @04:17PM (#43386103)
      Disclaimer: I know this will seem pedantic but I am trying to get people to think about problems in the long term (solutions that work for thousands of years, not hundreds).

      If we use the format YYYY-MM-DD for dates (for instance 2013-04-07), they sort both alphabetically and numerically, they are easy for human eyes/minds to parse at a glance (my apologies to the vision impaired) and there won't be a reason to change the format for approximately 7,895 years (but who is counting, really).

      Please see ISO 8601: http://en.wikipedia.org/wiki/ISO_8601 [wikipedia.org]

      Obligatory XKCD: http://xkcd.com/1179/ [xkcd.com]
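The sorting claim above is easy to check. The short sketch below compares a plain string sort against a true chronological sort for ISO-style names and for US-style MM/DD/YYYY names; the sample dates and the helper names `parse_us` and `chronological` are made up for the illustration.

```python
from datetime import date

iso_names = ["2013-04-07", "2012-12-31", "2013-01-15"]
us_names = ["04/07/2013", "12/31/2012", "01/15/2013"]


def parse_us(text):
    """Parse a MM/DD/YYYY string into a date."""
    month, day, year = (int(part) for part in text.split("/"))
    return date(year, month, day)


def chronological(names, parse):
    """Sort names by the calendar date each one represents."""
    return sorted(names, key=parse)


# A plain string sort of ISO dates is already chronological...
iso_matches = sorted(iso_names) == chronological(iso_names, date.fromisoformat)
# ...while a plain string sort of MM/DD/YYYY dates is not.
us_matches = sorted(us_names) == chronological(us_names, parse_us)
```

Here `iso_matches` comes out true and `us_matches` false, which is why a file manager's ordinary name sort keeps YYYY-MM-DD files in date order without any extra tooling.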
      • by melikamp (631205)

        Yes, yes, yes. The submitter sounds like he wants to digitize a bunch of files, so I would recommend a good file system. Any stable filesystem will do, like ext4 for instance.

        Avoid metadata within a file for as long as possible. It will bury you. If date and bill amount are all you need, then just stick them into the file name.

        YYYY-MM-DD.amount.unit.short-description.pdf

        2013-04-07.-3975.us-cents.how-much-this-advice-will-cost-you.pdf

        Now you can pile your files into, say, ~/my-files/ in any way whatev
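A file name in the YYYY-MM-DD.amount.unit.short-description.pdf pattern shown above can be read back mechanically, which is what makes "metadata in the file name" workable. The parser below is a sketch only; the `BillName` type and `parse_bill_name` function are hypothetical names, and it assumes the description itself contains no further structure that matters.

```python
from datetime import date
from typing import NamedTuple


class BillName(NamedTuple):
    day: date
    amount: int
    unit: str
    description: str


def parse_bill_name(filename):
    """Split 'YYYY-MM-DD.amount.unit.description.pdf' into its fields.

    Raises ValueError if the name does not follow the pattern.
    """
    stem, extension = filename.rsplit(".", 1)
    if extension.lower() != "pdf":
        raise ValueError(f"not a PDF name: {filename!r}")
    # Split at most three times so dots inside the description survive.
    day_text, amount_text, unit, description = stem.split(".", 3)
    return BillName(date.fromisoformat(day_text), int(amount_text), unit, description)
```

With names parsed this way, a question like "how much did I spend in us-cents this year" becomes a loop over a directory listing rather than a database query.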

      • >there won't be a reason to change to format for approximately 7,895 years (but who is counting, really).

        I'm kicking myself for not having caught this earlier!

        Thanks (no sarc) for alerting us to the Y10E4 problem.

        Question: Is Linux Y10E4 ready?

    • by Anonymous Coward on Sunday April 07, 2013 @05:01PM (#43386357)

      4.1. Most companies would much rather let you download bills and statements and such.

      And this is exactly why I HATE all of the "e-bill" solutions that every company has dreamed up at the moment.

      They turn the problem from "the company remembers to SEND you a bill/invoice/paper" to "you have to go get the bill/invoice/paper FROM the company".

      With paper bills/invoices/etc. sent through the US mail, they "remember" to do something, and I get an automatic reminder when the envelope appears in my mailbox.

      With the e-bill solution, the most I get is an email reminding me to go log in and download the bill/invoice/paper. Now, notice what is wrong here. They just sent me a communication (hint, it's the reminder email) that could have functioned identically to the USMail envelope of carrying the bill/invoice/paper along with it right to my inbox, so when I receive the email, I ALSO receive the bill/invoice/paper itself (i.e., attach the bill/invoice/paper as a .pdf to the email).

      Now, most companies will balk at that because "email is not secure" or "email is not private". Well, why don't you let me F****** upload a gpg public key to your system, and then your system could encrypt my bill/invoice/paper using my gpg public key, then attach it to the "reminder" email, and now we have an electronic system that functions identically to the old paper bill in the old paper envelope sent through the postal office.

      They remember it is time to send me my bill, they create the .pdf (electronic equivalent to printing the bill on paper), they encrypt the pdf (electronic equivalent to sealing the bill in a mailing envelope), and they email me the item (electronic equivalent of giving the sealed envelope to the postal service).

      But does any company implement this system? No, not one.

      And so they will continue to mail me paper, and can continue to hound me to switch to "e-bills" all they like. But until their e-bills are done properly (as above) they won't get any buy in here.

      • by reboot246 (623534)
        Amen, brother! I prefer to get my bills in the mail. Real, honest to goodness paper.

        On the outside of the envelope I write the date I received the bill, how much it is, and when it is due. It goes into my mail sorter in the first bin. When I pay it online I write the confirmation number from the bank on the envelope and then put the envelope in the second bin. I keep the envelope until the next bill from that company comes in. That way I can see if the previous bill was actually paid on time or if there wer
      • by LoRdTAW (99712)

        I don't understand the mindset either. To me Email is much more secure than snail mail. With email someone needs my password to gain access to my mail. With snail mail, all someone needs to do is open my mail box and walk away with my mail. You can make it a bit more secure by using a mail slot or lock box but a majority of people have the classic unsecure mailbox that is out in the open.

  • My suggestion would be to just scan and OCR your files, and then store them in your file system.
    Hierarchy might be something like: ~/scans/year/project/sorted

    Within each sorted subdir, you'd have three folders. Date, organizationThatGeneratedTheDoc and TypeOfDoc.
    So in the folder ~/scans/year/project/sorted/org
    The file names would be something like: organizationThatGeneratedTheDoc-yyyy-mm-dd-TypeOfDoc.pdf
    In the folder ~/scans/year/project/sorted/TypeOfDoc
    The file names would be like: TypeOfDoc-yyyy-mm-dd-organizationThatGeneratedTheDoc.pdf
    Etc.

    You'd use links (symlinks or hard links) to make sure that each document is accessible in more than one place. (You can also use links to put documents in more than one project folder.)

    Types of documents would be things like invoices, receipts, legal threats, court orders etc. In the event that a document has more than one type, or more than one organization, you simply have more links. So invoice-2013-04-07-webdevteamawesome.pdf and legalthreat-2013-04-07-webdevteamawesome.pdf are the same document, because the first page is an invoice, and the second a threat to take you to court if you don't pay. (This then exists six times, three times for each type, but with the magic of hard links only takes up the space of 1.001 documents.)

    With the OCRed text being saved with the PDF scan, you can also run text searches within your files to find specific information (such as bill amount; seriously, how often would you use that information?)

    This allows you maximum flexibility, and prevents you from being locked into a particular piece of software (as you can do everything manually). Moreover, once you've got it set up, it's easy to run with each new document.
    Steps would be:
    1) Scan and OCR doc, saving the PDF into the staging area folder.
    2) Run your script, which asks for the date, project, org name, doc type.
    3) The script then saves the document in the appropriate folders, generating links as required.
    4) Profit!
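Steps 2 and 3 above could look something like the sketch below, which hard-links one staged PDF into a date, an org and a TypeOfDoc folder. It is an illustration under assumptions, not the poster's script: the function name `file_document`, the `date` folder name and the file name used inside it are invented here (the post only spells out the org and TypeOfDoc naming), and the interactive prompting from step 2 is replaced by plain function arguments.

```python
import os


def file_document(staged_pdf, sorted_root, day, org, doc_type):
    """Hard-link staged_pdf into three folders under sorted_root.

    `day` is a 'yyyy-mm-dd' string. Returns the created paths in the
    order date, org, TypeOfDoc. All links share one copy of the data,
    so the document takes up the space of a single file.
    """
    names = {
        "date": f"{day}-{org}-{doc_type}.pdf",  # assumed naming for the date folder
        "org": f"{org}-{day}-{doc_type}.pdf",
        "TypeOfDoc": f"{doc_type}-{day}-{org}.pdf",
    }
    created = []
    for folder, name in names.items():
        target_dir = os.path.join(sorted_root, folder)
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, name)
        os.link(staged_pdf, target)  # hard link: same file, new name
        created.append(target)
    return created
```

Hard links only work within one filesystem, so the staging area and the sorted tree would need to live on the same volume. Once the links exist, the staged name can be deleted and the document stays reachable through the other names.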

    • I should note, you need to be careful to make sure you use the same spelling and wording for each org and doc type. You don't want to end up with Murphies, Murphy's, Murphy's Inc., Murphpy's Beer Company Inc. etc., each with invoice, inv., invoise and envoice.
      It would be better if your script forced you to pick a doc type, and showed a list of already existing companies.
      This applies no matter what solution you end up running with.
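The "show a list of already existing companies" suggestion above could be backed by a fuzzy lookup, so that a near-miss spelling is caught before it becomes a new folder. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name `suggest_existing` and the 0.6 cutoff are assumptions chosen for the example.

```python
import difflib


def suggest_existing(name, known, cutoff=0.6):
    """Return up to three entries of `known` that closely resemble `name`.

    Comparison is case-insensitive; the original spelling from `known`
    is what gets returned, best match first.
    """
    lowered = {entry.lower(): entry for entry in known}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[hit] for hit in hits]
```

A filing script could call this with whatever the user typed and the names of the existing org (or doc type) folders, then ask "did you mean ...?" whenever the result is non-empty and not an exact match.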

      Also, for documents that cover a period, you have multiple options. The first

      • by whoever57 (658626)

        Your suggestion is over-complicated IMHO. I use Xsane and scan as multi-page documents. Xsane allows me to add pages to the scan set and reproduce a new PDF file. There are some downsides to my method: I need to have an approximate idea of the date of the document that I am looking for.

        I generally file by //.pdf, although I may vary the hierarchy if appropriate, for example: TAXES//.pdf

        Perhaps more important, though, is to extract the data into some form of record keeping (even if it is only a spreadsheet) at

        • by whoever57 (658626)
          Arrgh... /. ate my filenames, even though I posted it as Plain Old Text:

          I generally file by //.pdf, although I may vary the hierarchy if appropriate, for example: TAXES//.pdf

          Should be: I generally file by <TOPIC>/<YEAR>/<MONTH#>.pdf or perhaps <TOPIC>/<YEAR>/<MONTH#>/scans.pdf. I use other variations to the hierarchy if appropriate, for example: TAXES/<YEAR>/<Type_of_Form>.pdf. So all W2s for a particular tax year would be in the same PDF file.

          All scanned

    • by AmiMoJo (196126) *

      You would think that in this day and age someone could invent an OCR system that has a basic understanding of the documents being scanned. It would automatically name files "Bank of Elbonia Current Account Statement 03/2013" and allow you to do things like search all statements for transactions over £250.

      In fact the bank could just include a big QR code on the back with all the data in it, but I suppose that is asking too much.
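
      Until that exists, the "transactions over £250" part can be roughly approximated on plain OCR output. This is only a sketch: it assumes each transaction sits on its own line with the amount written like £1,234.56, and `lines_over` is a hypothetical helper, not part of any OCR package.

```python
import re
from decimal import Decimal

# Matches amounts such as £250, £250.00 or £1,234.56 in OCR text.
AMOUNT = re.compile(r"£\s?(\d+(?:,\d{3})*(?:\.\d{2})?)")


def lines_over(ocr_text, threshold):
    """Return (line, amount) pairs for amounts strictly above threshold."""
    hits = []
    for line in ocr_text.splitlines():
        for raw in AMOUNT.findall(line):
            amount = Decimal(raw.replace(",", ""))
            if amount > threshold:
                hits.append((line.strip(), amount))
    return hits
```

      It will miss anything the OCR mangles, which is exactly why a machine-readable code from the bank would be nicer.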

  • by Rob the Roadie (2950) on Sunday April 07, 2013 @03:16PM (#43385705) Homepage

    I've played with this a few times, never used it in anger though.

    http://www.mayan-edms.com/

    I might take up your challenge on going paperless too and give Mayan a go.

  • by Areyoukiddingme (1289470) on Sunday April 07, 2013 @03:19PM (#43385713)

    PDF is big and bulky. DJVU format makes for tiny document scans. And there are open source libraries for creating it, available even in Debian. Wavelet compression did finally make it into the wild. It's just nobody has ever heard of it, for some reason.

    Doesn't help for organization, but it should be a reasonable option for storage.

    It even embeds the OCR text in the document along with the image version, so it doesn't proliferate multiple copies of the same data.

    • You do realize that PDF can store the OCRed text alongside (or above? it's another layer) the original scanned text?

      Also, because the reference software of DJVU is GPLed, it's never going to see widespread commercial use (as all the big software companies only want to take and take).

      A standard subset of PDF (e.g. PDF/A) is a much better option, and if you're worried about the amount of file space taken up, you can always use GZIP or ZIP.

    • And as counterpoint; I bought a 32 GB, class 10 MicroSD card with adapter for $22.19 USD yesterday.
      • Flash media, by Sandisk's own reckoning in one of their white papers, has a realistic lifespan of about 10 years (roughly 5 years until the first unrecoverable read error for MLC media, roughly 15 years until the first unrecoverable read error for SLC media, with both unlikely to be directly-readable by any consumer operating system due to accumulated errors after ~20 years and require professional data recovery (SLC takes longer to get to its first hard error, but once the avalanche begins, it accelerates

    • by Inda (580031)
      PDF only wraps around the PNG, JPEG, BMP, generic_image_format. The extra bloat is only a couple of kb.

      If the bloat is more, the PDF has been generated incorrectly.
    • by fritsd (924429)
      That's what the Internet Archive uses, isn't it?
  • I have been trying to do this for a while. I have a ScanSnap S1500M and have been hosting all the PDFs on my Synology NAS. However, programs like iDocument don't support network drives and text searching PDFs. They rely on Spotlight's database, and Spotlight doesn't work on a NAS (though it supposedly does work on an Apple Server).

    I'd LOVE some sort of text searchable solution that is better. I do use iDocument, but that has a LOT of limitations, like it will not handle ePUBs. I'm hoping at some point Synolo

  • by bazorg (911295) on Sunday April 07, 2013 @03:37PM (#43385817) Homepage

    Maybe that Owncloud [owncloud.org] thing will work well to handle the storage and access. Does anyone know if its search function is any good?

  • Alfresco (Score:3, Informative)

    by Balr0g (960255) on Sunday April 07, 2013 @03:39PM (#43385829)
    I use the community edition of Alfresco [alfresco.com] for that task. You can tag all documents, add custom fields and have full text search and versioning out of the box. Documents can be accessed via web interface, smb, ftp and even imap.
  • My situation is the same, except that I move often, and have to keep legal documents for a few years (typically 5). I also have paper copies of invoices and bills (loads). I didn't want to have to lug boxes and boxes of paper, so I developed a script to do the following:

    1) Scan the document page by page, and save as tiff (300dpi)
    2) Run open source OCR on it, and save the resulting text to the tiff "comment" field on the metadata
    3) Save it in my file server.
    4) Index it with a desktop search program (here is
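
    The first three steps could be sketched like this in Python. This is not the poster's script: it assumes SANE's `scanimage` and the `tesseract` command-line tool are installed, and it writes the OCR text to a sidecar `.txt` file instead of the TIFF comment field. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def scan_command(resolution=300):
    """Command for SANE's scanimage; the TIFF data goes to stdout."""
    return ["scanimage", "--format=tiff", "--resolution", str(resolution)]


def ocr_command(image, output_base):
    """Command for tesseract, which writes its text to <output_base>.txt."""
    return ["tesseract", str(image), str(output_base)]


def scan_and_ocr(archive_dir, name):
    """Scan one page to archive_dir/name.tiff and OCR it to name.txt."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    image = archive / f"{name}.tiff"
    with image.open("wb") as out:
        subprocess.run(scan_command(), stdout=out, check=True)
    subprocess.run(ocr_command(image, archive / name), check=True)
    return image, archive / f"{name}.txt"
```

    Pointing `archive_dir` at a directory on the file server covers step 3, and a desktop search tool can then index the `.txt` files.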

    • by Rinisari (521266)

      Please do post the script. Throw it up on pastebin, or, better yet, https://gist.github.com/ [gist.github.com]

      • And done :)

        https://github.com/ZivaVatra/SDAT [github.com]

        Figured I would take the opportunity to try out GIT (have not bothered so far).

        Also, it seems that I have recently made it actually save to tagged PNGs rather than TIFFs. Forgot about that :)

        Hope it turns out to be useful to you. Let me know if you want commit privs for any fixes you do. Happy Hacking!

    • by beachdog (690633)

      Here are some pieces of a scan to ocr script I am developing.
      First I am scanning a multicolumn document and to preserve the sense of the document text, I scan even pages twice and odd pages twice.
      Second, the scanned images must be rotated. Pieces of the "convert" command appear in the perl fragments here.
      Third, I am using the open source tesseract OCR program. Some of my documents have grayed areas that contain text. So I am running tesseract twice on the source files and picking the output file with the mo

    • by kermidge (2221646)

      "I have been wondering whether it would be worth open sourcing the script...."

      Please do. Unless you deem it worthwhile to spiffy it up and try to make some moola, I think it'd be great to share your script. It could be useful to some, could be instructive to those wanting to learn, to see how someone else has done something; any possible embarrassment you might feel about it being 'a bit hacky' you might could toss off to 'having character'. Heck, after your description I'd like to see it, even tho I ha

      • As requested, I have put it on github now:

        https://github.com/ZivaVatra/SDAT [github.com]

        Hope it is useful to you, or at least interesting :)

        I don't think it is worth selling the script, it isn't that fancy, not to mention that then I would be on the hook for supporting it.
        Since the recession hit I've had to work 2 jobs, and I really don't have much time to devote to personal nerdy pursuits. Barely have time to sleep as it is :(

        All I can do is publish and hope it helps others. That little script has made my life a lot ea

        • by kermidge (2221646)

          Thank you. I'm enjoying looking at the script, am wondering where I can get a scanner I can afford, and had a look at your site as well. And thanks for the readme also. Now get some sleep. (grin) I've had two jobs at times, when I was younger, and it's decidedly not only no fun, but done too long a recipe for ungood juju. Good luck to you.

  • Thinking about this question, I checked the folder in which I keep research and notes for my primary area of study. It's 2GB and just under 2,000 separate files. Many of these are OCRed PDFs, some mp3, some .doc, .rtf. Mac OS X's indexing lets me do adequately quick find-by-content searches, and a relatively simple organizational schema for subfolders lets me consult categories of data swiftly. I also use a reference manager program that probably has close to 100 keyword tags, and Finder lets me get to stu
  • Just a couple questions come to mind:

    First: What is the purpose of keeping the information? If it's just to have a record for your own sake of what and when and how much, do you even need to scan the statement or receipt or keep the original? or can having all the info imported into a money manager be enough?

    I've been using Quicken for over a decade (still using Quicken 2000 actually as later versions are bloaty) to keep all my financial history in detail. For answering questions like "When did I buy th

    • by Rinisari (521266)

      The initial purpose of keeping the information is completion. I sheepishly admit to digital hoarding, and this may be feeding that desire. To me, it's easier to scan a document and tag it, rather than importing its information.

      I need to keep things like receipts for large purchases for insurance, expense, and warranty purposes, bills and account statements, tax documents, and even things like the rare paper letter I get (e.g. my former tax preparer died last year. If I were to be audited, I'd need some evid

      • There's a certain amount to be said for hoarding. There are quite a few documents that I'm unlikely to ever want again, but I may have need of. My current practice with the paper ones is to throw them into the appropriate drawer, since they're nicely compact in such a pile, and it'll likely be faster to do an exhaustive search a couple of times than to maintain a filing system.

  • I use gscan2pdf http://gscan2pdf.sourceforge.net/ [sourceforge.net] with my multifunction "printer" and then save the bills and documents in properly named and organized directories as pdf files. Simple as pie. (Why is pie simple?)

  • I use my iPhone to scan, convert to PDF, and upload to my Dropbox. The app cost me $6.

    Dropbox will always be there and is backed up.

  • I used to periodically sort things into hanging folders then dispose of anything nonessential after 3-4 years. A few years back I decided to switch to scanning. So I started collecting a pile of stuff to scan. In the intervening years, that pile has grown and grown and now the scanning would be such a big chore that I don't even like to contemplate it anymore.

    The simple fact is that most documents are not something you will ever need again so deserve the minimal effort you can put towards temporary mid-term

  • by buss_error (142273) on Sunday April 07, 2013 @05:35PM (#43386601) Homepage Journal

    Any open source library management software that does ebooks should help you out. Here's a list:

    http://sourceforge.net/directory/home-education/library/opac/os:windows/freshness:recently-updated/ [sourceforge.net]

  • (Longtime listener; first-time caller. If I'm doing it wrong, please be kind.)

    I've been going through the same issue and have painstakingly scanned/filed a metric crapton of old documents, putting them in a hierarchical directory structure where I can find them if I need them.

    But this sucks for a number of obvious reasons. The ones that bother me the most are:

    1) A scanned document is larger (*and* less useful) than a downloaded PDF.

    2) It's a manual process! I'd rather spend ten hours automating

  • Super simple scanning system using Linux.
    Make directory called scans, make another called taxes
    Have a text file of scanning hints with an easy to remember name.
    In a terminal, print the scanning hints file and use the Linux mouse copy feature to construct a scan instruction.
    The scanimage application requires sudo, or you can find a tweak using Google search to alter the scanner's USB files and make it run from an unprivileged user.
    cd scans
    cat filewitheasycommandstocopy.txt

    Typical contents of my hint file:
    sudo

  • Try out Alfresco. It is a nice document management system if you are familiar with IT systems.
  • I've been a houseowner and the rest for a while, and the amount of stuff you actually need to keep is not that great. I mean, who keeps old bank statements, credit card bills, invoices and receipts for more than a couple of months, unless they're business- or tax-related?

    I know it's probably different here in the UK where most people don't even need to do a tax return, but basically the really important stuff like house deeds (and wills) are in the hands of a solicitor anyway, and I simply don't need to k

  • This is not actually a real solution, I'm just amused every time I see document management show up in a slashdot thread - being it's not a very *exciting* field. I work for a company that provides document management solutions for much larger organizations, with costs (I think) starting in the tens of thousands of dollars, and going up to way more than that. Of course, since I work here, I have my own personal test repositories for testing things, and when I was buying a house and started getting crazy amou

  • Can anyone tell if you can train tesseract to be a bit better at recognizing a specific font?

    I'm using the Debian version but if you have a 300 dpi scan the OCR is often gobbledygook.

    (Yes, "use the source Luke" is also a valid answer in this case...)
  • Keep It Simple Software (engineer) or whatever...

    I went with the simplest possible solution. One that also allows me to recover even if a "database" becomes corrupted or obsolete, because all the "real" data is contained in the documents themselves.

    I just scan to PDF and add tags in the Keywords field of the PDF metadata. For the keywords, I use unique words that aren't going to show up in an actual document. (Just tacking on a prefix or surrounding each keyword in brackets is good enough.) I also organiz

"'Tis true, 'tis pity, and pity 'tis 'tis true." -- Poloniouius, in Willie the Shake's _Hamlet, Prince of Darkness_
