How Would You Archive Mounds of Genealogy Data?
dexter riley asks: "Hello, all. My mother, a librarian, historian and genealogist for over twenty years, died about a year ago. She left a huge amount of genealogy information, culled from books, magazines, and the internet, mostly in the form of typewritten, photocopied, and printed pages. My main goals are: Preservation - converting the documents into a compact format that can be easily copied and transferred to others; and Indexing - making it possible for someone else to easily find the documents referring to a particular person, family, place, or document type (like land, marriage, military, birth or death records). To this end, I would like to convert her work into a format that can be stored digitally and scanned for keywords, to make it easier for others to use this information for their genealogy projects later on. What tools do you recommend for handling a project of this size?"
"I'd estimate there are at least 10,000 pages of documents in all. Much of it is organized by binder into family groups, but a lot of it is unorganized, loose paper. Besides being an irreplaceable resource for any future genealogists in my family, there are other researchers working on related lines who may find some part of this data useful. At the very least, I would like the satisfaction of keeping some part of her work from being lost for a few years more.
Here's a general list of things that I've determined I would need:
- Scanners: What flatbed scanners would you recommend for fast, high-resolution scanning of documents?
- Image formats: What lossless image formats would you scan your original documents into?
- OCR software: Although OCR is not perfect, would you recommend using it to allow keyword searching to the original document? If so, which software would you suggest?
- Document Indexing: In addition to OCR, are there other tools (document tags?) that you would use to help classify and organize images and other digital documents?
- File organization software: Ultimately, many thousands of text and image files will be generated. Since I don't want to just convert a paper mess into a digital mess, what tools would you use to organize related image and text files?
That is SOME work... (Score:4, Informative)
http://www.archivebuilders.com/whitepapers/index.
2. Get two reasonable scanners that work with whatever software you choose. One with a document feeder (can be monochrome). Modern office MFPs work fine. The other one is a cheap flatbed scanner with color for anything the big one won't process.
3. Doc prep and Indexing will take much longer than the scanning - and unlike OCR, are a lot of manual labour. Expect a couple of weeks, minimum, especially if you haven't got an indexing scheme in place.
4. Use TIFF G4 and PDF (OCRed text over the images).
5. Profit.
Re:Scanner throughput - tests (Score:1)
Scanning 2500 sheets at 600dpi with an MFP that my current employer builds (no names, sorry
Re:more specs for consideration (Score:1)
While scanning is a very profitable thing to do professionally, it might still pay to shop around for a professional scanning service. If you do the indexing and preparation yourself, cost will be significantl
Re:more specs for consideration (Score:2)
Re:Scanner throughput - tests (Score:2)
Dead tree (Score:3, Insightful)
Re:Dead tree (Score:5, Insightful)
After I give them a dopeslap for not keeping their data current.
I mean, my first Mac had a 40MB hard drive, but I still have all the data from it - it's become easier and easier each generation to copy all my old data forward.
Granted, there's always the odd lost-tape found behind a cabinet, but that's someone who didn't have a good data retention plan in place and didn't care about that data too much.
There will always be a need for forensic recovery, but compared with just a few years ago, almost all the casual users I know keep all their data on a hard drive. The floppies and ZIP's are gone. Some of it is on CD-R, but that's the new backup media, not current storage.
Now getting them to do a good backup so I don't have to go rescue their drives with dd_rescue - somebody let me know how to do that!
Re:Dead tree (Score:2)
When your starting point is a Mac that has a 40MB hard drive, your perspective is understandable.
Some of us have/had data on old 8" floppies or 5.25" floppies, back from the days when a 5MB hard drive was $2,000 and the size of a Dell server.
It wasn't so easy to keep data current in those days. If you were lucky, you had a serial port, and could transfer to a later generation machine. But disk formats were not standardized the way they are t
Re:Dead tree (Score:2)
For me it's not, but going forward the easy-to-read data model is standard. So _today_ you have no good excuse for keeping your data in a non-digital manner. Heck, I can hook up an IDE, SATA or SCSI drive via PCI controller, USB controller, or Firewire controller all with ease. Tell that to the MFM drive I still have sitting in the corner to dump.
(If you or anyone knows a cheap service to grab data off of
Re:Dead tree (Score:4, Insightful)
Frightful.
I was looking at 100-year-old newspapers from a small town in AL about 5 years ago. Yellowed, brittle, crumbly. Half of those old newspapers were useless, pawed over by amateur genealogists who were slowly contributing to their demise by even attempting to turn the pages.
Then, there's the tons of valuable paper records that get passed to random descendants of record keepers (e.g., Grandpa has a bunch of records of marriages, births, baptisms from some old church from 60 years ago that doesn't exist).
If you make dead tree records I'd recommend making multiple copies that get distributed to different people in different places, preferably on acid-free paper.
The biggest enemies of genealogical records, IMHO, are
Re:Dead tree (Score:2)
Good choices, I'd add:
Re:Dead tree (Score:2)
Organization (Score:4, Insightful)
You might also want to ask the guys from rotten.com if they'll let you see the code behind the nndb.
Re:Organization (Score:1)
Re:Organization (Score:2)
I definitely want to find a document management software/structure that will let me maintain what order exists, while making it possible to add structure to th
An Intern (Score:2)
Oh... You will probably find a wiki sort of thing the easiest to deal with (sort of a DB front end for dummies)
Re:An Intern (Score:2)
Of course he doesn't expect her to do it all herself. He'll get her [ghostbusters.net] to help.
OCR software: ABBYY (Score:3, Informative)
Mormons can help (Score:5, Informative)
Re:Mormons can help (Score:1, Interesting)
This articl
Re:Mormons can help (Score:2)
Re:Mormons can help (Score:3, Informative)
Re:Mormons can help (Score:2)
Re:Mormons can help (Score:3, Interesting)
Although they might not be able to take that level of detail into their central database, I think they would welcome info like that at a more local level. At the local levels, they often have small libraries of
Re:Mormons can help (Score:1)
Mormons? (Score:4, Informative)
This link [familysearch.org] may or may not be useful.
Family history organization (Score:1)
Re:Family history organization (Score:1, Informative)
Re:Family history organization (Score:1)
Re:Family history organization (Score:2)
I'm kinda interested in how that works, do you have any links?
Re:Family history organization (Score:1)
It is from the LDS Church site because I prefer to find out information from the perspective of the believers in an open-minded way rather than entertain every outraged misconception of the populace.
(read: I'll never find out what a Toyota owner loves about his car from a Ford dealership)
LDS baptisms for the dead [lds.org]
Re:Family history organization (Score:2)
From reading that site, it doesn't seem entirely bad... more like one of those things people mention without actually explaining to get a rise out of people.
Cyndi's List (Score:1)
Google (Score:4, Interesting)
You'll have it forever, and anyone will be able to pull it up.
(Too bad they don't offer this service... yet.)
Re:Google (Score:2)
If there is significant enough value, pawning off some of the work to Google or the Internet Archive or something similar isn't a bad idea. In particular, many libraries and Universities already do this kind of work.
Tools (Score:3, Interesting)
Document Indexing/File Organization: A Wiki is the proper tool for this job, in my opinion. It makes editing very easy, and hyperlinking is instinctive. You can easily attach documents to pages, and you can usually export the whole thing as a directory tree. Most Wiki software also keeps track of all of the versions of a page, so you can worry less about making bad mistakes.
I've used both MoinMoin [wikiwikiweb.de], which is a traditional web-based Wiki, and WikidPad [sourceforge.net], which is an IDE environment for Windows that does Wiki-like things. Both of these programs are open source, Python-based applications.
You also might want to check out ThumbsPlus by Cerious Software [cerious.com], which stores thumbnails of images in a database (including SQL backends), along with keywords and user fields. It can help you as well.
--Mike--
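The "keywords in a database" idea from this comment can be tried out with nothing but SQLite. The following is only a rough sketch of that idea, not how ThumbsPlus works internally: the table names, the function names, and the sample paths are all made up for illustration.

```python
import sqlite3


def build_catalog(conn):
    """Create two hypothetical tables: scanned documents and their keywords."""
    conn.executescript(
        """
        CREATE TABLE document (
            id   INTEGER PRIMARY KEY,
            path TEXT UNIQUE NOT NULL
        );
        CREATE TABLE keyword (
            doc_id INTEGER NOT NULL REFERENCES document(id),
            word   TEXT NOT NULL,
            UNIQUE (doc_id, word)
        );
        """
    )


def add_document(conn, path, keywords):
    """Register one scanned file and tag it with lower-cased keywords."""
    cur = conn.execute("INSERT INTO document (path) VALUES (?)", (path,))
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT OR IGNORE INTO keyword (doc_id, word) VALUES (?, ?)",
        [(doc_id, word.lower()) for word in keywords],
    )
    return doc_id


def find_by_keywords(conn, keywords):
    """Return the paths of documents tagged with ALL of the given keywords."""
    words = sorted({word.lower() for word in keywords})
    if not words:
        return []
    placeholders = ",".join("?" * len(words))
    rows = conn.execute(
        "SELECT d.path FROM document d "
        "JOIN keyword k ON k.doc_id = d.id "
        f"WHERE k.word IN ({placeholders}) "
        "GROUP BY d.id "
        "HAVING COUNT(DISTINCT k.word) = ? "
        "ORDER BY d.path",
        (*words, len(words)),
    ).fetchall()
    return [row[0] for row in rows]
```

With something like this, a surname, a place, and a document type (land, marriage, military) can all be plain keywords on the same scan, and a lookup is one query.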
Re:Tools (Score:2)
Storage? (Score:2)
Fort Wayne Library Genealogy (Score:2, Informative)
The Fred J. Reynolds Historical Genealogy Department of the Allen County Public Library was organized in 1961 by the library director for whom it was named. The department's renowned collection contains more than 300,000 printed volumes and 314,000 items of microfilm and microfiche. This collection grows daily through department purchases and donations from appreciative genealogists and historians. Because of the collectio
In response to your questions... (Score:2, Interesting)
Paper, then plain textfiles, then ... (Score:2)
As for digital format, I suggest you consider plain (ASCII) text. It's likely to outlast more complex formats.
I've tried some specialised genealogy software, and it looks like a mixed blessing. It usually forces the data into pre-ordained types, and quite a lot of
Re:Paper, then plain textfiles, then ... (Score:2)
Google says search, don't sort! (Score:3, Interesting)
If you agree with that philosophy, then, after you have it all in ASCII, just do a full text index of the data (which makes sense if the data is rarely or never updated) and it is quick to pull out anything you need.
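For anyone wondering what "a full text index" amounts to, here is a minimal sketch in Python of an inverted index over plain-text (e.g. OCR output) files. It assumes the text is already in memory as a name-to-text mapping; the function names are made up, and a real project would more likely use an existing search tool than this toy.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return TOKEN.findall(text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it.

    docs is a mapping of document name -> plain text.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

Because the index is built once and only read afterwards, this fits the "rarely or never updated" case mentioned above; OCR errors will of course cause missed hits.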
Fire. (Score:3, Funny)
File Organization (Score:2)
While it's a somewhat ugly text-based, flat-file format, it does permit organization of information in ways that will be useful to genealogists and researchers.
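To give a feel for the flat-file format: each GEDCOM line is a level number, an optional cross-reference ID such as @I1@, a tag, and an optional value. Below is a small sketch of a line parser in Python. It is a simplification written for illustration (it ignores character-set handling and CONC/CONT continuation lines), and the sample name in the usage note is invented.

```python
import re
from typing import NamedTuple, Optional

# level, optional @XREF@, tag, optional value
LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?$")


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]
    tag: str
    value: Optional[str]


def parse_line(line):
    """Split one GEDCOM line into its level, xref, tag, and value parts."""
    match = LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return GedcomLine(int(level), xref, tag, value)


def parse_text(text):
    """Parse every non-blank line of a GEDCOM fragment."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]
```

For example, parse_line("1 NAME Jane /Example/") yields level 1, no xref, tag NAME, and the value "Jane /Example/". The nesting of records comes entirely from the level numbers, which is what makes the format ugly but easy to process.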
Re:File Organization (Score:2)
Even better, pretty much any genealogy software can read and write GEDCOM files. Think of GEDCOM as the CSV of the genealogy world. It even has rudimentary facilities for storing multimedia files in an encoding similar to Base64. I don't know if the various software packages commonly support it, though. (My wife's the family genealogist. I
Re:File Organization (Score:2)
I think many of the recent ones do. (Of course, there are still the products that don't really support GEDCOM at all, but they're the minority.)
Then there are a lot of hybrid solutions (like the one used by phpGedView, for example) which will support only "external" media.
I suspect that, eventually, they will all support the embedded media.
Here's what I do (Score:2)
http://avdcs.com/prod/cards/fb1200.html [avdcs.com]
Although you won't usually print anything at higher than 300 (ink jet) or 600 (laser) dpi, I recommend that, for archival purposes, you go ahead and scan at full-color 1200dpi from within Photoshop or something.
I would make sure to back up all your originals using a DVD DL (Dual Layer) drive, as you can now get the drives for about $50 and they can use CDRW, DVD+/-RW or DVD DL disks a
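Before committing to a resolution, it may be worth running the numbers. The sketch below estimates raw (uncompressed) storage for the roughly 10,000 pages mentioned in the question. The US Letter page size, the 24-bit color depth, and the nominal 8.5 GB dual-layer disc capacity are assumptions of this example, not figures from the thread.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page (US Letter assumed)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumption: nominal capacity of one dual-layer DVD, in bytes.
DVD_DL_BYTES = 8_500_000_000


def discs_needed(total_bytes, disc_bytes=DVD_DL_BYTES):
    """Number of discs needed to hold total_bytes, rounded up."""
    return (total_bytes + disc_bytes - 1) // disc_bytes


PAGES = 10_000  # page count estimated in the original question

color_1200_total = PAGES * raw_page_bytes(1200, 24)  # full color, 1200 dpi
mono_600_total = PAGES * raw_page_bytes(600, 1)      # black and white, 600 dpi
```

Under these assumptions the collection comes to roughly 4 TB raw at 1200 dpi color versus about 42 GB raw at 600 dpi black and white. Compressed formats shrink both considerably, but the gap of about two orders of magnitude is worth knowing before choosing a scan setting.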
First things first (Score:3, Informative)
The most important thing you can do with a pile of data is turn it into useful information. I know that's what you're trying to do. In the case of genealogical data, the most important thing is not just an electronic version of paper sources, but something built from those sources: an electronic family tree with facts (birth date/place, death date/place/cause, graduations, marriage date/place, immigration, military service, etc.) attached to each person and SOURCES attached to each fact, and all of it exported to GEDCOM format.
For each source, good software will let you include transcripts (not scans but transcribed excerpts that aren't too big) along with the bibliographic reference that tells others where they could find the information for themselves.
I probably have as much of this sort of data as you have, but most of it has been "processed" by extraction. That leaves the rest as backup that I seldom need to refer to. Since 99% of the information (by VALUE, not by bits) has been extracted into a very useful form, the rest is referred to so rarely that paper is just fine (as long as there is another copy stored elsewhere).
I'm not saying that having electronic copies of all of the original paper sources wouldn't be better. It would, but mostly for ease of backup. It's just that the most valuable thing to do (in my opinion) is the extraction and assembly of a really rich information structure (family tree, facts, sources, transcriptions of sections of sources) in a standard interchange format (GEDCOM).
After doing so, the additional value in having electronic, searchable copies of the original documents gets much smaller, so if I were you I'd make sure the information extraction project was done first before embarking on the scan/OCR/indexing project. After you do the former, you may decide that the latter isn't worth the effort.
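The structure described in this comment (people, facts attached to people, a source and a transcribed excerpt attached to each fact) can be sketched as a few Python dataclasses with a GEDCOM-style export. This is only an illustration: the class and function names are invented, the tag layout follows a simplified reading of GEDCOM source citations that should be checked against the specification, and only single-line transcripts are handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    xref: str   # cross-reference ID, e.g. "@S1@"
    title: str  # bibliographic reference telling others where to look


@dataclass
class Fact:
    tag: str                          # event tag, e.g. "BIRT" or "DEAT"
    date: Optional[str] = None
    place: Optional[str] = None
    source: Optional[Source] = None
    page: Optional[str] = None        # where in the source the fact appears
    transcript: Optional[str] = None  # short transcribed excerpt, one line


@dataclass
class Person:
    xref: str  # e.g. "@I1@"
    name: str  # surname between slashes, e.g. "Jane /Example/"
    facts: List[Fact] = field(default_factory=list)


def source_to_lines(source):
    """Emit a GEDCOM-style source record."""
    return [f"0 {source.xref} SOUR", f"1 TITL {source.title}"]


def person_to_lines(person):
    """Emit a GEDCOM-style individual record with sourced facts."""
    lines = [f"0 {person.xref} INDI", f"1 NAME {person.name}"]
    for fact in person.facts:
        lines.append(f"1 {fact.tag}")
        if fact.date:
            lines.append(f"2 DATE {fact.date}")
        if fact.place:
            lines.append(f"2 PLAC {fact.place}")
        if fact.source:
            lines.append(f"2 SOUR {fact.source.xref}")
            if fact.page:
                lines.append(f"3 PAGE {fact.page}")
            if fact.transcript:
                lines.append("3 DATA")
                lines.append(f"4 TEXT {fact.transcript}")
    return lines
```

The point of the design is that every fact carries its own pointer back to the paper it came from, so the binders remain the backup and the exported file becomes the thing people actually search and share.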
Re:First things first (Score:1)
This allows you to attach notes to each individual, and thus as much extra information as you want. You can also attach images, scanned documents, pictures etc...
The program is for Windows, but is very inexpensive (~$20.00).
This would allow you to organize all the information with all relations and family tree informat
GEDCOM (Score:3, Informative)
As for hardware and other software, I would suggest you use what is familiar to you, or is compatible to the software package you have chosen.
Put it on the web. (Score:2)
It is run by the Church of Jesus Christ of Latter-Day Saints. Future access will pretty much be a given, and it will help others with their family history.
GEDCOM (Score:2)
Have it photographed (Score:2)
DOCUMENT Scanners (Score:2)
I am extremely happy with my Canon DR-2080C [canon.com]. Note: It is the onl
File Archival (Score:2)
As for organization, just come up with a simple, sane way to name things (like
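The commenter's own naming example is cut off above, so the following is not it. It is just one hypothetical scheme, sketched in Python, to show what a "simple, sane" convention could look like: family, document type, year, and page number, in that order. Every name and field here is an assumption made for illustration.

```python
import re


def slug(text):
    """Lower-case the text and replace runs of other characters with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def scan_filename(family, doc_type, year, page, ext="tif"):
    """Build a sortable file name such as 'riley_marriage-record_1902_p0003.tif'.

    year may be None for undated documents.
    """
    year_part = f"{year:04d}" if year is not None else "undated"
    return f"{slug(family)}_{slug(doc_type)}_{year_part}_p{page:04d}.{ext}"
```

Zero-padded page numbers keep files in order in a plain directory listing, and avoiding spaces and punctuation keeps the names safe on any filesystem the archive gets copied to later.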
Don't, unless you have the same goals (Score:2)
I think the best way for you to use this information is to first develop an interest in genealogy yourself. Genealogy is as much an art as a science. What you see as only mountains of raw data, the genealogist sees as potential relationships between any one piece and any other. Some examples:
Census Forms. If you come across a series of census forms, how are you going to organize them? Are they by year? By location? By primary line? By associated families? (Which may be the only connection between the primary
Proof oriented information gathering (Score:2)
This is very true, but there are few genealogists who work according to this principle, and even fewer software packages that support this way of working. It seems that most programs are only about recording the conclusions of your research.
Even the GEDCOM format is not really designed according to these principles. A good genealogic pro
Re:Proof oriented information gathering (Score:2)
Very interesting thought. I occasionally pass along thoughts to the developers of GRAMPS [gramps-project.org]; I wonder how an existing/traditional app could be modified towards this better methodology without stranding any of the traditional users.
Hot on the heels... (Score:2)
Hot on the heels of this... I believe the linked thread resolved that digital formats are doomed to fail because they are unreliable over the long term (i.e. > 5 or 10 years). Anywho, the thread might help you with the storage and backup aspect :)
Re:Hot on the heels... (Score:2)
LINK [slashdot.org]
Re:Hot on the heels... (Score:2)
Once a newer/better/more reliable storage medium comes around, it takes an hour or two (usually) to transfer the data to the new media.
Formats, however, are fine. GIFs, TIFFs and JPGs
Digital preservation (Score:1)
Given that this problem is quite hard to solve in the long term (although much easier if just for the short term), it would probably be better to donate the material to an organisation with the resources and longevity to secure it.
Over the long term, you will have to migrate storage media every few years. You will also have to migrat
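One practical aid for periodic media migration, offered as a suggestion rather than something from this comment, is a checksum manifest: record a hash of every file before copying, then verify the copy on the new medium. A minimal Python sketch using only the standard library, with invented function names:

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_copy(manifest, new_root):
    """List (relative path, problem) pairs for files missing or altered in the copy."""
    new_root = Path(new_root)
    problems = []
    for rel_path, expected in manifest.items():
        target = new_root / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif file_sha256(target) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

An empty list from verify_copy means every file in the manifest arrived intact on the new medium; keeping the manifest alongside the archive also lets a later migration detect files that silently went bad in storage.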
GEDCOM (Score:2)
You need Document Management (Score:1)
I use ScanSoft's (www.scansoft.com) PaperPort ($100), which is about as cheap as you can get for document management on a decent level... it has basic OCR software built-in, although they sell OmniPage for about $150 (I think), for better OCR capabilities.
T
Digital preservation issues, GEDCOM, etc. (Score:1)