
High Speed Text To Digital Acquisition?

K asks: "As a longtime subscriber to many magazines (Linux Journal, Dr. Dobb's, Byte, etc.) and as a person who can't throw away anything written for fear it might one day be useful, I was wondering if anyone knew of any high-speed scanners that would allow me to archive these magazines for (possible) future use. My house is just getting too cluttered and I would hate to lose these data. In addition, are there any legal questions about this? I have purchased the magazines, so I believe that I should be allowed to archive them in any fashion I choose. Is this wrong?"

  • Sell them on eBay. Give somebody who missed them on the newsstand a shot at them before you make them irretrievable landfill fodder.

    Books and magazines, including the ads, are an experience in and of themselves, and as convenient as a CD with all the text in editable, copyable form might be, it just ain't the same.

  • Some magazines do this. Science [sciencemag.org] is a high-visibility example. Free access to digital articles for subscribers, and others can purchase on-line access.
  • German computer magazine C't [heise.de] offers subscribers a full year of issues in digital form on CD-ROM for an additional DM 10 (USD 5).

    Unfortunately, this is not the norm. Other magazines (like Spiegel [spiegel.de]) provide a digital version, but it costs DM 260 (USD 130) for everyone, subscribers included.

    I think it would be best to convince publishers to offer digital versions (or access to their online archives) for a small fee, at least for subscribers. Subscribers shouldn't have to pay again for the same content. The scanning is just too much work.
  • Well, I'm actually working on a project right now that involves high-volume document imaging. The company is looking at imaging several thousand pages per day on an ongoing basis, but then again, they're also buying $15,000 Kodak department document imaging scanners.

    There are a couple of problems.
    (1) Very high speed scanners that are used in document imaging are often only b/w. So, you lose all of the nice pretty advertisements.
    (2) You obviously have to cut apart all of your magazines on the spine to get them to go through an auto feeder.
    (3) The paper is likely to get jammed because it's glossy and so thin.
    (4) You'll spend better than $5,000 on any scanner that will do better than 10 ppm. And then you probably can't get duplex scanning.

    But, if you want a solution that works great, look at a software package called Ascent Capture from Kofax (http://www.kofax.com), and throw in a Kodak 3500 document scanner or better.

    Or if you want a solution that is affordable, invite all of your friends to bring their scanners over and give them beer and pizza while you all have a scanning party. It should only take about two hours to get through a year's worth of Linux Journal.

  • I've got a Fujitsu M3093DG scanner, and it fits the bill for exactly what you intend to do. On Fujitsu models, the D is for Duplex and the G is for SCSI. They also make an M3097DG which does 11x17, or two full pages side-by-side. Panasonic makes duplex non-flatbed scanners, but they don't support grayscale.

    Quick specs:
    27 ppm (duplex, so 54 images/minute)
    50 sheet document feeder
    4 seconds for a flatbed pass (@ 400dpi b/w). This is great for stuff you don't want to chop (works best with one person changing the page while another hits the scan button).
    up to 400dpi b/w or grayscale (which is hard to find), 600dpi b/w with an 8MB SIMM.

    It's SCSI, so you don't need the optical interface cards that some high-end duplex scanners require. But I think they only have TWAIN drivers for Windows.

    You can find them on eBay for $1000-$2500. Retail, they go for about $3500-$4000.

    I've had mine for a year and a half ($2200 on eBay, brand new), and have scanned about 30,000 pages without any problems. Some paper does give it feed problems, but then I just throw the next page in as the current one is fed in. It means I have to stand there, but it doesn't slow the process down.

    I use TypeReader for OCR (by ExperVision), but the scanner works fine with OmniPage or any other TWAIN-compatible program. I've found that scanning at 400dpi (instead of the standard 300dpi) reduces OCR errors enough to justify the larger files.

    The bigger problem you will find is what to do with all the info once it's scanned. You can just keep the .tiff images, or OCR and export to .pdf, or OCR and export to text or some formatted text format. Weigh each option by the time you put in against the quality and quantity of information you get out. Over time, the material starts to pile up, so figure out what you want to do before you start.

    I personally use Folio Views (flat file-based, free-form database) to organize my data, with several perl scripts to add the hierarchy structure, fix common OCR errors, etc. AskSam is another option. Of course, a real database is also possible, but requires more know-how and time than I have.

    As for your question about legal concern for doing this, technically it _is_ violating the copyright. By taking it to electronic format, you are--for all intents and purposes--making a new "copy" of the material, which you didn't pay for. Also, the magazine publisher just might not want that stuff to be in electronic format. They have that right. By scanning it in without asking and receiving permission, you are violating that.

    Ryan
  • Your idea is excellent and I've been wanting to do the same with most of my books, not just magazines. I've also been looking at those high speed scanners on eBay, but I just don't have space for one. However, a cheesy approach is to use a digital camera. Even a 2 megapixel consumer camera will give readable (but not OCRable) images if you just want to be able to look things up rather than relaxing and perusing. The Toshiba PDR-M4 is a good choice because it cycles quite fast. If you're real serious you could use a professional camera like a Nikon D1 (now about $3.5K), which might perform better given the volume you want to do. But even the S100 is great for impromptu copying of a few pages (I use a portable copy stand that's easy to take to the library, but for home use you should get something more solid). If you have the space for it, though, the high speed scanner method is probably the best approach. I definitely want to do it that way sooner or later, even though it means chopping up a lot of books. Yeah, people are horrified at that, because books are such a great information presentation medium (take to the beach etc.) but they're a terrible storage medium. [nightsong.com]
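
Several comments above trade numbers: 27 ppm duplex feeding, 300dpi versus 400dpi b/w scans, and a 2 megapixel camera that gives readable but not OCRable pages. The arithmetic behind those trade-offs can be sketched roughly as follows. Everything here is a back-of-the-envelope estimate under stated assumptions: a US letter page (8.5 x 11 inches), a 1600 x 1200 sensor standing in for "2 megapixel", and uncompressed 1-bit images (real b/w TIFFs are normally compressed, so files on disk would be smaller). The function names are made up for illustration.

```python
import math


def bitonal_page_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of a 1-bit-per-pixel scan of one page side."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Assume each pixel row is padded to a whole number of bytes.
    return math.ceil(width_px / 8) * height_px


def duplex_scan_minutes(page_sides, sheets_per_minute):
    """Feed time for a duplex scanner, which captures both sides of a sheet per pass."""
    sheets = math.ceil(page_sides / 2)
    return sheets / sheets_per_minute


def camera_effective_dpi(long_edge_px, page_long_edge_in):
    """Resolution obtained when one page fills the camera frame along its long edge."""
    return long_edge_px / page_long_edge_in


if __name__ == "__main__":
    # Assumed page size: US letter. Magazines vary.
    size_300 = bitonal_page_bytes(8.5, 11, 300)
    size_400 = bitonal_page_bytes(8.5, 11, 400)
    print("300dpi b/w page:", size_300, "bytes uncompressed")
    print("400dpi b/w page:", size_400, "bytes uncompressed")
    print("growth factor:", round(size_400 / size_300, 2))

    # Hypothetical 108-side issue on a 27 sheets/minute duplex feeder.
    print("feed time (min):", duplex_scan_minutes(108, 27))

    # Assumed 1600 x 1200 sensor for a "2 megapixel" camera.
    print("camera dpi:", round(camera_effective_dpi(1600, 11)))
```

Going from 300dpi to 400dpi multiplies the pixel count by (400/300)^2, so the raw image grows by a bit under 80 percent. The camera figure lands well below the 300dpi usually used for OCR, which is consistent with the "readable but not OCRable" remark. Feed time ignores jams, reloading the 50 sheet feeder, and cutting the spines.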
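
Ryan's comment mentions scripts that fix common OCR errors after scanning. His scripts are in perl and are not shown in the thread, so the following is only a minimal Python sketch of what such a cleanup pass might look like. The substitution table is hypothetical; a real one would be built from the errors a particular OCR package actually produces on a particular magazine.

```python
import re

# Hypothetical fix-up rules, applied in order. Each entry is (pattern, replacement).
OCR_FIXES = [
    # A lowercase 'l' or capital 'I' sandwiched between digits is almost always a '1'.
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),
    # A capital 'O' between digits is almost always a zero.
    (re.compile(r"(?<=\d)O(?=\d)"), "0"),
    # Rejoin words that were hyphenated across a line break.
    # Caveat: this also drops the hyphen from genuine compounds split at a line end.
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    # Strip trailing spaces and tabs at the end of each line.
    (re.compile(r"[ \t]+\n"), "\n"),
    # Collapse runs of spaces left over from column layout.
    (re.compile(r" {2,}"), " "),
]


def clean_ocr_text(text):
    """Apply every fix-up rule to the OCR output and return the cleaned text."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "scan-\nner at 4O0 dpi,  2l0 pages \n"
    print(repr(clean_ocr_text(sample)))
```

Keeping the rules in a plain ordered list makes it easy to add one whenever a new recurring error shows up, which fits the advice above to decide on a workflow before the scanned material piles up.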
