Multi-page PDF To Multi-page TIFF and Archiving?
GeorgeMonroy writes "One of my clients has aperture cards that they have been scanning into multi-page PDF files — but now they want them in multi-page TIFFs instead. One of the reasons they gave for this is that TIFF files require less storage space. While that is true, I wonder if TIFF is the best format going into the future. Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway. Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one? If another file format is better than TIFF, then are there any programs for batch processing that you can recommend?"
Are these things images or documents? (Score:1, Interesting)
If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.
I would suggest that your client doesn't know WTF they want.
Re:Are these things images or documents? (Score:5, Informative)
> If they're images, then you should use TIFF (or perhaps PNG). However, it doesn't make sense for them to be "multi-page." If they're documents, then PDF is appropriate.
Multi-page TIFF is well supported in the industry. There is nothing "weird" about it. It even supports embedded, searchable text (a Microsoft addition, but something that actually adds value). PDF archival can be difficult to do correctly. At the very least you want to use a product which supports PDF/A, followed up with some serious validation to make sure the results are actually compliant. Otherwise you may get bitten decades down the road. Searchable TIFF, on the other hand, will be around for freaking ever.
Re:Are these things images or documents? (Score:5, Informative)
Re: (Score:1)
Re: (Score:1)
Re: (Score:1)
I assume that he means that it doesn't make sense for an image to be multi-page. In other words, it should be a single page and let the printer driver/program work out the paging. Of course, if they are scanning into PDF in the first place I would assume that they aren't images.
Re: (Score:1)
Oops, I actually followed the link to find out what an aperture card is and I see that I'm wrong- they are basically engineering drawings so they are images (with text annotations). But that doesn't explain why they are multi-page when a PDF page can be any size you want.
Re: (Score:2)
Re: (Score:1, Interesting)
> ... PDF archival can be difficult to do correctly. At the very least you want to use a product which supports PDF/A, followed up with some serious validation to make sure the results are actually compliant...
Every day I convert large numbers of PDFs to multi-page or single-page TIFFs, and every day I come across PDF errors. Unless you do extensive and exhaustive error checking/trapping, I would archive in TIFF format. The TIFF format is considerably less prone to errors within the document. TIFF is widely used and accepted, especially in United States courts of law.
Re: (Score:2)
I don't understand why anyone would move from tried and true pdf systems for a TIFF system unless they want to lose text search for employees, regulators and the public. Handing people image based pdf instead of text files or normal pdf is a standard practice for the Bush administration that borders on criminal obstruction of justice.
TIFF is the old established "tried and true" (haw) standard, PDF is the new hotness. So I doubt anyone is moving backwards here.
Also courts have requirements for electronic document formats and there is nothing non-standard about a "image based pdf" (these also support searchable OCR full text).
Re: (Score:2)
> Also courts have requirements for electronic document formats and there is nothing non-standard about a "image based pdf" (these also support searchable OCR full text).
Re: (Score:2)
You understand we're talking about document archiving, right? Postscript is a terrible format for raster images.
Re: (Score:2)
Your paranoia is amusing. I wasn't "slamming" TIFF, merely pointing out that the fact that the spec is open and available is no reason to believe that all files conform to that spec. I was not recommending TIFF in all of its many incarnations, but in specific forms. I suppose I should have been more explicit. Use basic TIFF, with a 0=white photometric, with G4 compression. Stick to that and any viewer will open it.
If you'd like to see some actual pimping of the product I ACTUALLY work on, see www.swiftvie
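For anyone who wants to enforce the "basic TIFF, 0=white, G4" profile mentioned above, here is a rough sketch of a checker. It reads the text printed by libtiff's tiffinfo on stdin. The function name check_basic_g4 is made up for illustration, and the exact wording of the tiffinfo fields ("Compression Scheme: CCITT Group 4", "Photometric Interpretation: min-is-white") is assumed from typical libtiff output, so verify it against your own version first.

```shell
#!/bin/bash
# check_basic_g4 (hypothetical helper): reads `tiffinfo` output on stdin.
# Succeeds only if every page (TIFF directory) reports G4 compression
# and a min-is-white photometric. Field wording is assumed, not guaranteed.
check_basic_g4() {
    awk '
        /^TIFF Directory/                          { pages++ }
        /Compression Scheme: CCITT Group 4/        { g4++ }
        /Photometric Interpretation: min-is-white/ { white++ }
        END { exit !(pages > 0 && g4 == pages && white == pages) }
    '
}
```

Usage would look like: tiffinfo scan.tiff | check_basic_g4 && echo "profile OK". Anything that fails the check gets re-encoded or at least documented.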
Re:Astroturf? (Score:4, Informative)
Re: (Score:3, Interesting)
I was not recommending Microsoft's TIFF+Text, but rather some easily maintained combination of TIFF+text. If that is Microsoft's extension, then that's what it is. I see no loss of functionality due to the storage format not being vector-based. An archival format is going to be looked at on the screen, searched, and possibly printed. There is no benefit to vector graphics except perhaps a size argument, which is obviated by technologies like JBIG2 (if they'd ever get off their butts and formalize the JBIG2-
Re: (Score:2)
A scan? Do you have some suggestion how to convert a SCAN to a vector format and magically synthesize information that was never there in the first place?
For raster archival, 600 DPI is good enough. Nobody is suggesting archiving rasters at screen resolution.
Re: (Score:2)
If you OCR the text and determine that there is no image data, you can discard the raster data and compress more heavily, leaving the text as vector data.
That's a form of magically synthesizing vector data from raster scans.
Re: (Score:2)
Don't OCR (Score:2)
OCR doesn't work that well, you will still need the images. So you won't gain anything from doing TIFF+text.
Re: (Score:2)
Yes, it makes sense for them to be multi-page. TIFF is a multi-page format -- it actually supports a huge number of subformats (and different compression algorithms) in addition.
Re:Are these things images or documents? (Score:4, Informative)
TIFF: Thousands of Image File Formats
If you do wind up converting to TIFF, then remember to document everything in excruciating detail. With thousands of possible combinations, each of which is a perfectly valid TIFF image, you may run into trouble if someone is using a less robust reader that assumes the wrong compression algorithm, byte order, data striping or photometric interpretation.
Re:Are these things images or documents? (Score:4, Informative)
Bah. Use libtiff, document that only readers which also use libtiff (>= the version you're using) are supported, and you're done.
Re:Are these things images or documents? (Score:4, Insightful)
Having the format support a thing, and having that thing make sense are two different things. For example, Excel supports being used as a database... but does it make sense?
My point: use image formats for images, and document formats for documents. If the things you're trying to store are images, don't put them in a document format, and if they're documents, then don't put them in an image format.
Also, if TIFF is designed to store both images and documents, then I question whether it is too general to do either of them well. And your mention of "subformats" makes me think my concern is well-founded!
Re: (Score:2)
Sure, it makes sense. Look at faxes -- they're multiple pages of bitmapped images. Does that compose a "document"? Maybe. Is it something TIFF is good for? Absolutely! There are standardized tags used for storing extra information about fax transmissions in TIFF documents, and there's a great deal of software which makes use of that metadata. The same thing is true of archiving documents composed of scanned images -- there are a great many tag types associated with metadata such as scanner model and configu
Re: (Score:2)
Ah, so TIFF is really general. In that case, saying you're storing something "in TIFF" doesn't really say much, does it? It's kind of like saying you're storing it "in XML" -- what does that mean, if you don't specify a schema too?
Re: (Score:2)
If I get a multi-page fax, I don't want it delivered as a slideshow -- I want a PDF or a multipage TIFF. Yes, PDF is a sensible choice in this situation -- but what basis do you have for saying that multipage TIFF isn't? It's widely used in that case, and there's software support for the tagging for
Re: (Score:2)
How are those options better? How is TIFF bad?
TIFF has direct support for the compression and image formats used by fax machines over the wire, so there's no transcoding needed to write a TIFF of an incoming fax (and an outgoing fax can be preprocessed for sending while keeping it in TIFF). Tools that you and I already have installed (ie. tiffinfo) can provide caller ID and other useful metadata. There's a complete toolchain bui
Re: (Score:3, Insightful)
> Scanned documents are both images and documents.
Exactly. I don't mean to be rude, but the GP's comment is just silly.
"Documents" can be either graphical or text (or both). There's a reason why word processing formats allow embedding of images.
In fact, "document" has become an almost meaningless term since it can apply to so many types of data. About the only thing all the different "document" formats have in common is that their content is 2D!
Re: (Score:2)
Re: (Score:3, Informative)
Both PDF and TIFF handle multiple pages, and have done so for years.
Either would be suitable for this application.
If you really want to convert these, Imagemagick would be the best tool to use.
However- it seems a little daft for storage space to be the main reason for changing: you're simply exchanging one compressed image format for another. You may save 10% e.g. if you move from JPEG in the PDF to PNG/similar in the TIFF but is that really worth the effort?!
If they really want shiny TIFFs, it would be eas
Re: (Score:3, Interesting)
ImageMagick does its conversion entirely in memory, so if a document is more than a hundred pages or so you are going to have some problems.
Re: (Score:3, Informative)
I agree that unless the files are extremely huge or extremely numerous, storage space probably shouldn't be a concern, because it's cheap compared to your time and getting cheaper. But if storage space is a concern, then you might look into the TIFF format used by the patent office. Apparently it uses a form of lossless compression taken from fax machines and gets much better compression than many other common formats on black-and-white (no greyscale) documents. If it's the patent office's standard archive
Re: (Score:1)
I used to work at a company in the imaging/scanning/OCR market for *cough-large-printer-companies-cough*. The best format to save the images, IMO, is LZW Tiff format. PDF is fine and all for storing images, and provides the compact file size. However, you can do LZW Tiff images, and are able to convert Tiff images into other formats easier than trying the same with PDF. On that same note Tiff still retains all the image layers just the same. I've been out of the loop on processing techniques, but if you wan
Mac OS X has that functionality in Preview. (Score:1)
Apple was kind enough to build this functionality into Mac OS X in the form of Preview and Automator (or Apple Script).
Ghostscript (Score:2, Informative)
Ghostscript [wisc.edu] can do the conversion from the console.
You can write a simple shell script to convert all files.
Re: (Score:2)
Solaris 10, (and I would presume Linux as well) has a package of TIFF utilities.
We didn't do too much pdf2tiff (most stuff came in as TIFFs), but I don't remember any major issues the other way (tiff2pdf).
I imagine a fully featured converter application would do a better job though.
ImageSite (Score:2)
Re: (Score:2)
DjVu (Score:4, Informative)
The most effective compressors are commercial, but DjVu [djvu.org] is a very effective image archival format; see DjVuLibre [sourceforge.net] for the non-commercial tree.
Moving back towards the question in the article, I don't think there's much worry about either TIFF or PDF in terms of future proofing; they're both very widely used, have multiple implementations and third parties with substantial interest in keeping those implementations maintained, etc. The quality of TIFF implementations varies wildly, but the good ones are only going to get better, and I'd be shocked if libtiff ended up terminally bitrotted without a successor implementing a superset of its functionality inside my lifetime.
Nature of microfilm content? (Score:1)
Re: (Score:2)
Tiff is better (Score:2)
But size does not have anything to do with it. TIFF is far simpler in structure than PDF and has therefore better compatibility. TIFF is also well documented. Of course, they would have to use raw tiff to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.
Re:Tiff is better (Score:5, Interesting)
> But size does not have anything to do with it. TIFF is far simpler in structure than PDF and has therefore better compatibility. TIFF is also well documented. Of course, they would have to use raw tiff to get the advantages. The storage-space argument is secondary and matters only insofar as larger data sets have a higher risk of corruption.
I dispute the "well documented" claim. The TIFF standard is quite clear. Unfortunately, almost nobody adheres precisely to the standard. I work extensively with TIFF and PDF, and I have to say that the consistency I see in PDF is about 100 times more than what I see in TIFF. Your typical TIFF reader will contain thousands of hacks and workarounds for oddities that are produced by major players in the industry. While there is slightly non-compliant PDF, I have never seen things that even begin approaching the strangeness I see in TIFF on a daily basis. Having said that, I recommend TIFF plus search text metadata for archival, not PDF.
Re: (Score:1, Interesting)
I'd have to agree with that - I keep bumping into nasty combinations of old style JPEG in TIFF images combined with Wang annotations and highlighting - the choice of viewers that will cope with both of those at once is pretty limited
Re: (Score:1)
> Having said that, I recommend TIFF plus search text metadata for archival, not PDF.
Can I ask why? Your whole post seemed to slam TIFF, saying that it's too varied from the actual spec. How does that translate to being a good archival format? Or are you just saying, use TIFF if you follow the spec?
Re: (Score:2)
> Or are you just saying, use TIFF if you follow the spec?
Yes, I stopped my explanation too soon. If you stick to what's actually written (not implied by the existence of viewers that are already out there) in the Baseline TIFF specification, your file will be viewable everywhere. As far as text metadata, the curse is also the beauty, because TIFF easily lets you place tagged data in a file and it will be ignored by any reader which doesn't understand that tag.
MS's searchable TIFF is your typical MS cre
Don't do this. But if you insist, here's how. (Score:5, Informative)
> Are TIFFs better than PDFs for future use? I wonder what format you think would last longer. Are there any other formats that you think would be better or more future-proof? To me, storage is not a good enough reason to go to TIFF, because storage prices are always dropping anyway.
Don't use TIFF. Stay with PDF. PDF is what all the big digital libraries are using. It's a proper standard, it's readable and writable by lots of free open source software, so even if Adobe disappears in a puff of intellectual property, you'll still be able to read your documents.
TIFF, on the other hand, is a container format (like AVI, but worse). It isn't fully supported by every program - what sort of TIFF do you want, anyway? Compressed with LZW? With RLE? Not compressed at all? There's free software that will read and write the most common types of TIFF, so you can certainly do it, but why give up the convenience of using PDF?
> Also, since they already have many of these files in PDF format and they want to convert them into multipage TIFFs, are there any programs that you can recommend that will perform batch processing of files so that we do not have to convert each PDF one by one?
Use ghostscript. Use something like the following command line:
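The command line itself appears to be missing here. A hedged sketch that fits the description that follows (300 dpi, one TIFF per page, named output01.tiff, output02.tiff, ...): the tiffg4 device, the helper name pdf_to_tiff_pages and the DRY_RUN switch are assumptions for illustration, not the original poster's choices.

```shell
#!/bin/bash
# pdf_to_tiff_pages INPUT [PREFIX] [DPI] [DEVICE]  (hypothetical helper)
# Renders each page of INPUT to PREFIX01.tiff, PREFIX02.tiff, ... with Ghostscript.
# Set DRY_RUN=1 to print the gs command instead of running it.
pdf_to_tiff_pages() {
    local input="$1" prefix="${2:-output}" dpi="${3:-300}" device="${4:-tiffg4}"
    # %02d in -sOutputFile is expanded by Ghostscript to the page number.
    set -- gs -q -dNOPAUSE -dBATCH "-sDEVICE=$device" "-r$dpi" \
        "-sOutputFile=${prefix}%02d.tiff" "$input"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Running DRY_RUN=1 pdf_to_tiff_pages input.pdf first shows exactly what would be executed. Note that tiffg4 is a bilevel device, so greyscale or colour scans would need a different DEVICE.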
This turns input.pdf into a series of 300 dpi tiff files, one for each page, called output01.tiff, output02.tiff, etc. Change the DEVICE to get a different sort of tiff file, and use gs --help to get a list of options. You can easily wrap this command in a script of almost any sort to make the process fully automatic.
Re: (Score:2)
Yeah, not to mention that PDF supports several compression formats for images including zip, jpeg, and jpeg2000 at various quality levels if storage is such a concern. Just playing around with the image from the wiki article, file sizes range from 4795kb for uncompressed, to 285kb for maximum quality (but lossy) jpeg2k. A jpeg compressed tiff is about 660kb.
Now, I've never had to deal with aperture cards IRL, so I'm not sure about the following: a lot of space seems to be wasted to preserve a few pieces of
Re: (Score:2)
Anything a TIFF can do, a PDF can do better. I can take the very same images that you have in your TIFF, compress them with a better algorithm and embed them into a PDF, and end up with something smaller.
All I can conclude is that you were clearly utterly incompetent at generating PDFs.
Re: (Score:2)
Comment removed (Score:3, Informative)
Re: (Score:1)
http://www.imagemagick.org/script/binary-releases.php#windows [imagemagick.org]
Re: (Score:2, Informative)
Re: (Score:1)
Media or format? (Score:2)
In any case, the PDF and TIFF file formats are well documented, and even if, despite their widespread use, they ever go extinct (bloody unlikely), it would always be possible to write a program to convert them into the format-du-jour, provided, of course, you are able to read the media...
PDFs... (Score:3, Informative)
Acrobat has batch processing, and can convert pdfs to TIFF, JPEG, PNG and more. That would be my suggestion if you are really going to convert to TIFFs.
Re: (Score:1)
Re: (Score:2)
Don't forget that TIFF has a 4GB limit, due to all the file offsets being encoded as 32 bit values. If a TIFF tag can not store the data directly in the tag body, it stores the location of the data as an offset from the start of the file.
Re: (Score:1)
Re: (Score:3, Informative)
Re: (Score:2)
Or I could argue that storage costs will go up over time, not down, so your argument over file size is moot. A properly made pdf will have a reasonable file size.
Re: (Score:2)
pdf2tiff.sh (Score:4, Informative)
let's not reinvent the wheel -- I did this about 9 months ago //wolfmann -- and this code is Public domain (done on federal gov't time):
# cat pdf2tiff.sh
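#!/bin/bash
# Convert each PDF one directory down into a multi-page TIFF beside it.
device=tiffg4    # CCITT Group 4, bilevel; tiffg3 would select Group 3
for file in */*.pdf    # for each pdf
do
    [ -e "$file" ] || continue    # the glob matched nothing
    filename=$(echo "$file" | cut -d'.' -f1)    # path up to the first dot
    if [ ! -e "$filename.tiff" ]
    then
        echo "gs -q -dNOPAUSE -dBATCH -sDEVICE=$device -sOutputFile=$filename.tiff $file"
        gs -q -dNOPAUSE -dBATCH -sDEVICE="$device" -sOutputFile="$filename.tiff" "$file" 2>/dev/null
    else
        echo "$filename.tiff exists! skipping..."
    fi
done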
Re: (Score:2, Informative)
#!/bin/bash
for file in `find . -name '*.pdf'`
do
filename=${file%%.*}
This will recurse into all subdirectories, and it will work on files with multiple dots in the name (e.g. applicant.12345.cv.pdf).
Re: (Score:1, Informative)
Re: (Score:3, Interesting)
This code will not work on files with spaces in them, because the back-ticks will expand spaces without escapes while the shell globbing used in the original code will escape them. Since most PDF titles I've seen use spaces in their names, this is important. The rest of those modifications will help, though.
Were I coding this script, I'd write two (since xargs can't use a function). One for the inner loop, and one for the outer loop (or if only running it once, I'd do the outer loop on the command line
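One way to get the recursion without tripping over spaces is to let find hand the names straight to a small inner shell instead of expanding them with back-ticks. A sketch under stated assumptions: convert_tree and its CONVERTER argument are hypothetical names, and CONVERTER stands for whatever per-file command you use (it defaults to echo, so nothing is converted until you supply one).

```shell
#!/bin/bash
# convert_tree ROOT [CONVERTER]  (hypothetical helper)
# Runs "CONVERTER src.pdf dst.tiff" for every PDF under ROOT.
# CONVERTER defaults to echo, which makes the default a dry run.
convert_tree() {
    local root="${1:-.}"
    [ "$#" -gt 0 ] && shift
    # find passes each path as one argument, so spaces survive intact.
    find "$root" -type f -name '*.pdf' -exec sh -c '
        converter=$1; shift
        for src do
            dst="${src%.*}.tiff"        # drop only the final extension
            if [ -e "$dst" ]; then
                echo "$dst exists! skipping..."
            else
                "$converter" "$src" "$dst"
            fi
        done' sh "${1:-echo}" {} +
}
```

Calling convert_tree /scans prints the source and target pairs it would process; passing the path of a one-file conversion script as the second argument does the real work.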
Re: (Score:2)
Oops, there should be only one percent sign in the target="${file%.*}.tiff" line. This enables your use of "User Manual 3.7.pdf" -> "User Manual 3.7.tiff" instead of "User Manual 3.tiff" as the above AC posted (and I copied). Also, the second script should have quotes around the first argument to find, which should use @ instead of 1, which respects multiple arguments and the spaces within them while still defaulting to the current working directory when there are no arguments.
I've incorporated this
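The one-versus-two percent sign difference is easy to check for yourself. A tiny sketch (the function names strip_ext and strip_from_first_dot are made up for the demonstration):

```shell
#!/bin/bash
# ${var%pattern}  removes the SHORTEST match of pattern from the end.
# ${var%%pattern} removes the LONGEST match of pattern from the end.
strip_ext()            { printf '%s\n' "${1%.*}"; }     # drops only the last ".ext"
strip_from_first_dot() { printf '%s\n' "${1%%.*}"; }    # cuts at the FIRST dot

strip_ext "User Manual 3.7.pdf"               # prints: User Manual 3.7
strip_from_first_dot "User Manual 3.7.pdf"    # prints: User Manual 3
```

The double-percent form is especially nasty with find output, since a path such as ./applicant.12345.cv.pdf starts with a dot and is reduced to an empty string.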
It doesn't matter. Neither is great. (Score:4, Informative)
How *much* smaller are these TIFFs anyway? TIFF is actually a container format, and can support all sorts of compression, some of them proprietary, some of them common. Not all of them are lossless either (TIFF-Jpeg is a perfectly valid combination, and was used before the days of Exif to add metadata to jpegs). TIFFs can also include vectorized data. It's not all that much less complicated than a PDF.
PDFs are also a container format to an extent. You could very well have a TIFF embedded in a PDF. Fortunately for us, the PDF specification is a bit more stringent on what is supported and what isn't, and PDFs tend to work just about everywhere (especially if all that you've got is an image). You can also apply all sorts of compression to PDFs to reduce their file size... these might not be quite as well supported.
Both formats are extremely common, and it's extremely unlikely that you'll ever have to do any sort of conversion to display them. If I had to place money on it, I'd wager that PDF will be in widespread use for longer than TIFF, though neither format seems to be going anywhere anytime soon. You're more likely to have to worry about the storage devices you're using and the longevity of the media.
If you just need to store lossless raster images, PNG might be a good bet. It's a "Free" format, and is officially endorsed as an ISO/IEC standard. TIFF is copyrighted by Adobe. PNG also has the advantage of being a complete image format, rather than just a container, which means that any software that can open a PNG image should be able to open *any* PNG. Because of its open-sourceness and widespread adoption, PNG will be around for a long time to come as well. Once again, the storage medium and filesystem that you use to store the images is very likely going to become obsolete before the file format itself.
Granted, PNG's compression algorithm isn't optimized for photographic data, though the image formats that *are* optimized for this purpose are neither common nor free.
In summary, there's no reason that a PDF needs to be terribly larger than a TIFF (the overhead should be especially negligible if you've got lots of images at a high resolution). Neither format is going away anytime soon, but both have quirks that can hurt you in the future (multi-page TIFFs are even somewhat of an oddity today). If you really want small files and future-proofing, go with PNG. Otherwise, it's more or less a non-issue.
Re: (Score:2)
PDF/A (Score:5, Insightful)
When you discuss archives you think about looong times. Typically 10 to 50 years of retention, with the odd exception where eternity is desired.
Hence "plain" PDF is probably even worse than TIFF. One problem here are the included resources (fonts) and references (http links) which are mostly left out in order to save disk space. The other problem is that there are so many "plain" PDF versions to choose from and none of them will last 10 to 50 years.
However, PDF is a good technology, and therefore the PDF/A standard was developed. It is designed especially to deal with loooong term issues, is currently readable through almost any PDF reader and will be maintained by most sensible PDF readers for the years to come. There is NO vendor lock-in, and you can put text in a PDF/A document and run searches against it. But most importantly, NO proprietary stuff can be shoved in, as it would result in an invalid document (a PDF document maybe, but not a PDF/A document).
With the price of current disk space you should NOT make file size a defining criterion in your archiving policy. Only on z/OS disk space comes at absurd and ridiculous prices. If you can, try aiming for an archiving solution on Unix, Linux or even Windows.
I am in the archiving business. At the moment PDF/A is the only format suitable for archiving.
Speaking as an enterprise search specialist... (Score:3, Insightful)
A graphic image of text is like a wax apple - it looks and tastes like a replica.
Re: (Score:2)
You can OCR stuff, store the text in the PDF.
Re: (Score:2)
Re: (Score:2)
Well ... while I doubt it's doable, it would be really cool with a PDF->LaTeX program. Transform everything in the pdf into LaTeX files (except pictures). Would give you free text search and an easy way to compress the stuff really well.
Re: (Score:2)
PDF is the right answer (Score:2)
PDF makes sense for document signing, security, and damage detection. TIFF does not have any of this important security and data integrity protection by itself.
PDF also allows for the same compression on the scanned image that TIFF does, as well as much better compression methods available to it.
TIFF, while well-understood in the archival industry, has rather fledgling support in the free *NIX world--especially multi-page TIFF.
Finally, with PDF, you can preserve both the image and the OCR data all in the s
Tangential question - can TIFFs be dig. signed (Score:2)
I have a similar issue, but have chosen PDF because they meet the digital signature requirements of most professional licensure boards (architects and engineers worry about this stuff). It's not a large hurdle, just that the documents can be externally verified against a publicly available key. Adobe lets you do that for free (well, assuming you have their s/w; I can post a key on my website for a 3rd party to install and verify the signature).
This isn't a high-crypto requirement area - you can easily fa
Re: (Score:2)
Any file can be digitally signed with GPG or PGP. If your customers are used to looking for public keys on your site anyway, you might as well make them PGP pubkeys.
Re: (Score:2)
Even the pirate groups do that.
They'll have the "goods" and an md5sum.txt with the md5sum of the "goods", plus a GPG signature around the md5sum AND the GPG signature of the file. Then they zip it.
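The same checksum-plus-signature routine works for an archive of TIFFs or PDFs. A minimal sketch, assuming GNU md5sum is available; make_manifest and verify_manifest are hypothetical names, and sha256sum could be swapped in for md5sum if you want a stronger hash.

```shell
#!/bin/bash
# make_manifest DIR: write DIR/md5sum.txt covering every other file under DIR.
# The ( ) body runs in a subshell, so the cd does not leak to the caller.
make_manifest() (
    cd "$1" || exit 1
    find . -type f ! -name md5sum.txt -exec md5sum {} + > md5sum.txt
)

# verify_manifest DIR: succeed only if every listed file still matches.
verify_manifest() (
    cd "$1" || exit 1
    md5sum -c md5sum.txt > /dev/null
)

# Signing the manifest is then one detached signature:
#   gpg --armor --detach-sign md5sum.txt       # writes md5sum.txt.asc
#   gpg --verify md5sum.txt.asc md5sum.txt     # checked against the public key
```

Signing only the manifest keeps the image files themselves untouched, which matters if a viewer is picky about what it finds inside a TIFF.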
PDF is the King for a reason. (Score:2)
Re: (Score:2)
And I'd be careful trusting the PDF optimizer for file size reduction. Not always the best results. But I'd still put my money on PDF.
Another reason to go tiff (Score:2)
We have a document management system, and for it we had to choose between PDF and TIFF. We opted for TIFF. It had nothing to do with file size: the deciding factor was that we could find FAST TIFF viewers all day and night. It's probably not that PDF as a format is that much more bloated, but the readers, especially Acrobat Reader, take a LOT longer to start up.
We use an activeX control called alternatiff to view them in the browser (and yes, it does multipage) The control loads in the browser
Re: (Score:2, Informative)
You are so right about Adobe Acrobat.
For a stand-alone free PDF viewer take a look at Foxit.
http://www.foxitsoftware.com/pdf/rd_intro.php [foxitsoftware.com]
It's fast and small, and does the job.
Re: (Score:1)
Not free as in gratis or libre. Even that honky free offer is more BS.
Re: (Score:2)
How about this one then?
http://blog.kowalczyk.info/software/sumatrapdf/ [kowalczyk.info]
Re: (Score:2)
Tried Foxit on my network just a couple months ago. Uninstalled it from the test machines after a week. Too many PDFs either wouldn't open or caused Foxit to crash.
Can I make a suggestion? (Score:2)
Send them a quotation. If the money looks good, do it and don't bitch about it on slashdot. If it does not look good, decline the job and don't bitch about it on slashdot either. Either way, don't bitch about it on slashdot.
Go to a service bureau (Score:2)
Legal service firms work with all of these PDF and TIFF variants all of the time. They should be able to kick out whatever you need at x cents per page (which will usually be cheaper than your time/money)
The weird TIFF formats are used for various document management products, so it really depends mostly on your workflow.
PDF2IMG (Score:2)
The gold-standard tool for this is PDF2IMG [datalogics.com] which uses Adobe's own PDF rendering library but it'll set you back a few thousand dollars.
Ghostscript is good but it isn't perfect: it does choke on some PDFs, misrenders some and won't pick up non-embedded TTF fonts, only external PS fonts. It also doesn't do any anti-aliasing so you probably want to render large and sample down and (IIRC) there's a max image size it can render. But by and large it does just work.
Re: (Score:2)
It can do antialiasing [wisc.edu]
Shell script (Score:3, Informative)
Re: (Score:2)
Batch processing (Score:2)
Converting (Score:3, Insightful)
I hear a lot of talk about how to convert back and forth, but nobody's mentioning the thing that I would consider the most important:
When you convert from .pdf to .tif, are you losing data?
Most of these convert scripts seem to work by starting Ghostscript and rendering a .tif out of your PDF. This is a *terrible terrible idea*. What you'd really want to do is reach into the PDF itself, and extract the lossless images perfectly. Anything else is like printing the .PDF and scanning the printout - you might lose pixels, you might gain extra pixels, and you almost certainly won't be perfectly aligned with the "pixel grain" of the original image.
Unless you can guarantee that you're pulling out, pixel-by-pixel, the exact original data, I would stick with PDFs.
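For what it's worth, "reaching into the PDF" does not have to mean writing a parser. A hedged sketch using pdfimages from poppler-utils, which copies embedded images out rather than re-rendering pages; extract_scans is a hypothetical wrapper name, and the -list and -all options are assumed to be present (they are in reasonably recent poppler builds, but not in every pdfimages variant).

```shell
#!/bin/bash
# extract_scans PDF PREFIX  (hypothetical helper)
# Copies the raster images embedded in PDF to files named PREFIX-NNN.*
# without rendering the pages. Returns 2 if the PDF is missing and
# 3 if pdfimages is not installed.
extract_scans() {
    local pdf="$1" prefix="$2"
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 2; }
    command -v pdfimages >/dev/null 2>&1 ||
        { echo "pdfimages (poppler-utils) not found" >&2; return 3; }
    pdfimages -list "$pdf" &&          # report size and encoding per image
    pdfimages -all "$pdf" "$prefix"    # write each image in its native encoding
}
```

Comparing the widths and heights reported by -list with the original scan resolution is a quick sanity check that nothing was resampled when the PDF was built.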
Why bother? (Score:2, Informative)
As someone who writes software to view PDFs [icesoft.com], I can tell you this is completely pointless, since anything that saves scanned documents into PDF is really storing it as a TIFF image inside of the PDF anyway. The PDF container adds useful features for metadata, and is well documented [adobe.com], so shouldn't add any future-proof issues. And the overhead is probably a few kilobytes.
Max file size favors PDF (Score:3, Informative)
TIFF files have a maximum size of 4GB. (The "value offset" field of an IFD entry is a 32-bit value.) You can exceed this with 50 noisy pages. PDF files have a maximum size of 10 to the tenth power bytes. (The byte offset in a cross-reference table entry is a ten-digit decimal number.) That's 2.3 times the maximum TIFF file size.
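The arithmetic behind those figures, as a quick sketch. The variable names are made up, and the per-page number simply divides the limit evenly across 50 pages.

```shell
#!/bin/bash
# Size limits implied by the offset fields described above.
tiff_max=$(( 1 << 32 ))     # 32-bit offsets: 4294967296 bytes
pdf_max=10000000000         # ten decimal digits: 10^10 bytes

# Integer math only, so scale the ratio by 100 to keep two decimals.
ratio_x100=$(( pdf_max * 100 / tiff_max ))
printf 'PDF limit / TIFF limit: %d.%02d\n' \
    $(( ratio_x100 / 100 )) $(( ratio_x100 % 100 ))

# Average page size at which 50 pages fill a TIFF, in MiB.
per_page_mib=$(( tiff_max / 50 / 1048576 ))
printf 'Page size that fills a 50-page TIFF: about %d MiB\n' "$per_page_mib"
```

So a 50-page file hits the TIFF ceiling at roughly 80 MiB per page, which is in the range of an uncompressed high-resolution colour scan.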
I have written software to create both TIFF and PDF files. I would use PDF for archiving. Even today, it's tricky to find a TIFF reader that will run on all the important platforms and handle the variety of compression flavors (e.g., JBIG2.)
netpbm (Score:1)
Implementations in minutes. Converts to most anything. Not the most efficient though
A couple applications... (Score:2)
I won't speak to why or whether you should do it, but here are a few options for how.
Doculex has an app called MPTiffIt [doculex.com] that will do single to multi or multi to single page tiff conversion. You'd need to convert the PDFs to single page tiff via Save As (or perhaps a Batch Process), then recombine them with MPTiffIt.
Or, you could use a Tiff printer driver along with a batch printing software.
Personally, I'd use L.A.W. [imagecap.com] (Legal Access Ware, created by Image Capture Engineering and now owned by Lexis Nexis), wh
decode (Score:2)
While you're at it, why not decode the data punched on each card and then just store the microfilm image and the decoded data, discarding the image of the rest of the card? That'd make things a lot more efficient.
Re: (Score:2)