Open Source Software

Ask Slashdot: Best PDF Handling Library? 132

New submitter Fotis Georgatos (3006465) writes: I recently engaged in a conversation about handling PDF texts for a range of needs, such as creation, manipulation, merging, text extraction and searching, digital signing, etc. A couple of potential picks popped up (PDFBox, iText), given some Java experience of the other fellows. And then comes the reality of choosing software as a long-term knowledge investment! Ideally, we would like to combine these features:
  • open source, with a community following; the kind of stuff Slashdotters would prefer
  • tidy software architecture; simple things should remain simple
  • an open API allowing usage across many languages (say: Python & Java)
  • clear licensing status, not estranging future commercial use
  • serious multilingual & font support
  • PDF-handling rich features, not limiting usage for invoicing, e-commerce, reports & data mining
  • digital signing should not go against other features

I'd like to poll the collective Slashdot crowd wisdom about which PDF-related libraries, if any, people have written software with that keep them happy for *all* the above reasons. And if they are not happy on all counts, what do they think is the best bet for learning one piece of software in the area, with great reusability across different circumstances and little need for extra hacks? I'd really like to hear the smoked-out war stories. It is easy to obtain a list of such libraries, yet tricky to understand whether people have had success with them!

This discussion has been archived. No new comments can be posted.

  • ReportLab (Score:4, Informative)

    by Anonymous Coward on Thursday August 07, 2014 @02:21PM (#47624691)

    Python only, but I've used it successfully.

  • pdf.js (Score:5, Informative)

    by AntiTuX ( 202333 ) on Thursday August 07, 2014 @02:30PM (#47624771) Homepage

    pdf.js is great for parsing and manipulating pdfs.

    I can't go into great detail as to how I've used it (still under NDA), but its rendering and manipulating of PDFs is pretty darn good.

    As for converting office formats to pdf, your best bet is to use office automation. It can be built to scale up, but it needs a lot of work to do so.

  • by jab ( 9153 ) on Thursday August 07, 2014 @02:33PM (#47624797) Homepage

    I've found these tools useful, with an honorable mention to gnupdf. I've never used it personally, but the code looks pretty solid. That said, when I really needed to produce great multilingual PDF I pulled out the PDF spec, gritted my teeth, and generated it directly.

    leptonica - turn images into PDF
    tesseract - turn images into searchable PDF
    qpdf - linearize PDF for random access over HTTP
    jhove - basic validation
    jhove-pdf-a - validation with better compatibility guarantees
    pdftk - command line tool for splicing pages together or apart
    ttx/FontTools - tool for modifying custom fonts
    reportlab - python library, easy to use but works best with Latin scripts
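To give a feel for what "generating it directly" from the PDF spec involves, here is a minimal sketch in plain Python, not the commenter's actual code. It hand-assembles a one-page PDF: header, five objects, a cross-reference table with byte offsets, and a trailer. The function name `build_minimal_pdf` is made up for illustration. It assumes the built-in Helvetica font, which only covers Latin text, so it sidesteps the hard multilingual part (font embedding and encodings) entirely.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF by hand: header, objects, xref table, trailer."""
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Page content stream: select font F1 at 24pt, move to (72, 720), show text.
    # Assumption: Latin-1 text only; anything else raises UnicodeEncodeError.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the returned bytes to a file opened in binary mode (for example `hello.pdf`) should give something a PDF viewer can open. Every byte offset has to be tracked by hand, which is the teeth-gritting part.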

  • Re:PDFLib (Score:4, Informative)

    by Anonymous Coward on Thursday August 07, 2014 @02:33PM (#47624799)

    PDFlib is cheap compared to licensing Adobe's libraries from DataLogics (speaking as one who switched from the latter to the former).... A full source license for pdflib and tetlib was much less than Adobe/DataLogics' non-source license... less than 1 FTE. Then again, your mileage may vary.

    PDFLib happens to be the cleanest and best PDF code solution I've ever worked with.

  • by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Thursday August 07, 2014 @02:51PM (#47625017) Homepage

    At least on the C# side of things, the three libraries I've used (iTextSharp, PdfSharp, and Aspose.Pdf) are all a bit of an unintuitive mess with inconsistencies all over the place and very little documentation. In the case of iText, their revenue stream is putting all their documentation into a book for people to buy, so it's not uncommon to get an intentionally vague response when asking for help.

    I cycle between each depending on what I need to do, because they all have their own quirks and supported features. I've even piped from one to another to get certain parts of the process working.

    Good luck.

  • try... (Score:4, Informative)

    by Anonymous Coward on Thursday August 07, 2014 @02:54PM (#47625057)

    sudo apt-get install ghostscript pdftk poppler-utils

    ghostscript: /usr/bin/dvipdf
    ghostscript: /usr/bin/pdf2dsc
    ghostscript: /usr/bin/pdf2ps
    ghostscript: /usr/bin/pdfopt
    ghostscript: /usr/bin/ps2pdf
    pdftk: /usr/bin/pdftk
    poppler-utils: /usr/bin/pdffonts
    poppler-utils: /usr/bin/pdfimages
    poppler-utils: /usr/bin/pdfinfo
    poppler-utils: /usr/bin/pdfseparate
    poppler-utils: /usr/bin/pdftocairo
    poppler-utils: /usr/bin/pdftohtml
    poppler-utils: /usr/bin/pdftoppm
    poppler-utils: /usr/bin/pdftops
    poppler-utils: /usr/bin/pdftotext
    poppler-utils: /usr/bin/pdfunite
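The binaries listed above can be scripted from a library-style wrapper. The following is a rough sketch, not part of the original comment: the helper names are hypothetical, and the argument orders reflect the tools' usual command-line forms as I understand them, so check the man pages before relying on them.

```python
import shutil
import subprocess
from typing import List, Sequence


def merge_cmd(inputs: Sequence[str], output: str) -> List[str]:
    """pdfunite (poppler-utils): concatenate the inputs into one output file."""
    return ["pdfunite", *inputs, output]


def extract_text_cmd(pdf: str, txt: str) -> List[str]:
    """pdftotext (poppler-utils): dump the text, preserving the page layout."""
    return ["pdftotext", "-layout", pdf, txt]


def select_pages_cmd(pdf: str, page_range: str, output: str) -> List[str]:
    """pdftk: copy a page range such as "1-3" into a new file."""
    return ["pdftk", pdf, "cat", page_range, "output", output]


def run(cmd: Sequence[str]) -> None:
    """Run a command, failing early if the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(list(cmd), check=True)
```

For example, `run(merge_cmd(["a.pdf", "b.pdf"], "merged.pdf"))` would shell out to pdfunite. Building the argument list separately from running it keeps the commands easy to log and inspect.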

  • Re:pdf.js (Score:3, Informative)

    by Auction_God ( 1056590 ) on Thursday August 07, 2014 @03:31PM (#47625373)

    No...it does not simulate clicking. It uses the underlying COM representation to perform its functions. That said, it does not work well in a multi-threaded environment, nor where you can't set up a user (e.g. restricted web server credentials). So you either have to impersonate or use COM+ configuration to run the office tool under a different user name. So if you're just starting out...do not use Office automation in a server environment unless you're willing to deal with these issues. Try Aspose as suggested instead.

  • One word: PDFLib (Score:5, Informative)

    by Qbertino ( 265505 ) <moiraNO@SPAMmodparlor.com> on Thursday August 07, 2014 @04:08PM (#47625671)

    PDFLib GmbH [pdflib.com] (a German LLC) builds exactly one product: PDFLib. And they've been doing that since 1997. AFAIK the company was run by one guy - the initial developer - alone for most of the time. Now it's probably a shop of 5 or so.

    So it's not FOSS - yeah, that's a real shame. But the devs get to eat, you can demand service and a response if you run into a bug, and you can expect a good product; with PDFLib you're probably going to get it too.

    I haven't come across a single project doing non-trivial PDF stuff that doesn't use PDFLib. I've used it myself a little, and the cookbook that comes with the product was very good, so it comes recommended.

    My 2 cents.
