Open Source Software

Ask Slashdot: Best PDF Handling Library? (132 comments)

New submitter Fotis Georgatos (3006465) writes: I recently engaged in a conversation about handling PDF texts for a range of needs, such as creation, manipulation, merging, text extraction and searching, digital signing, etc. A couple of potential picks popped up (PDFBox, iText), given the Java experience of the other participants. And then comes the reality of choosing software as a long-term knowledge investment! Ideally, we would like to combine these features:
  • open source, with a community following; the kind of stuff Slashdotters would prefer
  • tidy software architecture; simple things should remain simple
  • an open API allowing usage across many languages (say: Python & Java)
  • clear licensing status, not estranging future commercial use
  • serious multilingual & font support
  • rich PDF-handling features that do not limit use for invoicing, e-commerce, reports & data mining
  • digital signing should not conflict with other features

I'd like to poll the collective wisdom of the Slashdot crowd about which PDF-related libraries, if any, they have written software with and that keep them happy for *all* the above reasons. And if not happy on all counts, what do they think is the best bet for learning one piece of software in the area, with great reusability across different circumstances and little need for extra hacks? I'd really like to hear the smoked-out war stories. It is easy to obtain a list of such libraries, yet tricky to understand whether people have had success with them!

This discussion has been archived. No new comments can be posted.

  • ReportLab (Score:4, Informative)

    by Anonymous Coward on Thursday August 07, 2014 @02:21PM (#47624691)

    Python only, but I've used it successfully.

  • PDFLib (Score:2, Insightful)

    by Anonymous Coward

    Well, there's PDFLib, which is hideously expensive and not open source, but if you're after a professional package for serious purposes that just works, I can only recommend it.

    • Re:PDFLib (Score:4, Informative)

      by Anonymous Coward on Thursday August 07, 2014 @02:33PM (#47624799)

      PDFlib is cheap compared to licensing Adobe's libraries from DataLogics (speaking as one who switched from the latter to the former). A full source license for pdflib and tetlib was much less than Adobe/DataLogics' non-source license... less than 1 FTE. Then again, your mileage may vary.

      PDFLib happens to be the cleanest and best PDF code solution I've ever worked with.

      • You know what's funny? PDFLib is what Adobe calls the set of core tech libraries that generate PDF files inside their own apps. (Source: I used to work at Adobe, on Acrobat no less.)

    • TCPDF [tcpdf.org] Open-source PDF-reader built in PHP
      FPDF [fpdf.org] Combine with TCPDF above to create a PDF-writer using PHP
      Setasign [setasign.com] Not open-source, but this company offers both free and paid libraries that combine with the libraries above to allow PDF encryption / decryption using PHP. The paid versions support more complex ciphers, and I swear by them personally.

      Not sure if you meant desktop software or...
    • by josecanuc ( 91 )

      I second this. PDFlib is good software for making PDFs. Their TET tool for extracting text can return to you where (coordinates) each letter on a page is, if you desire, or just dump the whole page or each word at a time, etc.

  • pdf.js (Score:5, Informative)

    by AntiTuX ( 202333 ) on Thursday August 07, 2014 @02:30PM (#47624771) Homepage

    pdf.js is great for parsing and manipulating pdfs.

    I can't go into great detail as to how I've used it (still under NDA), but its rendering and manipulation of PDFs is pretty darn good.

    As for converting office formats to pdf, your best bet is to use office automation. It can be built to scale up, but it needs a lot of work to do so.

    • Office Automation is problematic -- because it literally opens up a hidden window of your Office app and simulates clicking around the UI to do what you need. If something unexpected happens, it can unhide the window to show the user a message. This might be good enough for a desktop app, but if you're running it on a server it'll just freeze up your process with no one there to click it.

      For Office->PDF conversion of word docs, Aspose.Words has a fairly easy API and generally very accurate rendering. I hig

      • Re: (Score:3, Informative)

        No...it does not simulate clicking. It uses the underlying COM representation to perform its functions. That said, it does not work well in a multi-threaded environment, nor where you can't set up a user (e.g. restricted web server credentials). So you either have to impersonate or use COM+ configuration to run the Office tool under a different user name. So if you're just starting out...do not use Office automation in a server environment unless you're willing to deal with these issues. Try Aspose as suggeste
        • by Jaime2 ( 824950 )
          I wouldn't recommend Office Automation on a server if there is any alternative. For beginners, there are too many gotchas, and for advanced users, there are plenty of alternatives that will do what you want without too much difficulty. Office with .Net is especially problematic because the COM components run as out-of-process servers and, due to .Net's garbage collection and COM interoperability, they are difficult to get to shut down properly.
  • by jab ( 9153 ) on Thursday August 07, 2014 @02:33PM (#47624797) Homepage

    I've found these tools useful, with an honorable mention to gnupdf. I've never used it personally, but the code looks pretty solid. That said, when I really needed to produce great multilingual PDF I pulled out the PDF spec, gritted my teeth, and generated it directly.

    leptonica - turn images into PDF
    tesseract - turn images into searchable PDF
    qpdf - linearize PDF for random access over HTTP
    jhove - basic validation
    jhove-pdf-a - validation with better compatibility guarantees
    pdftk - command line tool for splicing pages together or apart
    ttx/FontTools - tool for modifying custom fonts
    reportlab - python library, easy to use but works best with Latin scripts
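For readers wondering what generating a PDF "directly" from the spec involves, here is a minimal sketch in Python using only the standard library. It is not the poster's code: the function name `build_minimal_pdf`, the page size, and the text position are assumptions made for illustration. It writes five objects, a cross-reference table, and a trailer, and it only handles ASCII text in the built-in Helvetica font. Proper multilingual output also needs embedded fonts and encodings, which this sketch does not attempt.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF showing a single line of ASCII text."""
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: select font F1 at 24pt, move to (72, 720), show the text.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Classic cross-reference table: every entry is exactly 20 bytes long.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello from a hand-written PDF"))
```

The bookkeeping (byte offsets, stream lengths, fixed-width xref entries) is the tedious part, and it is what a library would normally hide.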

  • by Anonymous Coward

    You might try iText:

    pdftk uses it.

    • by mark-t ( 151149 )
      I would have posted to make the same recommendation if someone else hadn't already mentioned it, so I'll just follow up with another recommendation for itext. It's a pretty easy to use library, and it's been around for a while, so it's pretty stable.
      • Hi mark-t, OP here.

        iText appears to miss two or three major targets:
        • software architecture/api may not be too tidy; this is word of others, I cannot verify it since I've never used it
        • unclear licensing scheme, not free of commercial clashes
        • due to the above, community has been fragmented in 3 (?) big fragments :-(
    • by Stalus ( 646102 )

      The problem with iText is that it used to be MPL, but the maintainer got ticked off at commercial users several years ago and changed the license to AGPL. Apparently now they're relaxing the license for a fee, but they've changed their mind before - no guarantees that they won't change it again.

  • by Eponymous Coward ( 6097 ) on Thursday August 07, 2014 @02:36PM (#47624827)

    The best PDF software I've ever used is Prince XML.

    For years, we got by with HTMLDoc but finally dumped it because we absolutely needed Unicode support.

    After trying many different packages, we settled on Prince. Our main constraints were performance-related, which you apparently aren't worried about, so maybe it's overkill for what you need.

    • by Elgonn ( 921934 )
      I'll have to second the recommendation for PrinceXml.

      After investigating and trying at least 9 other open source kits I eventually gave up and went with PrinceXml. You can try the 'trial' version easily and it just works easily. Their support is actually good as well. I wish there was a good pdf toolkit that was open source. But they all seem to just do one odd piece of the puzzle poorly.
    • PrinceXML is reliable, simple and produces the most beautiful PDFs ever. We've used it to replace InDesign as a tool for high end magazine page generation and have analysed the output of both - PrinceXML is significantly cleaner. However, it does help if you combine it with an image (re)sizing tool otherwise you end up with huge bloat with oversized images embedded in your PDF.

  • by RobotRunAmok ( 595286 ) on Thursday August 07, 2014 @02:42PM (#47624901)

    See, now thanks to you I have to clean all this coffee off my monitor...

  • by Anonymous Coward

    Make this a requirement: support for producing PDF/A.

  • by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Thursday August 07, 2014 @02:51PM (#47625017) Homepage

    At least on the C# side of things, the three libraries I've used (iTextSharp, PdfSharp, and Aspose.Pdf) are all a bit of an unintuitive mess with inconsistencies all over the place and very little documentation. In the case of iText, their revenue stream is putting all their documentation into a book for people to buy, so it's not uncommon to get an intentionally vague response when asking for help.

    I cycle between each depending on what I need to do, because they all have their own quirks and supported features. I've even piped from one to another to get certain parts of the process working.

    Good luck.

    • It could be that iText is just what he needs though. iTextSharp is the C# port of the original iText Java library. At times, it is easier to find code examples for iText than iTextSharp. Since the iTextSharp folks did their best to use C# conventions, the Java call names aren't always the same as the C# ones.

    • I like the APIs for Apple's PDFKit, but without cross-platform support I wouldn't use it for anything other than toy projects.
  • try... (Score:4, Informative)

    by Anonymous Coward on Thursday August 07, 2014 @02:54PM (#47625057)

    sudo apt-get install ghostscript pdftk poppler-utils

    ghostscript: /usr/bin/dvipdf
    ghostscript: /usr/bin/pdf2dsc
    ghostscript: /usr/bin/pdf2ps
    ghostscript: /usr/bin/pdfopt
    ghostscript: /usr/bin/ps2pdf
    pdftk: /usr/bin/pdftk
    poppler-utils: /usr/bin/pdffonts
    poppler-utils: /usr/bin/pdfimages
    poppler-utils: /usr/bin/pdfinfo
    poppler-utils: /usr/bin/pdfseparate
    poppler-utils: /usr/bin/pdftocairo
    poppler-utils: /usr/bin/pdftohtml
    poppler-utils: /usr/bin/pdftoppm
    poppler-utils: /usr/bin/pdftops
    poppler-utils: /usr/bin/pdftotext
    poppler-utils: /usr/bin/pdfunite

    • This. These three are what you need; you can then script a wrapper around them if you need to, but they'll provide you with everything you need as far as actual manipulation and display goes. Poppler keeps it simple, pdftk can handle most manipulation needs, and ghostscript is there to cover any esoteric issues that still fall under PostScript/PDF/EPS.

      Might want to also include imagemagick, for import/export/optimization of most image formats you might be bursting/adding in PDF.
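A wrapper of the kind suggested above could look roughly like this Python sketch. The helper names (`merge_cmd`, `text_cmd`, `to_pdf_cmd`, `run`) are hypothetical; the command lines use only the common invocations `pdftk ... cat output`, `pdftotext -layout`, and `ps2pdf`. Each helper returns an argument list rather than a shell string, so file names are never interpreted by a shell.

```python
import shutil
import subprocess
from typing import List, Sequence


def merge_cmd(inputs: Sequence[str], output: str) -> List[str]:
    """pdftk: concatenate whole input files into one output file."""
    return ["pdftk", *inputs, "cat", "output", output]


def text_cmd(pdf: str, txt: str, keep_layout: bool = True) -> List[str]:
    """poppler-utils: extract text, optionally keeping the physical layout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [pdf, txt]


def to_pdf_cmd(ps: str, pdf: str) -> List[str]:
    """ghostscript: convert a PostScript file to PDF via ps2pdf."""
    return ["ps2pdf", ps, pdf]


def run(cmd: List[str]) -> None:
    """Run one of the commands above, failing loudly if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not installed or not on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example file names are placeholders.
    run(merge_cmd(["cover.pdf", "body.pdf"], "book.pdf"))
    run(text_cmd("book.pdf", "book.txt"))
```

Keeping command construction separate from execution makes the wrapper easy to test without the tools installed.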

  • itext [itextpdf.com] may be one of those. It comes under AGPL and under a commercial license if you buy support from the company.

  • One word: PDFLib (Score:5, Informative)

    by Qbertino ( 265505 ) <moiraNO@SPAMmodparlor.com> on Thursday August 07, 2014 @04:08PM (#47625671)

    PDFLib GmbH [pdflib.com] (a German LLC) builds exactly one product: PDFLib. And they've been doing that since 1997. AFAIK the company was run by one guy - the initial developer - alone for most of the time. Now it's probably a shop of 5 or so.

    So it's not FOSS - yeah, that's a real shame. But the devs get to eat, you can demand service and response if you run into a bug and you can expect a good product and with PDFLib you're probably going to get it too.

    I haven't come across a single project doing non-trivial PDF stuff that doesn't use PDFLib. I've used it myself a little, and the cookbook that comes with the product was very good, so it comes recommended.

    My 2 cents.

    • I second PDFLib.

      I've created approximately 3 billion pages of PDF with it, since 2000. Very, very well done. The library is well thought out, and it can work even with bindings to languages that you would not think are usable. It's fast, really has a nice scope model, has a nice consistency, and rounds off the edges of PDF better than anything else with it.

      If you combine it with their import library and pCOS library, it can do almost everything you want. The developers are helpful and don't mess around.

  • Java-only. Commercial. Feature-rich. Great company name. [faceless.org]

  • We used PDFClown [stefanochizzolini.it] (open source - Java) for production applications that generated PDFs with hundreds of pages on the fly, and it performed very well.

  • iText meets some of your criteria.

    * open source, with a community following ; the kind of stuff Slashdotters would prefer = yep. Including several commercial books
    * tidy software architecture; simple things should remain simple = It's big and complete.
    * allow open API allowing usage across many languages (say: Python & Java) = Native Java. iText# is a port to .NET. Not a good fit for dynamic languages.
    * clear licensing status, not estranging future commercial use = AGPL + commercial license. Clear but

  • I started with ReportLab (the open source parts) and found it too low-level, so I considered using the commercial edition because it has a templating language. As I was not very fond of investing time in learning yet another templating language, I reconsidered and gave HTML with CSS a try for printing. I used wkhtmltopdf for a while but switched to WeasyPrint in the end: it was created for using HTML with CSS for printing and seemed to be more actively developed than wkhtmltopdf.
  • Does anyone have experience with MuPDF? http://www.mupdf.com/ [mupdf.com] It's open-source, but requires a license for commercial use. It appears to offer the best performance and portability. Its top-level application (MuPDF) is highly rated across most platforms: Google Play, Apple Store and Ubuntu.
  • Is there anything that can handle the gruesome CT600 forms that the UK Tax authority require us to fill in every year? These have lots of embedded scripting and can only be read with Acrobat Reader. However, this year, Adobe have stopped releasing Acrobat for Linux.

    (An added bonus: the internal logic of the CT600 is buggy. For example, if a particular tax option does not apply, it is fussy about the distinction of 0 vs empty, and this leads to subsequent validation errors, naturally with confusing messages.)

  • I needed to lay out a novel in a PDF. I've previously worked with iText and prefer not to construct a PDF one element at a time. I wanted an HTML to PDF workflow. I then tried wkhtmltopdf, but it doesn't support most of the hardcore design needs: hyphenation, widow/orphan control, alternating page margins, and page headers and footers based on the section of a document. PrinceXML supports all that. Writing only CSS, and based on HTML content, you'll be able to replicate anything a designer can do in InDesig
  • by sgrover ( 1167171 ) on Friday August 08, 2014 @01:56AM (#47628221) Homepage

    Not sure how current it is, but when I was looking for the same a few years back, all that was really available for PHP was HTML->PDF libraries, which were not sufficient for anything but the most basic forms. A decent invoice form was hard to get right with these tools. Then I came across FOP. Or more specifically XML-FOP. Combine that with a little XSL and the output was amazing, and could do more than the HTML converters. The only problem is that the FOP tool was a Java-based program, so PHP would need to execute a shell command to call it. With tight control of what info was passed to that shell command, it seemed an appropriate trade-off for the job at hand. You can still get FOP in the Ubuntu repos - apt-get install fop. The learning curve for FOP is a little steep to begin with, but no more than any other XML dialect. And being XML, you have a lot of options in building the required FOP file. I opted to put my data into my own XML file, then utilize an XSL file to convert it if/when needed. More details here: http://xmlgraphics.apache.org/... [apache.org]
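The workflow described above (own XML data file, an XSL stylesheet, and a tightly controlled call to the `fop` command) might be sketched as follows. The original setup used PHP; this illustration is in Python, and the element names, file names, and function names are all hypothetical. The `-xml`, `-xsl`, and `-pdf` options are the ones the FOP command line accepts for this kind of transform. Passing an argument list without a shell is one way to keep control over what reaches the command.

```python
import subprocess
import xml.etree.ElementTree as ET
from typing import Dict, List


def invoice_xml(number: str, lines: List[Dict[str, str]]) -> bytes:
    """Serialize invoice data to XML; ElementTree escapes &, < and > for us."""
    root = ET.Element("invoice", {"number": number})
    for line in lines:
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "description").text = line["description"]
        ET.SubElement(item, "amount").text = line["amount"]
    return ET.tostring(root, encoding="utf-8")


def fop_cmd(xml_path: str, xsl_path: str, pdf_path: str) -> List[str]:
    """Build the FOP call as an argument list, so no shell ever parses it."""
    return ["fop", "-xml", xml_path, "-xsl", xsl_path, "-pdf", pdf_path]


def render(xml_path: str, xsl_path: str, pdf_path: str) -> None:
    """Run FOP; raises CalledProcessError if the transform fails."""
    subprocess.run(fop_cmd(xml_path, xsl_path, pdf_path), check=True)


if __name__ == "__main__":
    data = invoice_xml("2014-001", [{"description": "Consulting", "amount": "400.00"}])
    with open("invoice.xml", "wb") as handle:
        handle.write(data)
    # invoice.xsl is assumed to exist and to map <invoice> to XSL-FO.
    render("invoice.xml", "invoice.xsl", "invoice.pdf")
```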

  • dpkg -l | grep pdf | grep lib | grep ii
    ii libqpdf13:i386 5.1.1-1 i386 runtime library for PDF transformation/inspection software

  • Are both quite good; as their names suggest, these are Perl modules though.

    • PDF::API2 is nice, but unfortunately it doesn't handle newer PDFs with compressed xrefs and/or object streams yet. Also, support for writing text in anything different from ASCII and maybe Latin-1 is close to missing.
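To check in advance whether a file uses the newer structures mentioned above, a rough byte-level heuristic can help. The sketch below is in Python rather than Perl, and the function name `scan_pdf_structures` is made up for illustration. It relies on the fact that cross-reference streams and object streams declare themselves with `/Type /XRef` and `/Type /ObjStm` in uncompressed stream dictionaries. It is only a heuristic: the same bytes could in principle appear elsewhere in a file, so treat a positive result as a hint, not proof.

```python
import re
from typing import Dict

# Stream dictionaries are stored uncompressed, so a plain byte scan finds them.
XREF_STREAM = re.compile(rb"/Type\s*/XRef\b")
OBJECT_STREAM = re.compile(rb"/Type\s*/ObjStm\b")


def scan_pdf_structures(data: bytes) -> Dict[str, bool]:
    """Report whether the bytes look like they use PDF 1.5+ compressed structures."""
    return {
        "xref_stream": bool(XREF_STREAM.search(data)),
        "object_stream": bool(OBJECT_STREAM.search(data)),
    }


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(scan_pdf_structures(handle.read()))
```

Files flagged this way could be rewritten with a tool such as qpdf before being handed to a library that only understands classic cross-reference tables.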
  • by phrank ( 112038 )

    libHaru is a free C library (install libhpdf-dev in Debian) which supports generation, annotation, compression, encryption. See http://libharu.org/

  • At least on the rendering side, I have found MuPDF to be really good and stable: http://mupdf.com/ [mupdf.com]

  • I'll be honest that I don't have a broad range of experience with libraries. I've used a couple of html-to-pdf implementations and PDFSharp. The licensing for PDFSharp is very permissive, support can be paid for if required and the library is quite fast. As an aside, it has a cousin, MigraDoc, which produces abstract documents which you can finalise to Office formats, if you need that too.

    IMO, there is no perfect tool, but PDFSharp has served me well.

  • I think its a most urgent for lot of it and internet related, we wait for more update information in next
