Parsing PDFs using Python

I’m part of a project that needs to import tabular data into a structured database from PDF files of both digital and analog origin.  [Digital input = PDF generated from computer applications; analog input = PDF generated from scanned paper documents.]

These are preliminary research notes I made for myself a while ago and am now publishing for reference by other project members.  They are neither conclusive nor comprehensive, but they are directionally relevant.

That is: the amount of work it takes code to parse structured data from analog-input PDFs is a significant hurdle, not to be underestimated (this blog post was the single most awe-inspiring find I made).  The strongest possible recommendation based on this research is GET AS MUCH OF THE DATA FROM DIGITAL SOURCES AS YOU CAN.

Packages/libraries/guidance

Evaluation of Packages

Possible issues

  • Encryption of the file (detectable up front; see the sketch after this list)
  • Compression of the file
  • Vector images, charts, graphs, other image formats
  • Form XObjects
  • Text contained in figures
  • Does text always appear in the same place on the page, or different every page/document?
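
A minimal sketch of how a few of these issues can be probed programmatically, assuming a current PyPDF2 release (PdfReader; older releases exposed the same data through PdfFileReader).  The file name is a placeholder, and stream compression is handled transparently by the library for common filters:

    from PyPDF2 import PdfReader

    reader = PdfReader("sample.pdf")

    # Encryption: an encrypted file must be decrypted before text extraction.
    print("Encrypted:", reader.is_encrypted)

    # Form XObjects and images: walk each page's /Resources dictionary and
    # list XObjects by subtype; text inside /Form or /Image XObjects is easy
    # for a naive extractor to miss.
    for i, page in enumerate(reader.pages):
        resources = page.get("/Resources")
        if resources is None:
            continue
        xobjects = resources.get_object().get("/XObject")
        if xobjects is None:
            continue
        xobjects = xobjects.get_object()
        subtypes = [xobjects[name].get_object().get("/Subtype") for name in xobjects]
        print(f"page {i}: XObjects {subtypes}")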

PDF examples I tried parsing, to evaluate the packages (a minimal extraction harness follows the list)

  • IRS 1040A
  • 2015-16-prelim-doc-web.pdf (Bellingham city budget)
    • Tabular data begins on page 30 (labelled Page 28)
    • PyPDF2 Parsing result: None of the tabular data is exported
    • SCARY: some financial tables are split across two pages
  • 2016-budget-highlights.pdf (Seattle city budget summary)
    • Tabular data begins on pages 15-16 (labelled 15-16)
    • PyPDF2 Parsing result: this data parses out
  • FY2017 Proposed Budget-Lowell-MA (Lowell)
    • Financial tabular data runs on pages 95-104, then 129-130 and 138-139
    • More interesting are the small breakouts on subsequent pages, e.g. 149, 151, 152, 162; 193, 195, 197
    • PyPDF2 Parsing result: all data I sampled appears to parse out
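
The spot checks above amount to extracting a page range and eyeballing the output.  A minimal harness along these lines is enough, assuming a current PyPDF2 release; the file name and page range are placeholders:

    from PyPDF2 import PdfReader

    reader = PdfReader("2015-16-prelim-doc-web.pdf")
    # Pages are zero-indexed: the tabular data that starts on page 30
    # of the Bellingham budget lives at index 29.
    for index in range(29, 31):
        print(f"--- page {index + 1} ---")
        print(reader.pages[index].extract_text())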

Experiment ideas

  • Build an example PDF for myself with XLS tables, and then see what comes out when the contents are parsed using one of these libraries
  • Build a script that spits out useful metadata about the document: which app/library generated it (e.g. Producer, Creator), size, # of pages (this and the next idea are sketched together after the list)
  • Build another script to verify there’s a non-trivial amount of ASCII/Unicode text in the document (i.e. to confirm it doesn’t have to be OCR’d)
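
A combined sketch of the two script ideas above, again assuming a current PyPDF2 release.  The 200-character threshold for “non-trivial text” is an arbitrary cutoff of mine, not a researched one:

    import os
    import sys

    from PyPDF2 import PdfReader

    def describe(path):
        reader = PdfReader(path)
        info = reader.metadata or {}
        print("File:     ", path)
        print("Size:     ", os.path.getsize(path), "bytes")
        print("Pages:    ", len(reader.pages))
        print("Producer: ", info.get("/Producer"))
        print("Creator:  ", info.get("/Creator"))

        # Crude OCR check: if the first few pages yield almost no text,
        # the document is probably scanned images and needs OCR.
        count = min(5, len(reader.pages))
        sample = "".join(reader.pages[i].extract_text() or "" for i in range(count))
        print("Needs OCR?", "probably" if len(sample.strip()) < 200 else "probably not")

    if __name__ == "__main__":
        describe(sys.argv[1])  # e.g. python describe_pdf.py 2016-budget-highlights.pdf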

Experiments tried
