Parsing PDFs using Python

I’m part of a project that needs to import tabular data into a structured database from PDF files based on digital or analog inputs.  [Digital input = PDF generated from computer applications; analog input = PDF generated from scanned paper documents.]

These are the preliminary research notes I made for myself a while ago that I am now publishing for reference by other project members.  These are neither conclusive nor comprehensive, but they are directionally relevant.

In short, parsing structured data out of analog-input PDFs takes a significant amount of work and is a hurdle not to be underestimated (this blog post was the single most awe-inspiring find I made).  The strongest possible recommendation based on this research is GET AS MUCH OF THE DATA FROM DIGITAL SOURCES AS YOU CAN.

Packages/libraries/guidance

Evaluation of Packages

Possible issues

  • Encryption of the file
  • Compression of the file
  • Vector images, charts, graphs, other image formats
  • Form XObjects
  • Text contained in figures
  • Does text always appear in the same place on the page, or different every page/document?
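
A few of these issues can be spotted before any parsing library is involved. The following is a minimal sketch, not something I ran as part of this research: it scans the raw bytes of a file for the PDF name tokens that usually accompany encryption, Flate compression, Form XObjects and embedded images. The function name `triage_pdf_bytes` and the marker labels are made up for illustration, and a byte scan is only a crude heuristic (it can miss unusual encodings and occasionally match stray bytes).

```python
import re

# PDF name tokens that hint at the issues listed above.
# The labels on the left are illustrative, not part of any library.
MARKERS = {
    "encrypted": re.compile(rb"/Encrypt\b"),
    "flate_compressed_streams": re.compile(rb"/FlateDecode\b"),
    "form_xobjects": re.compile(rb"/Subtype\s*/Form\b"),
    "embedded_images": re.compile(rb"/Subtype\s*/Image\b"),
}


def triage_pdf_bytes(data):
    """Return a dict of marker label -> True/False for raw PDF bytes."""
    return {label: bool(pattern.search(data)) for label, pattern in MARKERS.items()}


if __name__ == "__main__":
    import sys

    # Usage: python triage.py some-budget.pdf
    with open(sys.argv[1], "rb") as handle:
        for label, found in triage_pdf_bytes(handle.read()).items():
            print(f"{label}: {found}")
```

A hit on `encrypted` or `form_xobjects` does not mean the file cannot be parsed, only that it is worth a closer look before trusting the extracted text.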

PDF examples I tried parsing, to evaluate the packages

  • IRS 1040A
  • 2015-16-prelim-doc-web.pdf (Bellingham city budget)
    • Tabular data begins on page 30 (labelled Page 28)
    • PyPDF2 Parsing result: None of the tabular data is exported
    • SCARY: some financial tables are split across two pages
  • 2016-budget-highlights.pdf (Seattle city budget summary)
    • Tabular data appears on pages 15-16 (labelled 15-16)
    • PyPDF2 Parsing result: this data parses out
  • FY2017 Proposed Budget-Lowell-MA (Lowell)
    • Financial tabular data appears on pages 95-104, then 129-130 and 138-139
    • More interesting are the small breakouts on subsequent pages e.g. 149, 151, 152, 162; 193, 195, 197
    • PyPDF2 Parsing result: all data I sampled appears to parse out

Experiment ideas

  • Build an example PDF for myself with XLS tables, and then see what comes out when the contents are parsed using one of these libraries
  • Build a script that spits out useful metadata about the document: which app/library generated it (e.g. Producer, Creator), size, # of pages
  • Build another script to verify there’s a non-trivial amount of ASCII/Unicode text in the document (i.e. so we confirm it doesn’t have to be OCR’d)
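
The last two ideas could be sketched roughly as follows. This is an illustration under assumptions rather than a script I actually built: it assumes the `pypdf` package (the successor to PyPDF2) is installed, the names `looks_digital` and `describe_pdf` are hypothetical, and both thresholds are guesses that would need tuning against real documents. Encrypted files are not handled.

```python
import os


def looks_digital(page_texts, min_chars_per_page=50, min_page_fraction=0.5):
    """Return True if enough pages carry a non-trivial amount of extracted text.

    page_texts holds one string per page (empty or None when nothing came out).
    Both thresholds are illustrative guesses, not tuned values.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1
        for text in page_texts
        # Count non-whitespace characters only.
        if len("".join((text or "").split())) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_page_fraction


def describe_pdf(path):
    """Collect basic metadata plus the text check for one PDF file."""
    # Imported lazily so looks_digital() works without the library installed.
    from pypdf import PdfReader  # assumption: pypdf is available

    reader = PdfReader(path)
    meta = reader.metadata  # may be None when the file carries no metadata
    page_texts = [page.extract_text() for page in reader.pages]
    return {
        "producer": meta.producer if meta else None,
        "creator": meta.creator if meta else None,
        "size_bytes": os.path.getsize(path),
        "pages": len(reader.pages),
        "looks_digital": looks_digital(page_texts),
    }
```

Calling `describe_pdf("2016-budget-highlights.pdf")` would return a small dictionary to print or log. A `looks_digital` value of False suggests the document is a scan and would need OCR before any table extraction.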

Experiments tried

Highlights from latest Lean Coffee

A lively crowd around the table at last Sunday’s Lean Coffee session, and fresh faces to the discussion (thank you to Scott for inviting your colleagues, and to all for coming out).

There’s no way I can do justice to the breadth and depth of the discussion, so I’m just going to mention those things I wrote down on sticky notes to myself – the things that I thought, “Boy, I should get this tattooed on myself somewhere”:

  • Don’t Automate Waste – a killer principle from the Lean camp that Dan Walsh graced us with, it speaks to the tension of not optimizing early, and to my instinct not to assume you have the solution without experimentation
  • “Agile/Scrum is a Problem Discovery Framework, not a Project Management Methodology” – courtesy of Scott Henderson, every word here lends subtle meaning to the mental shift it encourages
  • Lean Coffee has been used successfully in at least two settings I haven’t tried – as the basis for both the Retrospective and Brainstorming sessions (which helps get ideas on the table that might be ‘swallowed’ by the time attention comes around to the less-confident individual)
  • Code 46 and Sully were the two movies that came up in conversation, so off to Netflix I go

I posed a question to the group which came back with some great thoughts: “how to work around a situation [which I’ve observed at many software companies] where the testing infrastructure/coverage isn’t reliable, and there’s no quick route to addressing that?”

  1. Ensure that you at least have Unit Tests included in the Definition of Done
  2. Try an experiment where for a single sprint, the team only works on writing unit tests – when this was tried at one organization, it surprised everyone how much progress and coverage could truly be made
  3. Try a regular “Game Day” exercise – run a tabletop simulation of a production bug that takes out one or more of your customer-facing services.  This identifies not only who must be involved, but also how long it can take to execute corrective action once identified, and ultimately can result in significant time savings by making upstream changes in product/devops.
  4. Run an occasional discussion at Retrospective to ask “what’s the worst thing we could do to the product?”  This can uncover issues and concerns that otherwise go unspoken by folks who are worried about retribution or downplaying.
  5. And the most obvious, start out future sprints by planning tests up front (either via TDD or manually between QA and Dev)

Occupied Neurons, April 2016

https://medium.com/@sproutworx/six-templates-for-aspiring-product-managers-a568d3115cfe#.swkk52f58
So many Product Managers are making it up as they go along – generating whatever kinds of artifacts will get them past the next checkpoint and keep all the spinning plates from veering off into the ether. This is the first time in a long time I’ve seen someone propose some viable, usable and not totally generic tools for capturing their PM thinking. Well worth a look.

https://medium.com/swlh/mvpm-minimum-viable-product-manager-e1aeb8dd421
The “BUT” model for Product Management is a hot topic, and there are a number of folks taking a kick at deciphering it in their context. I’ve got a spin on it that I’ll write about soon, but this is a great take on the model too.

https://schloss.quora.com/Design-doesnt-deserve-a-seat-at-the-table
Captures all my feelings about the complaint from Designers (and Security reviewers, and all others in the “product quality” disciplines) that they get left out of discussions they *should* be part of. My own rant on the subject doesn’t do this subject justice, but I’m convinced that we *earn* our right to a seat by helping steer, working through the messy quagmire that is real software delivery (not just throwing pixel-perfect portfolio fodder over the wall).

http://www.eventbrite.com/e/resilience-and-the-future-of-work-responsiveorg-un-conference-tickets-24045089510
An unconference to expand awareness of a movement among leading thinkers on how to organize work in the 21st century. Looks fascinating – unconference format is dense and high-learning, the subject is still pretty fresh and new (despite the myriad of books building up to this over the last decade), and the energy in the Portland community is bursting.

Meetups where you’ll find Mike’s hat, Spring 2016 edition

Occasionally I’ll tell people I meet about all the meetups I have so much fun at.

Or rather, I’ll try to enumerate them all, and fail each and every time.

Primarily because there are so many meetups I like to check in on.

So occasionally I’ll enumerate them like this, so that my friends have a valiant hope of crossing paths with me before the amazing event has passed.

Meetups I’m slavishly devoted to

Meetups I’ll attend anytime they’re alive

Meetups I sample like caviar – occasionally and cautiously

Recent additions that may soon pass the test of my time

 

Epiphany of Volunteering

Been struggling with the desire to volunteer – to take my skills out to organizations and people who don’t normally have access to the kind of big corporate expertise – and to give myself opportunities to give back to my community.

Only problem is: the kinds of groups in which I want to volunteer (e.g. Hack Oregon) are filled with amazing coders who might not feel friendly and welcoming to a “business/product/design” guy who wants to help out but isn’t a coder or database geek.

I’ve been out to a couple of events, and watched the participants gather together in their natural tendencies. I start out feeling self-conscious and like a deficit to any group I force myself into, and end up just chatting with whoever seems like they might also be feeling disconnected.

I’ve lost my nerve with such organizations and ended up not finding an outlet for my desire to help, contribute my energy and experience, and effect change.

Epiphany
Today, for no explicable reason, it occurred to me that rather than approaching volunteering as a place to contribute, I could instead set my goal to “learning”.

I thought of this when Catherine Nikolovsky talked about the number of Big Data and data visualization nerds in her organization, and I lit up thinking, “I want to learn about Big Data and Dataviz!”

What if I showed up and attempted to simply ask questions, see how Big Data apps are built, and what kinds of decisions are made in developing an effective data visualization?

Do I have the nerve to show up and insert myself without any ego – without an intention to help, but rather just to listen?

And now, a random picture from today’s Facebook distractions:

IMG_2277

PDX-local Meetups in my coherent rotation

(Hah – I meant to type “current rotation” but sometimes autocorrect makes me sound much more nuanced than I meant)

Here’s where you’ll find me lurking and entertaining the not-so-innocent bystanders: meetup-in-a-bar

Occasional fly-bys:

Where’s Mike, September edition

Summer’s over, I can go back to being a couch slug and no one will be the wiser because they’re all back indoors too. I love the great indoors, with all of those mental vistas to take in.

Slug

I wanna get back in touch with you. See you here?:

Further out, I’ll be at

Where are you headed this month?