Parsing PDFs using Python

I’m part of a project that needs to import tabular data into a structured database, from PDF files that are based on digital or analog inputs.  [Digital input = PDF generated directly by computer applications; analog input = PDF generated from scanned paper documents.]

These are the preliminary research notes I made for myself a while ago that I am now publishing for reference by other project members.  These are neither conclusive nor comprehensive, but they are directionally relevant.

That is: the amount of work it takes to parse structured data out of analog-input PDFs is a significant hurdle, not to be underestimated (this blog post was the single most awe-inspiring find I made).  The strongest possible recommendation based on this research: GET AS MUCH OF THE DATA FROM DIGITAL SOURCES AS YOU CAN.

Packages/libraries/guidance

Evaluation of Packages

Possible issues

  • Encryption of the file
  • Compression of the file
  • Vector images, charts, graphs, other image formats
  • Form XObjects
  • Text contained in figures
  • Does text always appear in the same place on the page, or different every page/document?

PDF examples I tried parsing, to evaluate the packages

  • IRS 1040A
  • 2015-16-prelim-doc-web.pdf (Bellingham city budget)
    • Tabular data begins on page 30 (labelled Page 28)
    • PyPDF2 Parsing result: None of the tabular data is exported
    • SCARY: some financial tables are split across two pages
  • 2016-budget-highlights.pdf (Seattle city budget summary)
    • Tabular data begins on page 15-16 (labelled 15-16)
    • PyPDF2 Parsing result: this data parses out
  • FY2017 Proposed Budget-Lowell-MA (Lowell)
    • Financial tabular data starts at page 95-104, then 129-130, 138-139
    • More interesting are the small breakouts on subsequent pages e.g. 149, 151, 152, 162; 193, 195, 197
    • PyPDF2 Parsing result: all data I sampled appears to parse out

Experiment ideas

  • Build an example PDF for myself with XLS tables, and then see what comes out when the contents are parsed using one of these libraries
  • Build a script that spits out useful metadata about the document: which app/library generated it (e.g. Producer, Creator), file size, # of pages
  • Build another script to verify there’s a non-trivial amount of ASCII/Unicode text in the document (i.e. so we can confirm it doesn’t have to be OCR’d) – a rough sketch of both is below
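
To make those last two bullets concrete, here’s a rough sketch of what such a script might look like, assuming PyPDF2’s pre-3.0 API (PdfFileReader, extractText) and an arbitrary 100-character threshold for “has real text” – treat it as a starting point, not tested guidance:

import os
from PyPDF2 import PdfFileReader

def describe_pdf(path):
    # Dump basic metadata plus a rough guess at whether the PDF needs OCR
    with open(path, 'rb') as f:
        reader = PdfFileReader(f)
        print('File size (bytes):', os.path.getsize(path))
        print('Encrypted:', reader.isEncrypted)
        print('Pages:', reader.getNumPages())
        info = reader.getDocumentInfo()  # can be None for some files
        if info is not None:
            print('Producer:', info.producer)
            print('Creator:', info.creator)
        # If almost no text comes back, the PDF is probably scanned images
        # (analog input) and would have to be OCR'd before any table parsing
        chars = sum(len(reader.getPage(i).extractText() or '')
                    for i in range(reader.getNumPages()))
        print('Extractable characters:', chars)
        print('Likely needs OCR' if chars < 100 else 'Has embedded text')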

Experiments tried


Update my Contacts with Python: thinking through the options

Alright folks, that’s the bell.  When LinkedIn stops thinking of itself as a professional contact manager, you know there’s no profit in it, and it’s time to manage this stuff yourself.

Problem To Solve

I’ve been hemming and hawing for a couple of years, ever since Evernote shut down their Hello app, about how to remember who I’ve met and where I met them.  I’m a Meetup junkie (with no rehab in sight) and I’ve developed a decent network of friends and acquaintances that make it easy for me to attend new events and conferences in town – I’ll always “know” someone there (though not always remember why/how I know them or even what their name is).

When I first discovered Evernote Hello, it seemed like the perfect tool for me – it provided a timeline view of all the people I’d met, with rich notes on all the events I’d seen them at and where those places were.  It never entirely gelled, it sporadically did and did NOT support business card import (pay for play mostly), and it was only good for those people who gave me enough info for me to link them.  Even with all those imperfections, I remember regularly scanning that list (from a quiet corner at a meetup/party/conference) before approaching someone I *knew* I’d seen before, but couldn’t remember why.  [Google Glass briefly promised to solve this problem for me too, but that tech is off somewhere, licking its wounds and promising to come back in ten years when we’re ready for it.]

What other options do I have, before settling in to “do it myself”?

  • Pay the big players e.g. SalesForce, LinkedIn
    • Salesforce: smallest SKUs I could find @ $25/month [nope]
    • LinkedIn “Sales” SKU: $65/month [NOPE]
  • Get a cheap/trustworthy/likely-to-survive-more-than-a-year app
    • Plenty of apps I’ve evaluated that sound sketchy, or likely to steal your data, or are so under-funded that they’re likely to die off in a few months

Requirements

Do it myself then.  Now I’ve got a smaller problem set to solve:

  1. Enforce synchronization between my iPhone Contacts.app, the iCloud replica (which isn’t a perfect replica) and my Google Contacts (which are a VERY spotty replica).
    • Actually, let’s be MVP about this: all I *need* right now is a way of automating edits to Contacts on my iPhone.  I assume that the most reliable way of doing this is to make edits to the iCloud.com copy of the contact and let it replicate down to my phone.
    • the Google Contacts sync is a future-proofing move, and one that theoretically sounded free (just needed to flip a toggle on my iPhone profile), but which in practice seems to be built so badly that only about 20% of my contacts have ever sync’d with Google
  2. Add/update information to my contacts such as photos, “first met” context (who introduced, what event met at) and other random details they’ve confessed to me (other attempts to hook my memory) – *WITHOUT* linking my iPhone contacts with either LinkedIn or Facebook (who will of course forever scrape all that data up to their cloud, which I do *not* want to do – to them or me).

Test the Sync

How can I test my requirements in the cheapest way possible?

  • Make hand edits to the iCloud.com contacts and check that it syncs to the iPhone Contacts.app
    • Result: sync to iPhone within seconds
  •  Make hand edits to contacts in Contacts.app and check that it syncs to iCloud.com contact
    • Result: sync to iCloud within seconds

OK, so once I have data that I want to add to an iCloud contact, and code (Python for me please!) that can write to iCloud contacts, it should be trivial to edit/append.
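
I haven’t written that code yet, but my mental model of the “write to iCloud” piece is something like the following – a purely hypothetical sketch, assuming iCloud’s CardDAV interface, a contacts-collection URL I’d still have to discover, and an app-specific password (every URL and credential below is a placeholder, not a verified endpoint):

import uuid

import requests
import vobject  # builds/parses vCards

def add_icloud_contact(contacts_url, apple_id, app_password, first, last, note):
    # Build a minimal vCard and PUT it into a CardDAV contacts collection
    card = vobject.vCard()
    card.add('n').value = vobject.vcard.Name(family=last, given=first)
    card.add('fn').value = '{0} {1}'.format(first, last)
    card.add('note').value = note  # e.g. "met at the December data meetup"
    card_id = str(uuid.uuid4())
    card.add('uid').value = card_id
    resp = requests.put(
        '{0}/{1}.vcf'.format(contacts_url.rstrip('/'), card_id),  # placeholder URL scheme
        data=card.serialize(),
        headers={'Content-Type': 'text/vcard; charset=utf-8'},
        auth=(apple_id, app_password))  # app-specific password, not my real one
    resp.raise_for_status()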

Here’s all the LinkedIn Data I Want

Data that’s crucial to remembering who someone is:

  • Date we first connected on LinkedIn
  • Tags
  • Notes
  • Picture

Additional data that can help me fill in context if I want to dig further:

  • current company
  • current title
  • Twitter ID
  • Web site addresses
  • Previous companies

And metadata that can help uniquely identify people when reading or writing from other directories:

  • Email address
  • Phone number

How to Get my LinkedIn connection data?

OK, so (as of 2016-12-15 at 12:30pm PST) there’s three ways I can think of pulling down the data I’ve peppered into my LinkedIn connections:

  1. User Data Archive: request an export of your user data from LinkedIn
  2. LinkedIn API: request data for specified Connections using LinkedIn’s supported developer APIs
  3. Web Scraping: iterate over every Connection and pull fields via CSS using e.g. Beautiful Soup

User Data Archive

This *sounds* like the most efficient and straightforward way to get this data.  The “Relationship Section” announcement even implies that I’ll get everything I want:

If you want to download your existing Notes and Tags, you’ll have the option to do so through March 31, 2017…. Your notes and tags will be in the file named Contacts.

The initial data dump included everything except a Contacts.csv file.  The later Complete_LinkedInDataExport_12-16-2016 [ISO 8601 anyone?] included the data promised and nearly nothing else:

  • Connections.csv: First Name, Last Name, Email Address, Current Company, Current Position, Tags
  • Contacts.csv: First Name, Last Name, Email (mostly blank), Notes, Tags

I didn’t expect to get Picture, but I was hoping for Date First Connected, and while the rest of the data isn’t strictly necessary, it’s certainly annoying that LinkedIn is so friggin frugal.

Regardless, I have almost no other source for pictures for my professional contacts, and that is pretty essential for recalling someone I’ve met only a handful of times, so while helpful, this wasn’t sufficient.
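
Stitching the two files back together is at least easy.  A quick sketch (column names taken from the lists above – the real exports may differ slightly):

import csv

def load_rows(path):
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.DictReader(f))

def merge_exports(connections_path, contacts_path):
    # Fold the Notes/Tags from Contacts.csv into the Connections.csv rows,
    # matching on (first name, last name)
    notes_by_name = {
        (r['First Name'].strip().lower(), r['Last Name'].strip().lower()): r
        for r in load_rows(contacts_path)}
    merged = []
    for row in load_rows(connections_path):
        key = (row['First Name'].strip().lower(), row['Last Name'].strip().lower())
        extra = notes_by_name.get(key, {})
        row['Notes'] = extra.get('Notes', '')
        row['Tags'] = row.get('Tags') or extra.get('Tags', '')
        merged.append(row)
    return merged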

LinkedIn API

The next most reliable way to attack this data is to programmatically request it.  However, as I would’ve expected from this “roach motel” of user-generated data, they don’t even support an API to request all Connections from your user account (merely sign-in and submit data).

Where they do make reference to user data, it’s in a highly-regulated set of Member Profile fields:

  • With the r_basicprofile permission, you can get first-name, last-name, positions, picture-url plus some other data I don’t need
  • With the r_emailaddress permission, you can get the user’s primary email address
  • For developers accepted into “Apply with LinkedIn”, and with the r_fullprofile permission, you can further get date-of-birth and member-url-resources
  • For those “Apply with LinkedIn” developers who have the r_contactinfo permission, you can further get phone-numbers and twitter-accounts

After registering a new application, I am immediately given the ability to grant the following permissions to my app: r_basicprofile, r_emailaddress.  That’ll get me picture-url, if I can figure out a way to enumerate all the Connections for my account.

(A half-hour sorting through Chrome Dev Tools’ Network outputs later…)

Looks like there’s a handy endpoint that lets the browser enumerate pretty much all the data I want:

https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

That bears further investigation.
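
If it pans out, the paging logic would be roughly this.  (A sketch only: it’s an internal, unsupported endpoint that could change at any time, it relies on copying my logged-in browser session cookies, and I’m guessing at the shape of the JSON payload.)

import requests

FIELDS = ('id,name,firstName,lastName,company,title,location,tags,'
          'emails,sources,displaySources,connectionDate,secureProfileImageUrl')

def fetch_connections(session_cookies, page_size=10):
    # Page through /connected/api/v2/contacts until it stops returning rows
    contacts, start = [], 0
    while True:
        resp = requests.get(
            'https://www.linkedin.com/connected/api/v2/contacts',
            params={'start': start, 'count': page_size,
                    'fields': FIELDS, 'sort': 'CREATED_DESC'},
            cookies=session_cookies)  # copied from an authenticated browser session
        resp.raise_for_status()
        batch = resp.json().get('values', [])  # guessing at the payload key
        if not batch:
            break
        contacts.extend(batch)
        start += page_size
    return contacts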

Web Scraping

While this approach doesn’t have the built-in restrictions of the LinkedIn APIs, there are at least three challenges I can foresee so far:

  1. LinkedIn requires authentication, and OAuth 2.0 at that (at least for API access).  Integrating OAuth into a Beautiful Soup script isn’t something I’ve heard of before, but I’m seeing some interesting code fragments and tutorials that could be helpful, and it appears that the requests package can do OAuth 1 & 2.
  2. LinkedIn has helpfully implemented the “infinite scroll” AJAX behaviour on the Connections page.
    • There are ways to work with this behaviour, but it sure feels cumbersome – to the point I almost feel like doing this work by hand would just be faster.
  3. Navigating automatically to each linked page (each Connection) from the Connections page isn’t something I am entirely confident about
    • Though I imagine it should be as easy as “for each Connection in Connections, load the page, then find the data with this CSS attribute, and store it in an array of whatever form you like” (rough sketch below).  The mechanize package promises to make the link navigation easy.
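
That “for each Connection” loop would look something like this – a hypothetical sketch, where the session handling and the CSS selectors are placeholders I’d still have to work out against the live pages (and which would probably break regularly):

import requests
from bs4 import BeautifulSoup

def scrape_profiles(profile_urls, session_cookies):
    # Visit each connection's profile page and pull out a couple of fields
    people = []
    for url in profile_urls:
        resp = requests.get(url, cookies=session_cookies)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, 'html.parser')
        name_tag = soup.select_one('.profile-name')       # placeholder selector
        title_tag = soup.select_one('.profile-headline')  # placeholder selector
        people.append({
            'url': url,
            'name': name_tag.get_text(strip=True) if name_tag else None,
            'title': title_tag.get_text(strip=True) if title_tag else None,
        })
    return people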

Am I Ready for This Much Effort?

It sure feels like there are a lot of barriers in the way of just collecting the info I’ve accumulated in LinkedIn about my connections.  Would it take me less time to just browse each connection page and hand copy/paste the data from LinkedIn to iCloud?  Almost certainly.  Putting together a Beautiful Soup + requests + various GitHub modules solution would probably take me 20-30 hours, I’m guessing – from all the reading and piecing together of code fragments from various sources, to debugging and troubleshooting, to making something that spits out the data and then automatically uploads it without mucking up existing data.

Kinda takes the fun out of it that way, doesn’t it?  I mean, the “glory” of writing code that’ll do something I haven’t found anyone else doing – that’s a little boost of ego and all.  Still, it’s hard to believe this kind of thing hasn’t been solved elsewhere – am I the only person with this bad of a memory, and this much of a drive to keep myself from looking like Leonard Shelby at every meetup?

What’s worse though, for embarking on this thing, is that I’d bet in six months’ time, LinkedIn and/or iCloud will have ‘broken’ enough of their site(s) that I wouldn’t be able to just re-use what I wrote the first time.  Maintenance of this kind of specialized/unique code feels pretty brutal, especially if no one else is expected to use it (or at least, I don’t have any kind of following to make it likely folks will find my stuff on github).

Still, I don’t think I can leave this itch entirely unscratched.  My gut tells me I should dig into that Contacts API first before embarking on the spelunking adventure that is Beautiful Soup.

This time, success: Flask-on-AWS tutorial (with advanced use of virtualenv)

Last time I tried this, I ended up semi-deliberately choosing Python 3 for a tutorial that (I didn’t realize quickly enough) was built around Python 2.

After cleaning up my experiment I remembered that the default python on my MacBook was still python 2.7.10, which gave me the idea I might be able to re-run that tutorial with all-Python 2 dependencies.  Or so it seemed.

Strangely, the first step both went better and no better than last time:

Mac4Mike:flask-aws-tutorial mike$ virtualenv flask-aws
Using base prefix '/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5'
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3.5
Also creating executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Installing setuptools, pip, wheel...done.

Yes, it didn’t throw any errors, but no, it didn’t use the base Python 2 I’d hoped for.  Somehow the fact that I’ve installed Python 3 on my system is still getting picked up by virtualenv, so I needed to dig further into how virtualenv can be used to truly insulate from Python 3.

Found a decent article here that gave me hope, and even though they punted to using the virtualenvwrapper scripts, it still clued me in to the virtualenv parameter “-p”, so this seemed to work like a charm:

Mac4Mike:flask-aws-tutorial mike$ virtualenv flask-aws -p /usr/bin/python
Running virtualenv with interpreter /usr/bin/python
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Installing setuptools, pip, wheel...done.

This time?  The requirements install worked like a charm:

Successfully installed Flask-0.10.1 Flask-SQLAlchemy-2.0 Flask-WTF-0.10.3 Jinja2-2.7.3 MarkupSafe-0.23 PyMySQL-0.6.3 SQLAlchemy-0.9.8 WTForms-2.0.1 Werkzeug-0.9.6 argparse-1.2.1 boto-2.28.0 itsdangerous-0.24 newrelic-2.74.0.54

Then (since I still had all the config in place), I ran pip install awsebcli and skipped all the way to the bottom of the tutorial and tried eb deploy:

INFO: Deploying new version to instance(s).                         
ERROR: Your requirements.txt is invalid. Snapshot your logs for details.
ERROR: [Instance: i-01b45c4d01c070555] Command failed on instance. Return code: 1 Output: (TRUNCATED)...)
  File "/usr/lib64/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/opt/python/run/venv/bin/pip install -r /opt/python/ondeck/app/requirements.txt' returned non-zero exit status 1. 
Hook /opt/elasticbeanstalk/hooks/appdeploy/pre/03deploy.py failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
INFO: Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
ERROR: Unsuccessful command execution on instance id(s) 'i-01b45c4d01c070555'. Aborting the operation.
ERROR: Failed to deploy application.

This kept barfing over and over until I remembered that the target environment was still configured for Python 3.4.  Fortunately or not, you can’t change major versions of the platform – so back to eb init I go (with the -i parameter to re-initialize).

This time around?  The command eb deploy worked like a charm.

Lesson: be *very* explicit about your Python versions when messing with someone else’s code.  [Duh.]

Getting to know AWS with Python (the free way)

I’ve run through all the easy AWS tutorials and was looking to roll my sleeves up and take it one level deeper, so (given I’ve recently completed a Python class, and given I’m trying to stay on a budget) I hunted around and found some ideas for putting a sample Flask app up on AWS Elastic Beanstalk at as little cost as possible.  Here’s how I wove them together (into an almost-functional cloud solution, first time around).

Set aside Anaconda, install Python3

First step in the tutorial I found was building the Python app locally (then eventually “freezing” it and uploading to AWS EB).  [Note there’s a similar tutorial from AWS here, which is good for comparison.]  [Also note: after I mostly finished my work, I confirmed that the tutorial I’m using was written for Python 2.7, and yet I’d blundered into it with Python 3.4.  Not for the faint of heart, nor for those who like non-Degraded health status.]

Which starts with using virtualenv to build up the app profile.

But for me, virtualenv threw an error after I installed it (with pip install virtualenv):

ERROR: virtualenv is not compatible with this system or executable 

Okay…so Ryan Wilcox gave us a clue that you might be using the wrong python.  Running which python told me it was pointing at the anaconda version, so I commented that PATH modification out of .bash_profile and installed python3 using homebrew (which I’d previously installed on my MacBook).

Setup the Flask environment

Debug virtualenv

Surprising no one but myself, by removing anaconda and installing python3, I lost access not only to virtualenv but also to pip, so I guess all that was wrapped in the anaconda environment.  Running brew install pip reports the following:

Updating Homebrew...
Error: No available formula with the name "pip" 
Homebrew provides pip via: `brew install python`. However you will then
have two Pythons installed on your Mac, so alternatively you can install
pip via the instructions at:
  https://pip.readthedocs.io/en/stable/installing/

The only option from that article seems to be running get-pip.py, so I ran python3 get-pip.py (in case the script installed a different pip based on which version of python was running the script).  Running pip --version returned this, so I felt safe enough to proceed:

pip 9.0.1 from /usr/local/lib/python3.5/site-packages (python 3.5)

Ran pip install virtualenv, now we’re back where I started (but without the anaconda crutch that I don’t entirely understand, and didn’t entirely trust to be compatible with these AWS EB tutorials).

Running virtualenv flask-aws from within the local clone of the tutorial repo throws this:

Using base prefix '/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5'
Overwriting /Users/mike/code/flask-aws-tutorial/flask-aws/lib/python3.5/orig-prefix.txt with new content
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3.5
Not overwriting existing python script /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python (you must use /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3.5)
Traceback (most recent call last):
  File "/usr/local/bin/virtualenv", line 11, in 
    sys.exit(main())
  File "/usr/local/lib/python3.5/site-packages/virtualenv.py", line 713, in main
    symlink=options.symlink)
  File "/usr/local/lib/python3.5/site-packages/virtualenv.py", line 925, in create_environment
    site_packages=site_packages, clear=clear, symlink=symlink))
  File "/usr/local/lib/python3.5/site-packages/virtualenv.py", line 1370, in install_python
    os.symlink(py_executable_base, full_pth)
FileExistsError: [Errno 17] File exists: 'python3.5' -> '/Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3'

Hmm, is this what virtualenv does?  OK, if that’s true why is something intended to insulate from external dependencies seemingly getting borked by external dependencies?

Upon closer examination, it appears that what the virtualenv script tried to do the first time was generate a symlink to a python3.5 interpreter, and it got tripped up when it found a symlink already there.  However, the second time I ran the command (hoping it was self-healing), I got yelled at about “too many symlinks” and discovered that the targets now form a circular loop:

Mac4Mike:bin mike$ ls -la
total 24
drwxr-xr-x  5 mike  staff  170 Dec 11 12:26 .
drwxr-xr-x  6 mike  staff  204 Dec 11 12:26 ..
lrwxr-xr-x  1 mike  staff    9 Dec 11 12:26 python -> python3.5
lrwxr-xr-x  1 mike  staff    6 Dec 11 11:52 python3 -> python
lrwxr-xr-x  1 mike  staff    6 Dec 11 11:52 python3.5 -> python

If I’m reading the above stack trace correctly, it was trying to create a symlink from python3.5 to some other target.  So all it should require is deleting the python3.5 symlink, yes?  Easy.

Aside: then running virtualenv flask-aws from the correct location (but an old, not-refreshed shell) spills this:

Using base prefix '/Users/mike/anaconda3'
Overwriting /Users/mike/code/flask-aws-tutorial/flask-aws/lib/python3.5/orig-prefix.txt with new content
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Traceback (most recent call last):
  File "/Users/mike/anaconda3/bin/virtualenv", line 11, in 
    sys.exit(main())
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 713, in main
    symlink=options.symlink)
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 925, in create_environment
    site_packages=site_packages, clear=clear, symlink=symlink))
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 1387, in install_python
    raise e
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 1379, in install_python
    stdout=subprocess.PIPE)
  File "/Users/mike/anaconda3/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/Users/mike/anaconda3/lib/python3.5/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg)
OSError: [Errno 62] Too many levels of symbolic links

Try again from a shell where anaconda’s been removed from $PATH?  Works fine:

Installing setuptools, pip, wheel...done.

Step 2: Setting up the Flask environment

Installing the requirements (using pip install -r requirements.txt) went *mostly* smoothly, but it broke down on either the distribute or itsdangerous dependency:

Collecting distribute==0.6.24 (from -r requirements.txt (line 11))
  Downloading distribute-0.6.24.tar.gz (620kB)
    100% |████████████████████████████████| 624kB 1.1MB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "", line 1, in 
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/setuptools/__init__.py", line 2, in 
        from setuptools.extension import Extension, Library
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/setuptools/extension.py", line 2, in 
        from setuptools.dist import _get_unpatched
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/setuptools/dist.py", line 103
        except ValueError, e:
                         ^
    SyntaxError: invalid syntax
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/

Unfortunately the “/pip-build-6gettd5v/” folder was already deleted by the time I got there to look at its contents.  So I re-ran the script and noticed that all dependencies through to distribute reported as “Using cached distribute-0.6.24.tar.gz”.

Figured I’d just pip install itsdangerous by hand, and if I got into too much trouble I could always uninstall it, right?  Well, that didn’t seem to help arrest this error – presumably pip caches all the requirements files first and then installs them – so I figured I’d divide the requirements.txt file into two (creating requirements2.txt and stuffing the last three requirements there instead) and see if that helped bypass the problem in installing the first 11 requirements.

Nope, that didn’t work, so I also moved the line “distribute==0.6.24” out of requirements.txt into requirements2.txt:

Successfully installed Flask-0.10.1 Flask-SQLAlchemy-2.0 Flask-WTF-0.10.3 Jinja2-2.7.3 MarkupSafe-0.23 PyMySQL-0.6.3 SQLAlchemy-0.9.8 WTForms-2.0.1 Werkzeug-0.9.6 argparse-1.2.1

Yup, that did it, so off to pip install -r requirements2.txt:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-5y_r3gg0/distribute/

OK, so then I removed the “distribute==0.6.24” line entirely from requirements2.txt:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-5g9i5fpl/wsgiref/

Crap, wsgiref too?  OK, pip install wsgiref==0.1.2 by hand:

Collecting wsgiref==0.1.2
  Using cached wsgiref-0.1.2.zip
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "", line 1, in 
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-suqskeut/wsgiref/setup.py", line 5, in 
        import ez_setup
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-suqskeut/wsgiref/ez_setup/__init__.py", line 170
        print "Setuptools version",version,"or greater has been installed."
                                 ^
    SyntaxError: Missing parentheses in call to 'print'

Shit, even worse – the package itself craps out.  Good news is, I know exactly what’s wrong here: in Python 3, print is a function and requires ().  (See, I *did* learn something from that anti-Zed rant.)

Double crap – wsgiref‘s latest version IS 0.1.2.

Triple-crap – that code’s docs haven’t been touched in forever, and don’t exist in a git repo against which issues can be filed.

Damn, this *really* sucks.  Seems I’ve been trapped in the hell that is “tutorials written for Python2”.  The only way I could get around this error is to edit the source of the cached wsgiref-0.1.2.zip file – do such files get deleted from $TMPDIR when pip exits?  Or are they stuffed in ~/Library/Caches/pip in some obfuscated filename?

Ultimately it didn’t matter – I found that pip install --download ~/ wsgiref==0.1.2 worked to dump the file exactly where I wanted.

And yes, wrapping lines 170 and 171 with () after the print verb (and re-zip’ing the folder contents) allowed me to fully install wsgiref-0.1.2 using pip install --no-index --find-links=~/ wsgiref-0.1.2.zip.
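
For reference, the edit was just the classic Python 2-to-3 print change on the line the traceback pointed at (version shown here as a placeholder – ez_setup defines it itself):

version = "0.6c11"  # placeholder; ez_setup sets this at runtime

# Python 2 form that triggers the SyntaxError under Python 3:
#   print "Setuptools version", version, "or greater has been installed."

# Parenthesized form that works under both Python 2 and Python 3:
print("Setuptools version", version, "or greater has been installed.")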

OK, finally ran pip install boto==2.28.0 and that’s all the dependencies installed.

Next!

Step 3: Create the AWS RDS

While the AWS UI has changed in subtle but UX-improving ways, this part wasn’t hard to follow.  I wasn’t happy about seeing the “all traffic, all sources” security group applied here, but not knowing what my ultimate target configuration might look like, I made the classic blunder of saying to self, “I’ll lock it down later.”

Step 4: Add tables

It wasn’t entirely clear, so here’s the construction of the SQLALCHEMY_DATABASE_URI parameter in config.py:

  • <username> = Master Username you chose on the Specify DB Details page
    e.g. “awsflasktst”
  • <password> = Master Password you chose on the Specify DB Details page
    e.g. “awsflasktstpwd”
  • <endpoint> = Endpoint string from the DB Instance page
    e.g. “flask_tst.cxcmcajseabc.us-west-2.rds.amazonaws.com:3306”
  • <db_name> = Database Name you chose on the Configure Advanced Settings page
    e.g. “awstflasktstdb”

So in my case that line read (well, it didn’t because I’m not telling you my *actual* configuration, but this lets you know how I parsed the tutorial guidance):

SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://awsflasktst:awsflasktstpwd@flask_tst.cxcmcajseabc.us-west-2.rds.amazonaws.com:3306/awstflasktstdb'

Step 4.6: Install NewRelic

Because I’m a former Relic and I’m still enthused about APM technology, I decided to install the newrelic package via pip install newrelic.  A few steps into the Quick Start guide and it was reporting all the…tiny amounts of traffic I was giving myself.
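
Those “few steps” boil down to roughly this – a sketch from memory of the agent docs, assuming a newrelic.ini generated by newrelic-admin generate-config sitting next to the app, and a Flask app object named application:

import newrelic.agent
newrelic.agent.initialize('newrelic.ini')  # call this before importing app modules

from flask import Flask

application = Flask(__name__)

@application.route('/')
def index():
    return 'Hello from an APM-monitored page'

# Wrap the WSGI entry point so the agent times each request
application.wsgi_app = newrelic.agent.WSGIApplicationWrapper(application.wsgi_app)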

God knows if this’ll survive the pip freeze, but it’s sure worth a shot.  [Spoiler: cloud deployment was ultimately unsuccessful, so it’s moot.]

Step 5: Elastic Beanstalk setup

Installing the CLI was easy enough:

Successfully installed awsebcli-3.8.7 blessed-1.9.5 botocore-1.4.85 cement-2.8.2 colorama-0.3.7 docker-py-1.7.2 dockerpty-0.4.1 docopt-0.6.2 docutils-0.13.1 jmespath-0.9.0 pathspec-0.5.0 python-dateutil-2.6.0 pyyaml-3.12 requests-2.9.1 semantic-version-2.5.0 six-1.10.0 tabulate-0.7.5 wcwidth-0.1.7 websocket-client-0.40.0

Setup of the user account was fairly easy, although again it’s clear that AWS has done a lot of refactoring of their UI flows.  Just read ahead a little in the tutorial and you’ll be able to fill in the blanks.  (Although again, it’s a little disturbing to see a tutorial be this cavalier about the access permissions being granted to a simple web app.)

When we got to the eb init command I ran into a snag:

ERROR: ConfigParseError :: Unable to parse config file: /Users/mike/.aws/config

I checked and I *have* a config file there, and it’s already been populated with Packer credentials info (for a separate experiment I was working on previously).

So what did I do?  I renamed the config and credentials files to set them aside for now, of course!

Like me, you might also see an EB application listed, so I chose “2” not “1”.  No big deal.

There was also an unexpected option to use AWS CodeCommit, but since I’m not even sure I’m keeping this app around, I did not take that choice for now.

(And once again, this tutorial was super-cavalier about ignoring the SSH option that *everyone* always uses, so just closed my eyes and prayed I never inherit these bad habits…)

“Time to deploy this bad boy.”  Bad boy indeed – Python 2.7, *no* security measures whatsoever.  This boy is VERY bad.

Step 6: Deployment

So there’s this apparent dependency between the EBCLI and your local git repo.  If your changes aren’t committed, then what gets deployed won’t reflect those changes.

However, the part about running git push on the repo seemed unnecessary.  If the local CLI tool checks my repo, it’s only checking the local one, not the remote one, so once we run git commit, it shouldn’t matter whether those changes were pushed upstream.

However, knowing that I monkeyed with the requirements.txt file, I thought I’d also git add that one to my commit history:

git add requirements.txt
git commit -m "added newrelic requirement"

After initiating eb create, I was presented with a question never mentioned in the tutorial (another new AWS feature!), and not documented specifically in any of the ELB documentation I reviewed:

Select a load balancer type
1) classic
2) application
(default is 1): 

While my experience in this space tells me “application” sounds like the right choice, I chose the default for sake of convenience/sanity.

Proceeded with eb deploy, and it was looking good until we ran into the problem I was expecting to see:

ERROR: Your requirements.txt is invalid. Snapshot your logs for details.
ERROR: [Instance: i-01b45c4d01c070555] Command failed on instance. Return code: 1 Output: (TRUNCATED)...)
  File "/usr/lib64/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/opt/python/run/venv/bin/pip install -r /opt/python/ondeck/app/requirements.txt' returned non-zero exit status 1. 
Hook /opt/elasticbeanstalk/hooks/appdeploy/pre/03deploy.py failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
INFO: Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
WARN: Environment health has transitioned from Pending to Degraded. Command failed on all instances. Initialization completed 7 seconds ago and took 3 minutes.
ERROR: Create environment operation is complete, but with errors. For more information, see troubleshooting documentation.

Lesson

Ultimately unsuccessful, but was it satisfying?  Heck yes, grappling with all the tools and troubleshooting various aspects of the procedures was a wonderful learning experience.  [And as it turned out, was a great primer to lead into a less hand-holding tutorial I found later that I’ll try next.]

Coda

Well, heck.  That was easier than I thought.  Succeeded after another run-through by forcing Python 2.7 on both the local app configuration and in a new Elastic Beanstalk app.

Git Tricks I Keep Having to Look Up

Another trail of breadcrumbs for myself…me being the kind of guy I am, I try to do things the “git” way so I don’t piss off those upstream who might otherwise not have the energy to deal with another PR.

Sync Fork with Upstream
http://stackoverflow.com/questions/3903817/pull-new-updates-from-original-github-repository-into-forked-github-repository?rq=1

Git Squash
https://ariejan.net/2011/07/05/git-squash-your-latests-commits-into-one/

Reverting Git Commits
http://stackoverflow.com/questions/6971717/github-how-to-revert-changes-to-previous-state#6971775

Switching branches
http://stackoverflow.com/questions/1475037/switching-branches-in-git

Push.default warning
http://stackoverflow.com/questions/13148066/warning-push-default-is-unset-its-implicit-value-is-changing-in-git-2-0#13148313

The Git Workflow (according to Atlassian)
https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow

Switching to SSH for an existing local repo
http://stackoverflow.com/questions/6565357/git-push-requires-username-and-password

The classic…
http://nvie.com/posts/a-successful-git-branching-model/

Occupied Neurons, November edition (late)

Docker In Production: a History of Failure

A cautionary tale to counter some of the newbie hype around the new Infrastructure Jesus that is Docker. I’ve fallen prey to the hype as well, assuming that (a) Docker is ready for prime time, (b) Docker is universally beneficial for all workloads and (c) Docker is measurably superior to the infrastructure design patterns that it intends to replace.

That said, the article is long on complaints, and doesn’t attempt to back its claims with data, third-party verification or unemotional hyperbole. I’m sure we’ll see many counter-articles claiming “it works for me”, “I never saw these kinds of problems” and “what’s this guy’s agenda?”  I’ll still pay attention to commentary like this, because it reads to me like the brain dump of a person exhausted from chasing their tail all year trying to find a tech combo that they can just put in production and not devote unwarranted levels of monitoring and maintenance to. I think their expectations aren’t unreasonable. It sure sounds like the Docker team are more ambitious or cavalier than their position and staffing levels warrant.

Wat

This is one of the most hilarious and horrifying expeditions into the dark corners of (un?)intended consequences in coding languages.  Watching this made me feel like I’m more versed in the lessons of the absurd “stupid pet tricks” with many languages, even if I’d never use 99% of these in real life.  It also made me feel like “did someone deliberately allow these in the language design, or did some nearly-insane persons just end up naturally stumbling on these while trying to make the language do things it should never have done?”

Is Agile dying a slow death?  Or is it being reborn?

This guy captures all my attitudes about “Agile according to the rules” versus “getting an organization tuned to collaborate and learn as fast as possible”.  While extra/unnecessary process makes us feel like we have guard rails to keep people from making mistakes, in my experience what it *actually* does is drive DISengagement and risk aversion in most employees, knowing that unless they have explicit permission to break the rules, their great new idea is likely to attract organizational antibodies.

Stanford’s password policy shuns one-size-fits-all security

This is better than a Bigfoot sighting! An actual organization who’ve thought about security risk vs punishing anti-usability and come up with an approach that should satisfy both campaigns! This UX-in-security bigot can finally die a happy man.

A famed hacker is grading thousands of programs – and may revolutionise software in the process

May not get to the really grotty code security issues that are biting us some days, and probably giving a few CIOs a false sense of security.  Controversial?  Yes.

A necessary next step as software grows up as an engineering discipline? Absolutely.

Let’s see many more security geeks meeting the software developer where they live, and stop expecting ’em to voluntarily become part-time security experts just because someone came up with another terrific Hollywood Security Theater plot.

A Rebuttal for Python 3

Why are some old-school Pythonistas so damned pissy about Python 3 – to the point of (in at least one egregiously dishonest case) writing long articles trying to dissuade others from using it? Are they still butthurt at Guido for making breaking changes that don’t allow them to run their old Python 2 code on the Python 3 runtime? Do they not like change? Are they aware that humans are imperfect and sometimes have to admit mistakes/try something different? I find it fascinating to watch these kinds of holy wars – it gives the best kinds of insights into what frailties and hot buttons really motivate people.

The best quote’s in the comments: “Wow, I haven’t seen this much bullshit in a “technical” article in a while. A Donald Trump transcript is more honest and informative than that. I seriously doubt Zed Shaw himself believes a single paragraph there; if he actually does, he should stop acting like a Python expert and admit he’s an idiot.”

How The Web Became Unreadable

It’s painful to see some designers slavishly devote their efforts more to the third-hand fashion they hear about from other designers than to the end users of the sites and services to which they deliver their designs.  I love a lot of the design work that’s come out the last few years – the jumbled mess that was web design ten years ago was painful – but the practical implications of how that design is consumed in the wild must be paramount.  And it will be, wherever I’m the final decision maker on shipping software.

Mashing the marvelous wrapper until it responds, part 1: prereq/setup

I haven’t used a dynamic language for coding nearly as much as strongly-typed, compiled languages so approaching Python was a little nervous-making for me.  It’s not every day you look into the abyss of your own technical inadequacies and find a way to keep going.

Here’s how embarrassing it got for me: I knew enough to clone the code to my computer and to copy the example code into a .py file, but beyond that it felt like I was doing the same thing I always do when learning a new language: trying to guess at the basics of using the language that everyone who’s writing about it already knows and has long since forgotten, they’re so obvious.  Obvious to everyone but the neophyte.

Second, I don’t respond well to the canonical means of learning a language (at least according to all the “Learn [language_x] from scratch” books I’ve picked up over the years), which is

  • Chapter 1: History, Philosophy and Holy Wars of the Language
  • Chapter 2: Installing The Author’s Favourite IDE
  • Chapter 3: Everything You Don’t Have a Use For in Data Types
  • Chapter 4: Advanced Usage of Variables, Consts and Polymorphism
  • Chapter 5: Hello World
  • Chapter 6: Why Hello World Is a Terrible Lesson
  • Chapter 7: Author’s Favourite Language Tricks

… etc.

I tend to learn best by attacking a specific, relevant problem hands-on – having a real problem I felt motivated to attack is how these projects came to be (EFSCertUpdater, CacheMyWork).  So for now, despite a near-complete lack of context or mentors, I decided to dive into the code and start monkeying with it.

Riches of Embarrassment

I quickly found a number of “learning opportunities” – I didn’t know how to:

  1. Run the example script (hint: install the python package for your OS, make sure the python binary is in your current shell’s path, and don’t use the Windows Git Bash shell as there’s some weird bug currently at work)
  2. Install the dependencies (hint: run “pip install xxxx”, where “xxxx” is whatever shows up at the end of an error message like this:
    C:\Users\Mike\code\marvelous>python example.py 
    Traceback (most recent call last):     
        File "example.py", line 5, in <module>
            from config import public_key, private_key 
    ImportError: No module named config

    In this example, I ran “pip install config” to resolve this error.

  3. Set the public & private keys (hint: there was some mention of setting environment variables, but it turns out that for this example script I had to paste them into a file named “config” – no, for python the file needs to be named “config.py even though it’s text not a script you would run on its own – and make sure the config.py file is stored in the same folder as the script you’re running.  Its contents should look similar to these (no, these aren’t really my keys):
        public_key = 81c4290c6c8bcf234abd85970837c97 
        private_key = c11d3f61b57a60997234abdbaf65598e5b96

    Nope, don’t forget – when you declare a variable in most languages, and the variable is not a numeric value, you have to wrap the variable’s value in some type of quotation marks.  [Y’see, this is one of the things that bugs me about languages that don’t enforce strong typing – without it, it’s easy for casual users to forget how strings have to be handled]:

        public_key = '81c4290c6c8bcf234abd85970837c97' 
        private_key = 'c11d3f61b57a60997234abdbaf65598e5b96'
  4. Properly call into other Classes in your code – I started to notice in Robert’s Marvelous wrapper that his Python code would do things like this – the comic.py file defined
         class ComicSchema(Schema):

    …and the calling code would state

        import comic 
        … 
        schema = comic.ComicSchema()

    This was initially confusing to me, because I’m used to compiled languages like C# where you import the defined name of the Class, not the filename container in which the class is defined.  If this were C# code, the calling code would probably look more like this:

        using ComicSchema;
        … 
        _schema Schema = ComicSchema();

    (Yes, I’m sure I’ve borked the C# syntax somehow, but for sake of this sad explanation, I hope you get the idea where my brain started out.)

    I’m inferring that for a scripted/dynamic language like Python, the Python interpreter doesn’t have any preconceived notion of where to find the Classes – it has to be instructed to look at specific files first (import comic, which I’m guessing implies import comic.py), then further to inspect a specified file for the Class of interest (schema = comic.ComicSchema(), where comic. indicates the file to inspect for the ComicSchema() class).
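
Put another way, here are the two equivalent import styles side by side (assuming the same comic.py from the marvelous repo is importable):

    # Style 1: import the module (the comic.py file) and qualify the class name
    import comic
    schema = comic.ComicSchema()

    # Style 2: import the class name directly out of the module; this reads
    # closer to the C# "using" habit my brain started from
    from comic import ComicSchema
    schema = ComicSchema()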

Status: Learning

So far, I’m feeling (a) stupid that I have to admit these were not things with which I sprang from the womb, (b) grateful Python’s not *more* punishing, (c) smart-ish that fundamental debugging is something I’ve still retained and (d) good that I can pass along these lessons to other folks like me.

Coding Again? Experimenting with the Marvel API

I’ve been hanging around developers *entirely* too much lately.

These days I find myself telling myself the story that unless I get back into coding, I’m not going to be relevant in the tech industry any longer.

Hanging out (aka volunteering) at developer-focused conferences will do that to you.

Volunteering on open source projects will do that to you (jQuery Foundation‘s infrastructure team).

Interviewing for engineering-focused Product Owner and Technical Product Manager roles will do that to you. (Note: when did “technical” become equivalent to “I actively code in my day job/spare time”?)

One of the hang-ups I have that keeps me from investing the immense amount of grinding time it takes to make working code is that I haven’t found an itch to scratch that bugs me enough that I’m willing to commit myself to the effort. Plenty of ideas float their way past my brain, but very few (like CacheMyWork) get me emotionally engaged enough to surmount the activation energy necessary to fight alone past all the barriers: lonely nights, painful problem articulation, lack of buddy to work on it, and general frustration that I don’t know all the tricks and vocabulary that most good coders do.

Well, it finally happened. I found something that should keep me engaged: creating a stripped-down search interface into the Marvel comics catalogue.  Marvel.com provides a search on their site but I:

  1. keep forgetting where they buried it,
  2. find it cumbersome and slow to use, and
  3. never know if the missing references (e.g. appearances of Captain Marvel as a guest in others’ comics that aren’t returned in the search results) are because the search doesn’t work, or because the data is actually missing

Marvel launched an API a couple of years ago – I heard about it at the time and felt excited that my favourite comics publisher had embraced the Age of APIs.  But didn’t feel like doing anything with it.

Fast forward two years: I’m a diehard user of Marvel Unlimited, my comics reading is about half-Marvel these days, and I’m spending a lot of time trying to weave together a picture of how the characters relate, when they’ve bumped into each other, what issue certain happenings occurred in, etc.

Possible questions I could answer if I write some code:

  • How socially-connected is Spidey compared with Wolverine?
  • When is the first appearance of any character?
  • What’s the chronological publication order of every comic crossover in any comics Event?

Possible language to use:

  • C# (know it)
  • F# (big hawtness at the .NET Fringe conf)
  • Python (feel like I should learn it)
  • Typescript (ES6 – like JavaScript with static types and other frustration-killers)
  • ScriptCS (a scriptable C#)

More important than choice of language though is availability of wrappers for the API – while I’m sure it would be very instructive to immediately start climbing the cliff of building “zero tech” code, I learn far faster when I have visible results, than when I’m still fiddling with getting the right types for my variables or trying to remember where and when to set the right kind of closing braces.

So for sake of argument I’m going to try out the second package I found – Robert Kuykendall’s “marvelous” python wrapper: https://github.com/rkuykendall/marvelous
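
For my own reference, here’s roughly what any wrapper has to do under the hood – Marvel’s documented auth scheme is a timestamp, your public key, and an MD5 hash of ts + private key + public key (keys below are obviously placeholders):

import hashlib
import time

import requests

PUBLIC_KEY = 'my-public-key'    # placeholder
PRIVATE_KEY = 'my-private-key'  # placeholder

def search_characters(name_starts_with):
    # Query the Marvel API's /characters endpoint directly
    ts = str(int(time.time()))
    digest = hashlib.md5((ts + PRIVATE_KEY + PUBLIC_KEY).encode('utf-8')).hexdigest()
    resp = requests.get(
        'https://gateway.marvel.com/v1/public/characters',
        params={'nameStartsWith': name_starts_with, 'ts': ts,
                'apikey': PUBLIC_KEY, 'hash': digest})
    resp.raise_for_status()
    return resp.json()['data']['results']

for character in search_characters('Captain Marvel'):
    print(character['name'], '-', character['comics']['available'], 'comics')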

See you when I’ve got something to report.

What I learned today about server-side programming

Attended PDX Web & Design meetup tonight, where Eric Redmond delivered a primer on modern server-side programming. What’s cool about Eric’s talks is he gives us a smattering of the stuff that’s he’s touched in the last few months – and he’s got a voracious appetite for fun, productive tools that support the lazy programmer. (Kindred spirit here – don’t make me do anything more than I absolutely have to to get the job done.)

Here’s the notes I took tonight as Eric was entertaining me (and the rest of the room). Mostly noted what technologies he mentioned, a few key lessons, and a few opinions that resonated with me.

http://www.socketstream.com (& GitHub)
– CoffeeScript
– Socketracer.com
– “ws://” protocol
– Server-side frameworks:
— PHP (easy/messy – bad programming practices)
— Express.js (difficult) – built on Node.js
— Sinatra (just right – “Ruby on Rails’ little brother”)

– CMD shell: SharpEnviro (win), iTerm (mac)
– txt editor: Notepad++ (win), Sublime Edit (mac) — or Sublime for windows
– httpd for win: WinLAMP or XAMPP
– Codeacademy.com/Tracks/JavaScript
– heroku.com – hosting provider for hosting Express.js apps (and tons more frameworks)
– Express.js takes care of implementing things like sockets that Node.js makes you write on your own
– HTML templating languages = jade, slim
— jade templates remove closing tags; instead of “end block” tags it uses indenting to define hierarchy
– Sinatra – ruby-lang.org, sinatrarb.com, rubygems.org
– erlang is now Eric Redmond’s favoured programming language (doesn’t recommend it)
– TryRuby.org – teach yourself Ruby from scratch
– larger-scale frameworks (generally MVC): RailwayJS (RailwayJS.com), Django (www.djangoproject.com), Ruby on Rails (rubyonrails.org)
– CMS’s: refinery, redmine, typo3

“Learning to program teaches you how to think. Computer science is a liberal art.” — Steve Jobs

Considering learning Python – idle thought (until something catalyses it)

A friend-of-a-friend asked me this week:

Hi Mike, [mutual friend] referred me to you as a good person to ask: what’s the best way to learn Python for someone like me, whose programming skills are essentially 1990-era (I know Perl and C, but haven’t made the leap to object oriented stuff)? I’d like to leapfrog into the present era, and make web 2.0-ish-looking sites and experiments. Is there a particular web hosting service I should use? Thanks for any advice you might have.

Funny you should ask – I was just wondering this week whether I shouldn’t dive into Python as a quick-and-dirty prototyping language.  I’d fancied myself for years as someone who might be able to reinvent myself as a programmer, and I’ve muddled around with C# and VB for a few years now – but only in short spurts.  Every time I come back to it, I feel like I’m climbing a steep hill all over again.
For some reason, I get the impression that for folks who are just whipping something together quickly, interpreted scripting languages like Python, Perl or Javascript are easier to deal with – less overhead, less setup, just diving in and getting to the business of making something happen.  I’ve always felt like if I wanted to call myself a coder, I’d be “cheating” by taking this route, so I never allowed myself the freedom to try this out.  But at the same time, I was never brave/patient enough to mess around with low-level code like C or C++ (who wants to write hundreds of lines of memory-handling routines that managed code gives you for ‘free’?), so I guess I’ve left myself between a rock and a hard place – not quite as easy as “just do it” but not really forcing myself to learn the really “worthy” stuff either.

How to learn Python?  An almost-colleague of mine took the leap and blogged his process for going deep – start at the bottom:

I’ve never done it myself, but I trust that Mark is a smart guy who doesn’t muck around for the sake of making himself “look smart” (feel miserable).

For me, forcing myself to learn to code was an exercise in frustrating false starts – until I found a problem I couldn’t solve any way but coding it myself, and a problem that pissed me off enough to keep slogging through failures and dead ends until I got something working.

Web hosting?  No idea.  I know a few big names (AWS, Rackspace, Google Apps) but I have no clue where to get the pre-built infrastructure to just upload .py and let fly.

Is this helpful?  Do you have something specific in mind?  If you’re working on something specific and looking to work loosely with one or a few others, I’d be interested in hearing what it is and whether it fires my “that sucks!” instinct enough to want to contribute/walk alongside.