AWS Tutorial wrangling, effort 2: HelloWorld via CodeDeploy

I’m taking a crash course in DevOps this winter, and our instructor assigned us a trivial task: get Hello World running in S3.

I found this tutorial (tantalizingly named “Deploy a Hello World Application with AWS CodeDeploy (Windows Server)”), figured it looked close enough (and if not, it’d keep me limber and help me narrow in on what I *do* need) so I foolishly dove right in.

TL;DR I found the tutorial damnably short on explicit clarity – lots of references to other tutorials, and plenty of incomplete or vague instructions, it seems this was designed by someone who’s already overly-familiar with AWS and didn’t realize the kinds of ambiguities they’d left behind.

I got myself all the way to Step 3 and was faced with this error – little did I know this was just the first of many IAM mysteries to solve:

Mac4Mike:aws mike$ aws iam attach-role-policy --role-name CodeDeployServiceRole --policy-arn arn:aws:iam::aws:policy:/service-role/AWSCodeDeployRole
An error occurred (AccessDenied) when calling the AttachRolePolicy operation: User: arn:aws:iam::720781686731:user/Mike is not authorized to perform: iam:AttachRolePolicy on resource: role CodeDeployServiceRole

Back the truck up, Mike

Hold on, what steps preceded this stumble?

CodeDeploy – Getting Started

Well, first I pursued the CodeDeploy Getting Started path:

  • Provision an IAM user
    • It wasn’t clear from the referring tutorials, so I learned by trial and error to create a user with CLI permissions (Access/Secret key authN), not Console permissions
  • Assigning them the specified policies (to enable CodeDeploy and CloudFormation)
  • Creating the specifiedCodeDeployServiceRole service role
    • The actual problem arose here, where I ran the command as specified in the guide

I tried this with different combinations of user context (Mike, who has all access to EC2, and Mike-CodeDeploy-cli, who has all the policies assigned in Step 2 of Getting Started) AND the –policy-arn parameter (both the Getting Started string and the one dumped out by the aws iam create-role command (arn:aws:iam::720781686731:role/CodeDeployServiceRole)).

And literally, searching on this error and variants of it, there appear to be no other people who’ve ever written about encountering this.  THAT’s a new one on me.  I’m not usually a trailblazer (even of the “how did he fuck this up *that* badly?” kind of trailblazing…)

OK, so then forget it – if the CLI + tutorial can’t be conquered, let’s try the Console-based tutorial steps.   [Note: in both places, they state it’s important that you “Make sure you are signed in to the AWS Management Console with the same account information you used in Getting Started.”  Why?  And what “account information” do they mean – the user with which you’re logged into the web console, or the user credentials you provisioned?]

I was able to edit the just-created CodeDeployServiceRole and confirm all the configurations they specified *except* where they state (in step 4), “On the Select Role Type page, with AWS Service Roles selected, next to AWS CodeDeploy, choose Select.”  Not sure what that means (which should’ve pulled me in the direction of “delete and recreate this role”), but I tried it out as-is anyway.  [The only change I had to make so far was to attach AWSCodeDeployRole.]

Reading up on the AttachRolePolicy action, it appears that –policy-arn refers to the permissions you wish to attach to the targeted role, and –role-name refers to the role getting additional permissions.  That would mean I’m definitely meant to attach “arn:aws:iam::aws:policy/service-role/AWSCodeDeployRole” policy to CodeDeployServiceRole.  (Still doesn’t explain why I lack the AttachRolePolicy permission in either of the IAM Users I’ve defined, nor how to add that permission.)

Instead, with no help from any online docs or discussions, I discovered that it’s possible to assign the individual permission by starting with this interface: https://console.aws.amazon.com/iam/home?#/policies:

  • Click Create Policy
  • Select Policy Generator
  • AWS Service: Select AWS Identity and Access Management
  • Actions: Attach Role Policy
  • ARN: I tried constructing two policies with (arn:aws:iam::aws:policy/service-role/AWSCodeDeployRole) and (arn:aws:iam::720781686731:role/CodeDeployServiceRole)
    • The first returned “AccessDenied” and with the second, the command I’m fighting with returned “InvalidInput”

Then I went to the Users console:

  • select the user of interest (Mike-CodeDeploy-cli in my journey)
  • click Add Permissions
  • select “Attach existing policies directly”
  • select the new Policy I just created (where type = Customer managed, so it’s relatively easy to spot among all the other “AWS managed” policies)

As I mentioned, the second construction returned this error to the command:

An error occurred (InvalidInput) when calling the AttachRolePolicy operation: ARN arn:aws:iam::720781686731:role/CodeDeployServiceRole is not valid.

Nope, wait, dammit – the ARN is the *policy*, not the *object* to which the policy grants permission…

Here’s where I started ranting to myself…

Tried it a couple more times, still getting AccessDenied, so screw it.  [At this point I conclude AWS IAS is an immature dog’s breakfast – even with an explicit map you still end up turned in knots.]

So I just went to the Role CodeDeployServiceRole and attached both policies (I’m pretty sure I only need to attach the policy AWSCodeDeployRole but I’m adding the custom AttachRolePolicy-CodeDeployRole because f it I just need to get through this trivial exercise).

[Would it kill the folks at AWS to draw a friggin picture of how all these capabilities with their overlapping terminology are related?  Cause I don’t know about you, but I am at the end of my rope trying to keep these friggin things straight.  Instead, they have a superfluous set of fragmented documented and tutorials, which it’s clear they’ve never usability tested end-to-end, and for which they assume way too much existing knowledge & context.]

I completed the rest of the InstanceProfile creation steps (though I had to create a second one near the end, because the console complained I was trying to create one that already existed).

CodeDeploy – create a Windows instance the “easy” way

Then of course we’re on to the fun of creating a Windows Instance in AWS.  Brave as I am, I tried it with CloudFormation.

I grabbed the CLI command and substituted the following two Parameter values in the command:

  • –template-url:

    ttp://s3-us-west-2.amazonaws.com/aws-codedeploy-us-west-2/templates/latest/CodeDeploy_SampleCF_Template.json (for the us-west-2 region I am closest to)

  • Parameter-Key=KeyPairName: MBP-2009 (for the .pem file I created a while back for use in SSH-managing all my AWS operations)

The first time I ran the command it complained:

You must specify a region. You can also configure your region by running "aws configure".

So I re-ran aws configure and filled in “us-west-2” when it prompted for “Default region name”.

Second time around, it spat out:

{
    "StackId": "arn:aws:cloudformation:us-west-2:720781686731:stack/CodeDeployDemoStack/d1c817d0-d93e-11e6-8ee1-503f20f2ade6"
}

They tell us not to proceed until this command reports “CREATE_COMPLETED”, but wow does it take a while to stop reporting “None”:

aws cloudformation describe-stacks --stack-name CodeDeployDemoStack --query "Stacks[0].StackStats" --output text

When I went looking at the cloudformation console (blame it on lack of patience), it reported my instance(s)’ status was “ROLLBACK_COMPLETE”.  Now, I’m no AWS expert, but that doesn’t sound like a successful install to me.  I headed to the details, and of course something else went horribly wrong:

  • CREATE_FAILED – AWS::EC2::Instance – API: ec2:RunInstances Not authorized for images: [ami-7f634e4f]

CodeDeploy – create a Windows instance the “hard” way

So let’s forget the “easy” path of CloudFormation.  Try the old-fashioned way of creating a Windows instance, and see if I can make it through this:

  • Deciding among Windows server AMI’s is a real blast – over 600 of them!
  • I narrowed it down to the “Windows_Server-2016-English-Full-Base-2016.12.24”
    • Nano is only available to Windows Assurance customers
    • Enterprise is way more than I’d need to serve a web page
    • Full gives you the Windows GUI to manage the server, whereas Core only includes the PowerShell (and may not even allow RDP access)
    • I wanted to see what the Manage Server GUI looks like these days, otherwise I probably would’ve tried Core
    • Note: there were three AMI all prefixed “Windows_Server-2016-English-Full-Base”, I just chose the one with the latest date suffix (assuming it’s slightly more up-to-date with patches)
  • I used the EC2 console to get the Windows password, then installed the Microsoft Remote Desktop client for Mac to enable me to interactively log in to the instance

Next is configuring the S3 bucket:

  • There is some awfully confusing and incomplete documentation here
  • There are apparently two policies to be configured, with helpful sample policies, but it’s unclear where to go to attach them, or what steps to take to make sure this occurs
  • It’s like the author has already done this a hundred times and knows all the steps by heart, but has forgotten that as part of a tutorial, the intended audience are people like me who have little or no familiarity with the byzantine interfaces of AWS to figure out where to attach these policies [or any of the other hundred steps I’ve been through over the last few weeks]
  • I *think* I found where to attach the first policy (giving permission to the Amazon S3 Bucket) – I attached this policy template (substituting both the AWS account ID [111122223333] and bucket name [codedeploydemobucket] for the ones I’m using):
    { "Statement": [ { "Action": ["s3:PutObject"], "Effect": "Allow", "Resource": "arn:aws:s3:::codedeploydemobucket/*", "Principal": { "AWS": [ "111122223333" ] } } ] }
  • I also decided to attach the second recommended policy to the same bucket as another bucket policy:
    { "Statement": [ { "Action": ["s3:Get*", "s3:List*"], "Effect": "Allow", "Resource": "arn:aws:s3:::codedeploydemobucket/*", "Principal": { "AWS": [ "arn:aws:iam::80398EXAMPLE:role/CodeDeployDemo" ] } } ] }
  • Where did I finally attach them?  I went to the S3 console, clicked on the bucket I’m going to use (called “hacku-devops-testing”), selected the Properties button, expanded the Permissions section, and clicked the Add bucket policy button the first time.  The second time, since it would only allow me to edit the bucket policy, I tried Add more permissions – but that don’t work, so I tried editing the damned bucket policy by hand and appending the second policy as another item in the Statement dictionary – after a couple of tries, I found a combination that the AWS bucket policy editor would accept, so I’m praying this is the intended combination that will all this seductive tutorial to complete:
    {
     "Version": "2008-10-17",
     "Statement": [
     {
     "Effect": "Allow",
     "Principal": {
     "AWS": "arn:aws:iam::720781686731:root"
     },
     "Action": "s3:PutObject",
     "Resource": "arn:aws:s3:::hacku-devops-testing/*"
     },
     {
     "Action": [
     "s3:Get*",
     "s3:List*"
     ],
     "Effect": "Allow",
     "Resource": "arn:aws:s3:::hacku-devops-testing/*",
     "Principal": {
     "AWS": [
     "arn:aws:iam::720781686731:role/CodeDeployDemo-EC2-Instance-Profile"
     ]
     }
     }
     ]
    }

CodeDeploy – actually deploying code

I followed the remaining commands (closely – gotta watch every parameter and fill in the correct details, ugh).  But thankfully this was the trivial part.  [I guess they got the name “CodeDeploy” right – it’s far more attractive than “CodeDeployOnceYouFoundTheLostArkOfTheCovenantToDecipherIAMIntricacies”.]

Result

Success!  Browsing to the public DNS of the EC2 instance showed me the Hello World page I’ve been trying to muster for the past three days!

Conclusion

This tutorial works as a demonstration of how to marshall a number of contributing parts of the AWS stack: CodeDeploy (whose “ease of deployment” benefits I can’t yet appreciate, considering how labourious and incomplete/error-prone this tutorial was), IAM (users, roles, groups, policies), S3, EC2 and AMI.

However, as a gentle introduction to a quick way to get some static HTML on an AWS endpoint, this is a terrible failure.  I attacked this over a span of three days.  Most of my challenges were in deciphering the mysteries of IAM across the various layers of AWS.

In a previous life I was a security infrastructure consultant, employed by Microsoft to help decipher and troubleshoot complex interoperable security infrastructures.  I prided myself on being able to take this down to the lowest levels and figure out *exactly* what’s going wrong.  And while I was able to find *a* working pathway through this maze, my experience here and my previous expertise tells me that AWS has a long way to go to make it easy for AWS customers to marshall all their resources for secure-by-default application deployments.  [Hell, I didn’t even try to enhance the default policies I encountered to limit the scope of what remote endpoints or roles would have access to the HTTP and SSH endpoints on my EC2 instance.  Maybe that’s a lesson for next time?]

 

Advertisements

Update my Contacts with Python: using pyobjc, Contacts.app & vCards, Swift or my own two hands?

I’m still on a mission to update my iCloud Contacts using PyiCloud to consolidate the data I’ve retrieved from LinkedIn.  Last time I convinced myself to add an update_contact() function to a fork of PyiCloud’s contacts module, and so far I haven’t had any nibbles on the issue I’ve filed in the PyiCloud project a couple of days ago.

I was looking further at the one possibly-working pattern in the PyiCloud project that appears to implement a write back to the iCloud APIs: the reminders module with its post() method.  What’s interesting to me is that in that method, the JSON submitted in the data parameter includes the key:value pair “etag”: None.  I gnashed my teeth over how to construct a valid etag in my last post, and this code implies to me (assuming it’s still valid and working against the Reminders API) that the etag value is optional (well, the key must be specified, but the complicated value may not be needed).

Knowing that this sounds too easy, I watched a new Reminder getting created through the icloud.com web client, and sure enough Chrome Dev Tools shows me that in the Request Payload, etag is set to null.  Which really tells me nothing now about the requirement for the Contacts API…

Arrested Development

Knowing that this was going to be a painful brick wall to climb, I decided to pair up with a python expert to look for ways to dig out from this deep, dark hole.  Lucky me, I have a good relationship with the instructor from my python class from late last year.  We talked about where I am stuck and what he’d recommend I do to try to break through this issue.

His thinking?  He immediately abandoned the notion of deciphering an undocumented API and went looking around the web for docs and alternatives.  Turns out there are a couple of options:

  1. Apple has in its SDKs a Contacts framework that supports Swift and Objective-C
  2. There are many implementations of Python & other languages that access the MacOS Contacts application (Contacts.app)

Contacts via Objective-C on MacOS

  • Contacts Framework is available in XCode
  • There appears to be a bidirectional bridge between Python and Objective-C
  • There is further a wrapper for the Contacts framework (which gets installed when you run pip install pyobjc)
  • But sadly, there is nothing even resembling a starter kit example script for instantiating and using the Contacts framework wrapper

Contacts via Contacts.app on MacOS

  • We found a decent-looking project (VObject) that purports to access VCard files, which is the underlying  data layout for import/export from Contacts.app
  • And another long-lived project (vcard) for validating VCards
  • This means I would have to manually import VCard file(s) into Contacts.app, and would still have to figure out how Contacts.app knows how to match/overwrite an imported Contact with an existing Contact (or I’ll always be backing up, deleting and importing)
  • HOWEVER, in exploring the content of the my Contacts.app and comparing to what I have in my iPhone Contacts, there’s definitely something extra going on here
    • I have at least one contact displayed in Contacts.app who is neither listed in my iPhone/iCloud contacts nor Google Contacts – given the well-formed LinkedIn data in the contact record, I’m guessing this is being implicitly included via Internet Accounts (the LinkedIn account configured here):
      screenshot-2017-01-06-09-45-24
    • What would happen if I imported a vCard with the same UID (the iCloud UUID)?
    • What would happen if I imported a vCard that exists in both iCloud and LinkedIn – would the iCloud (U)UID correctly match and merge the vCard to the right contact, or would we get a duplicate?
  • Here at least I see others acknowledge that it’s possible to create non-standard types for ADR, TEL (and presumably email and URL types, if they’re different).
  • Watch out: if you have any non-ASCII characters in your Address Book, exporting will generate the output as UTF-16.
  • Watch out: here’s a VObject gotcha.

Crazy Talk: Swift?

  • I *could* go learn enough Swift to interface with the JSON data I construct in Python
  • There’s certainly a plethora of articles (iOS-focused) and tutorials to help folks use the Contacts framework via Swift – which all seem to assume you want to build a UI app (not just a script) – I guess I understand the bias, but boy do I feel left out just wanting to create a one-time-use script to make sure I don’t fat-finger something and lose precious data (my wetware memory is lossy enough as it is)

Conclusion: Park it for now

What started out as a finite-looking piece of work to pull LinkedIn data into my current contacts of record, turned into a never-ending series of questions, murky code pathfinding  and band-aiding multiple technologies together to do something that should ostensibly be fairly straightforward.

Given that the best options I have at this point are (a) reverse-engineer an undocumented Apple API, (b) try to leverage an Objective-C bridge that *no one* else has tried to use for this new Contacts framework, or (c) decipher how Contacts.app interacts in the presence of vCards and all the interlocking contacts services (iCloud, Google, LinkedIn, Facebook)…I’m going to step away from this for a bit, let my brain tease apart what I’m willing to do for fun and how much more effort I’m willing to put in for fun/”the community”, and whether I’ve crossed The Line for a one-time effort and should just manually enter the data myself.

Notes to self: installing Vagrant via homebrew

It’s as easy as

brew install vagrant

Ha! Nice try there, buddy:

Error: No available formula with the name "vagrant" 
==> Searching for similarly named formulae...
Error: No similarly named formulae found.
==> Searching taps...
These formulae were found in taps:
homebrew/completions/vagrant-completion  Caskroom/cask/vagrant-manager
Caskroom/cask/vagrant-bar                Caskroom/cask/vagrant
To install one of them, run (for example):
  brew install homebrew/completions/vagrant-completion

According to this discussion, it’s not allowed by homebrew but the homebrew-cask project enables users to install vagrant.  (So *that’s* why I couldn’t get Chrome, Dropbox, 1Password, VLC and other apps installed view homebrew – there’s some rule or constraint in homebrew that only enables non-GUI apps.)

So let’s install the Cask tools – this article makes me feel silly even – you just have to know to use the “cask” keyword, as in:

brew cask install vagrant

Thus do I get vagrant v1.9.1 in one (slightly unexpected) command line.  (And don’t forget virtualbox and vagrant-manager!)

Which is super-cool, because many package managers end up with non-current builds of the tools in their catalogs.

And as a bonus, they mention a bunch of quicklook plugins that I never even thought to go looking for – markdown, syntax-highlighted code,  JSON, CSV and more!

Occupied Neurons, February edition

Threaded messaging comes to Slack

https://slackhq.com/threaded-messaging-comes-to-slack-417ffba054bd#.no3gqihm5

Not a freakin moment too soon. One of my all-time top complaints of IRC (Hipchat, Slack) is the impossible-to-skim for relevancy problem – when there’s dozens of messages a day, all of them treated with the same level of a flat hierarchy of information, how do you figure out which to ignore, without reading each one (or just declaring bankruptcy on a regular basis)?

Docker in Production: a History of Failure

https://thehftguy.com/2016/11/01/docker-in-production-an-history-of-failure/

I read this article a couple of months ago and had the hardest time tracking it down again. It’s inflammatory, strident and probably over-emphasizes the problems vs. benefits…BUT, I still think it’s a good read. We technologists need to pursue new technologies with both eyes wide open, so we can mitigate the risks, especially when problems arise.

Top JavaScript Frameworks and Topics to Learn in 2017

https://medium.com/javascript-scene/top-javascript-frameworks-topics-to-learn-in-2017-700a397b711

Simon Sinek on Millenials in the Workplace

What Makes a Team Great

http://www.barryovereem.com/what-makes-a-team-great/

The Greatest Sales Deck I’ve Ever Seen

https://www.linkedin.com/pulse/greatest-sales-deck-ive-ever-seen-andy-raskin

The Most Popular DevOps Stories of 2016

https://medium.com/@eon01/the-most-popular-devops-stories-in-2016-954d10698d67#.7c13xaimk

Update my Contacts with Python: thinking about how far to extend PyiCloud to enable PUT request?

I’m on a mission to use PyiCloud to update my iCloud Contacts with data I’m scraping out of LinkedIn, as you see in my last post.

From what I can tell, PyiCloud doesn’t currently implement support for editing existing Contacts.  I’m a little out of my depth here (constructing lower-level requests against an undocumented API) and while I’ve opened an issue with PyiCloud (on the off-chance someone else has dug into this), I’ll likely have to roll up my sleeves and brute force this on my own.

[What the hell does “roll up my sleeves” refer to anyway?  I mean, I get the translation, but where exactly did this start?  Was this something that blacksmiths did, so they didn’t burn the cuffs of their shirts?  Who wears a cuffed shirt when blacksmithing?  Why wouldn’t you go shirtless when you’re going to be dripping with sweat?  Why does one question always lead to a half-dozen more…?]

Summary: What Do I Know?

  • LinkedIn’s Contacts API can dump most of the useful data about each of your own Connections – connectionDate, profileImageUrl, company, title, phoneNumbers plus Tags (until this data gets EOL’d)
  • LinkedIn’s User Data Archive can supplement with email address (for the foreseeable) and Notes and Tags (until this data gets EOL’d)
  • I’ve figured out enough code to extract all the Contacts API data, and I’m confident it’ll be trivial to match the User Data Archive info (slightly less trivial when those fields are already populated in the iCloud Contact)
  • PyiCloud makes it darned easy to successfully authenticate and read in data from the iCloud contacts – which means I have access to the contactID for existing iCloud Contacts
  • iCloud appears to use an idempotent PUT request to write changes to existing Contacts, so that as long as all required data/metadata is submitted in the request, it should be technically feasible to push additional data into my existing Contacts
  • It appears there are few if any required fields in any iCloud Contact object – the fields I have seen submitted for an existing Contact include firstName, middleName, lastName, prefix, suffix, isCompany, contactId and etag – and I’m not convinced that any but contactID are truly necessary (but instead merely sent by the iCloud.com web client out of “habit”)
  • The PUT operation includes a number of parameters on the request’s querystring:
    • clientBuildNumber
    • clientId
    • clientMasteringNumber
    • clientVersion
    • dsid
    • method
    • prefToken
    • syncToken
  • There are a large number of cookies sent in the request:
    • X_APPLE_WEB_KB–QNQ-TAKYCIDWSAXU3JXP7DXMBG
    • X-APPLE-WEBAUTH-HSA-TRUST
    • X-APPLE-WEBAUTH-LOGIN
    • X-APPLE-WEBAUTH-USER
    • X-APPLE-WEBAUTH-PCS-Cloudkit
    • X-APPLE-WEBAUTH-PCS-Documents
    • X-APPLE-WEBAUTH-PCS-Mail
    • X-APPLE-WEBAUTH-PCS-News
    • X-APPLE-WEBAUTH-PCS-Notes
    • X-APPLE-WEBAUTH-PCS-Photos
    • X-APPLE-WEBAUTH-PCS-Sharing
    • X-APPLE-WEBAUTH-VALIDATE
    • X-APPLE-WEB-ID
    • X-APPLE-WEBAUTH-TOKEN

Questions I have that (I believe) need an answer

  1. Are any of the PUT request’s querystring parameters established per-session, or are they all long-lived “static” values that only change either per-user or per-version of the API?
  2. How many of the cookies are established per-user vs per-session?
  3. How many of the cookies are being marshalled already by PyiCloud?
  4. How many of the cookies are necessary to successfully PUT a Contact?
  5. How do I properly add the request payload to a web request using the PyiCloud functions?  How’s about if I have to drop down to the requests package?

So let’s run these down one by one (to the best of my analytic ability to spot the details).

(1) PUT request querystring parameter lifetime

When I examine the request parameters submitted on two different days (but using the same Chrome process) or across two different browsers (but on the same day), I see the following:

  1. clientBuildNumber is the same (16HProject79)
  2. clientMasteringNumber is the same (16H71)
  3. clientVersion is the same (2.1)
  4. dsid is the same (197715384)
  5. method is obviously the same (PUT)
  6. prefToken is the same (914266d4-387b-4e13-a814-7e1b29e001c3)
  7. clientId uses a different UUID (C1D3EB4C-2300-4F3C-8219-F7951580D3FD vs. 792EFA4A-5A0D-47E9-A1A5-2FF8FFAF603A)
  8. syncToken is somewhat different (DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1432 vs. DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1427)
    • which if iCloud is using standard URL encoding translates to DAVST-V1-p28-FT=-@RU=afe27ad8-80ce-4ba8-985e-ec4e365bc6d3@S=1427
    • which means the S variable varies and nothing else

Looking at the PyiCloud source, I can find places where PyiCloud generates nearly all the params:

  • base.py: clientBuildNumber (14E45), dsid (from server’s authentication response), clientId (a fresh UUID on each session)
  • contacts.py: clientVersion (2.1), prefToken (from the refresh_service() function), syncToken (from the refresh_service() function)

Since the others (clientMasteringNumber, method) are static values, there are no mysteries to infer in generating the querystring params, just code to construct.

Further, I notice that the contents of syncToken is nearly identical to the etag in the request payload:

syncToken: DAVST-V1-p28-FT=-@RU=afe27ad8-80ce-4ba8-985e-ec4e365bc6d3@S=1436
etag: C=1435@U=afe27ad8-80ce-4ba8-985e-ec4e365bc6d3

This means not only that (a) the client and/or server are incrementing some value on some unknown cadence or stepping function, but also that (b) the headers and the payload have to both contain this value.  I don’t know if any code in PyiCloud has performed this (b) kind of coordination elsewhere, but I haven’t noticed evidence of it in my reviews of the code so far.

It should be easy enough to extract the RU and S param values from syncToken and plop them into the C and U params of etag.

ISSUE

The only remaining question is, does etag’s C param get strongly validated at the server (i.e. not only that it exists, and is a four-digit number, but that its value is strongly related to syncToken’s S param)?  And if so, what exactly is the algorithm that relates C to S?  In my anecdotal observations, I’ve noticed they’re always slightly different, from off-by-one to as much as a difference of 7.

(2) How many cookies are established per-session?

Of all the cookies being tracked, only these are identical from session to session:

  • X-APPLE-WEBAUTH-USER
  • X-APPLE-WEB-ID

The rest seem to start with the same string but diverge somewhere in the middle, so it’s safe to say each cookie changes from session to session.

 

(3) How many cookies are marshalled by PyiCloud?

I can’t find any of these cookies being generated explicitly, but I did notice the base.py module mentions X-APPLE-WEBAUTH-HSA-TRUST in a comment (“Re-authenticate, which will both update the 2FA data, and ensure that we save the X-APPLE-WEBAUTH-HSA-TRUST cookie.”) and fingers X-APPLE-WEBAUTH-TOKEN in an exception thrower (“reason == ‘Missing X-APPLE-WEBAUTH-TOKEN cookie'”), so presumably most or all of these are being similarly handled.

I tried for a bit to get PyiCloud to cough up the cookies sent down from iCloud during initial session setup, but didn’t get anywhere.  I also tried to figure out where they’re being cached on my filesystem, but I haven’t yet figured out where the user’s tmp directory lives on MacOS.

(4) How many cookies are necessary to successfully PUT a Contact?

This’ll have to wait to be answered until we actually start throwing code at the endpoint.

For now, it’s probably a reasonable assumption for now that PyiCloud is able to automatically capture and replay all cookies needed by the Contacts endpoint, until we run into otherwise-unexplained errors.

(5) How to add the request payload to endpoint requests?

I can’t seem to find any pattern in the PyiCloud code that already POSTs or PUTs a dictionary of data payload back to the iCloud services, so that may be out.

I can see that it should be trivial to attach the payload data to a requests.put() call, if we ignore the cookies and preceding authentication for a second.  If I’m reading the requests quickstart correctly, the PUT request could be formed like this:

import requests
url = 'https://p28-contactsws.icloud.com/co/contacts/card/'
data_payload = {"key1" : "value1", "key2" : "value2",  ...}
url_params = {"contacts":[{contact_attributes_dictionary}]}
r = requests.put(url, data = data_payload, params = url_params)

Where key(#s) includes clientBuildNumber, clientId, clientMasteringNumber, clientVersion, dsid, method, prefToken, syncToken, and contact_attributes_dictionary includes whichever fields exist or are being added to my Contacts (e.g. firstName, lastName, phones, emailAddresses, contactId) plus the possibly-troublesome etag.

What feels tricky to me is to try to leverage PyiCloud as far as I can and then drop to the reuqests package only for generating the PUT requests back to the server.  I have a bad feeling I might have to re-implement much of the contacts.py and/or base.py modules to actually complete authentication + cookies + PUT request successfully.

I do see the same pattern used for the authentication POST, for example (in base.py’s PyiCloudService class’ authenticate() function):

req = self.session.post(
 self._base_login_url,
 params=self.params,
 data=json.dumps(data)
 )

Extension ideas

This all leads me to the conclusion that, if PyiCloud is already properly handling authentication & cookies correctly, that it shouldn’t be too hard to add a new function to the contacts.py module and generate the URL params and the data payload.

update_contact()

e.g. define an update_contact() function:

def update_contact(self, contact_dict)

# read value of syncToken
# pull out the value of the RU and S params 
# generate the etag as ("C=" + str(int(s_param) - (increment_or_decrement)) + "@U=" + ru_param
# append etag to contact_dict
# read in session params from session object as session_params ???
# contacts_url = 'https://p28-contactsws.icloud.com/co/contacts/card/'
# req = self.session.post(contacts_url, params=session_params, data=json.dumps(contact_dict))

The most interesting/scary part of all this is that if the user [i.e. anyone but me, and probably even me as well] wasn’t careful, they could easily overwrite the contents of an existing iCloud Contact with a PUT that wiped out existing attributes of the Contact, or overwrote attributes with the wrong data.  For example, what if in generating the contact_dict, they forgot to add the lastName attribute, or they mistakenly swapped the lastName attribute for the firstName attribute?

It makes me want to wrap this function in all sorts of warnings and caveats, which are mostly ignored and aren’t much help to those who fat-finger their code.  And even to generate an offline, client-side backup of all the existing Contacts before making any changes to iCloud, so that if things went horribly wrong, the user could simply restore the backup of their Contacts and at least be no worse than when they started.

edit_contact()

It might also be advisable to write an edit_contact(self, contact_dict, attribute_changes_dict) helper function that at least:

  • takes in the existing Contact (presumably as retrieved from iCloud)
  • enumerated the existing attributes of the contact
  • simplified the formatting of some of the inner array data like emailAddresses and phones so that these especially didn’t get accidentally wiped out
  • (came up with some other validation rules – e.g. limit the attributes written to contact_dict to those non-custom attributes already available in iCloud, e.g. try to help user not to overwrite existing data unless they explicitly set a flag)

And all of this hand-wringing and risk management would be reduced if the added code implemented some kind of visual UI so that the user could see exactly what they were about to irreversibly commit to their contacts.  It wouldn’t eliminate the risk, and it would be terribly irritating to page through dozens of screens of data for a bulk update (in the hopes of noticing one problem among dozens of false positives), but it would be great to see a side-by-side comparison between “data already in iCloud” and “changes you’re about to make”.

At which point, it might just be easier for the user to manually update their Contacts using iCloud.com.

Conclusion

I’m not about to re-implement much of the logic already available in iCloud.com.

I don’t even necessarily want to see my code PR’d into PyiCloud – at least and especially not without a serious discussion of the foreseeable consequences *and* how to address them without completely blowing up downstream users’ iCloud data.

But at the same time, I can’t see a way to insulate my update_contact() function from the existing PyiCloud package, so it looks like I’m going to have to fork it and make changes to the contacts module.

Update my Contacts with Python: exploring LinkedIn’s and iCloud’s Contact APIs

TL;DR Wow is it an adventure to decipher how to interact with undocumented web services like I found on LinkedIn and iCloud.  Migrating data from LinkedIn to iCloud looks possible, but I got stuck at implementing the PUT operation to iCloud using Python.

Background: Because I have a shoddy memory for details about all the people I meet, and because LinkedIn appears to be de-prioritizing their role as a professional contact manager, I want to make my iPhone Contacts my system of record for all data about people I meet professionally.  Which means scraping as much useful data as possible from LinkedIn and uploading it to iCloud Contacts (since my people-centric data is currently centered more around my iPhone than a Google Contacts approach).

In our last adventure, I stumbled across the a surprisingly well-formed and useful API for pulling data from LinkedIn about my Connections:

https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

Available Data

Which upon inspection of the results, gives me a lot of the data I was hoping to import into my iCloud Contacts:

  • crucial: Date we first connected on LinkedIn (“connectionDate” as time-since-epoch), Tags (“tags” as list of dictionaries), Picture (“profileImageUrl” as URI), first name (“firstName” as string), last name (“lastName” as string)
  • want: current company (“company” as dictionary), current title (“title” as string)
  • metadata: phone number (“phoneNumbers” as dictionary)

What doesn’t it give?  Notes, Twitter ID, web site addresses, previous companies, email address.  [What else does it give that could be useful?  LinkedIn profile URL (“profileUrl” as the permanent URL, not the “friendly URL” that many of us have generated such as https://www.linkedin.com/in/mikelonergan.  I can see how it would be helpful at a meetup to browse through my iPhone contacts to their LinkedIn profile to refresh myself on their work history.  Creepy, desperate, but something I’ve done a few times when I’m completely blanking.]

What can I get from the User Data Archive?  Notes are found in the Contacts.csv, and email address is found in Connections.csv.  Matching those two files’ data together with what I can pull from the Contacts API shouldn’t be a challenge (concat firstName + lastName, and among the data set of my 684 contacts, I doubt I’ll find any collisions).  Then matching those records to my iCloud Contacts *should* be just a little harder (I expect to match 50% of my existing contacts by emailAddress, then another fraction by phone number; the rest will likely be new records for my Contacts, with maybe one or two that I’ll have to merge by hand at the end).

Planning the “tracer bullet”

So what’s the smallest piece of code I can pull together to prove this scenario actually works?  It’ll need at least these features (assumes Python):

  1. can authenticate to LinkedIn via at least one supported protocol (e.g. OAuth 2.0)
  2. can pull down the first 10 JSON records from Contacts API and hold them in a list
  3. can enumerate the First + Last Name and pull out “title” for that record
  4. can authenticate to iCloud
    • Note: I may need to disable 2-factor authentication that is currently enabled on my account
  5. can find a matching First + Last Name in my iCloud Contacts
  6. can write the title field to the iCloud contact
    • Note: I’m worried least about existing data for the title field
  7. can upload the revised record to iCloud so that it replicates successfully to my iPhone

That should cover all the essential operations for the least-complicated data, without having to worry about edge cases like “what if the contact doesn’t exist in iCloud” or “what if there’s already data in the field I want to fill”.

Step 1: authenticate to LinkedIn

There are plenty of packages and modules on Github for accessing LinkedIn, but the ones I’ve evaluated all use the REST APIs, with their dual-secrets authentication mechanism, to get at the data.  (e.g. this one, this one, that one, another one).

Or am I making this more complicated than it is?  This python module simply used username + password in their call to an HTTP ‘endpoint’.  Let’s assume that judicious use of the requests package is sufficient for my needs.

I thought I’d build an anaconda kernel and a jupyter notebook to experiment with the modules I’m looking at.   And when I attempted to install the requests package in my new Anaconda environment, I get back this error:

LinkError:
Link error: Error: post-link failed for: openssl-1.0.2j-0

Quick search turns up a couple of open conda issues that don’t give me any immediate relief. OK, forget this for a bit – the “root” kernel will do fine for the moment.

Next let’s try this code and see what we get back:

import requests
r = requests.get('https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007', auth=('mikethecanuck@gmail.com', 'linkthis'))
r.status_code

Output is simply “401”.  Dang, authentication wasn’t *quite* that easy.

So I tried that URL in an incognito tab, and it displays this to me without an existing auth cookie:

{"status":"Member is not Logged in."}

And as soon as I open another tab in that incognito window and authenticate to the linkedin.com site, the first tab with that contacts query returns the detailed JSON I was expecting.

Digging deeper, it appears that when I authenticate to https://www.linkedin.com through the incognito tab, I receive back one cookie labelled “lidc”, and that an “lidc” cookie is also sent to the server on the successful request to the contacts API.

But setting the cookie manually with the value returned from a previous request still leads to 401 response:

url = 'https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007'
cookies = dict(lidc="b=OGST00:g=43:u=1:i=1482261556:t=1482347956:s=AQGoGetJeZPEDz3sJhm_2rQayX5ZsILo")
r2 = requests.get(url, cookies=cookies)

I tried two other approaches that people have used in the past – some even successfully with certain pages on LinkedIn – but eventually I decided that I’m getting ratholed on trying to reverse-engineer an undocumented (and more than likely unusually-constructed) API, when I can quite easily dump the data out of the API by hand and then do the rest of my work successfully.  (Yes I know that disqualifies me as a ‘real coder’, but I think we both know I was never going to win that medal – but I will win the medal for “results-oriented” not “pedantically chasing my tail”.)

Thus, knowing that I’ve got 684 connections on LinkedIn (saw that in the footer of a response), I submitted the following queries and copy-pasted the results into 4 separate .JSON files for offline processing:

https://www.linkedin.com/connected/api/v2/contacts?start=0&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

https://www.linkedin.com/connected/api/v2/contacts?start=200&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

https://www.linkedin.com/connected/api/v2/contacts?start=400&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

https://www.linkedin.com/connected/api/v2/contacts?start=600&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

Oddly, the four sets of results contain 196, 198, 200 and 84 items – they assert that I have 684 connections, but can only return 678 of them?  I guess that’s one of the consequences of dealing with a “free” data repository (even if it started out as mine).

Step 2: read the JSON file and parse a list of connections

I’m sure I could be more efficient than this, but as far as getting a working result, here’s the arrangement of code I used to start accessing structured list data from the Contacts API output I shunted to a file:

import json
import os
contacts_file = open("Connections-API-results.json")
contacts_data = contacts_file.read()
contacts_json = json.loads(contacts_data)
contacts_list = contacts_json['values']

Step 3: pulling data out of the list of connections

It turns out this is pretty easy, e.g.:

for contact in contacts_list:
 print(contact['name'], contact['title'])

Messing around a little further, trying to make sense of the connectionDate value from each record, I found that this returns an ISO 8601-style date string that I can use later:

import time
print(strftime("%Y-%m-%d", time.localtime(contacts_list[15]['connectionDate'] / 1000)))

e.g. for the record at index “15”, that returned 2007-03-15.

Data issue: it turns out that not all records have a profileImageUrl key (e.g. for those oddball security geeks among my contacts who refuse to publish a photo on their LinkedIn profile), so I got to handle my first expected exception 🙂

Assembling all the useful data for all my Connections I wanted into a single dictionary, I was able to make the following work (as you can find in my repo):

stripped_down_connections_list = []

for contact in contacts_list:
 name = contact['name']
 first_name = contact['firstName']
 last_name = contact['lastName']
 title = contact['title']
 company = contact['company']['name']
 date_first_connected = time.strftime("%Y-%m-%d", time.localtime(contact['connectionDate'] / 1000))

picture_url = ""
 try:
 picture_url = contact['profileImageUrl']
 except KeyError:
 pass

tags = []
for i in range(len(contact['tags'])):
tags.append(contact['tags'][i]['name'])

phone_number = ""
try:
 phone_number = {"type" : contact['phoneNumbers'][0]['type'], 
 "number" : contact['phoneNumbers'][0]['number']}
except IndexError:
 pass

stripped_down_connections_list.append({"firstName" : contact['firstName'], 
 "lastName" : contact['lastName'], 
 "title" : contact['title'], 
 "company" : contact['company']['name'],
 "connectionDate" : date_first_connected, 
 "profileImageUrl" : picture_url,
 "tags" : tags,
 "phoneNumber" : phone_number,})

Step 4: Authenticate to iCloud

For this step, I’m working with the pyicloud package, hoping that they’ve worked out both (a) Apple’s two-factor authentication and (b) read/write operations on iCloud Contacts.

I setup yet another jupyter notebook and tried out a couple of methods to import PyiCloud (based on these suggestions here), at least one of which does a fine job.  With picklepete’s suggested 2FA code added to the mix, I appear to be able to complete the authentication sequence to iCloud.

APPLE_ID = 'REPLACE@ME.COM'
APPLE_PASSWORD = 'REPLACEME'

from importlib.machinery import SourceFileLoader

foo = SourceFileLoader("pyicloud", "/Users/mike/code/pyicloud/pyicloud/__init__.py").load_module()
api = foo.PyiCloudService(APPLE_ID, APPLE_PASSWORD)

if api.requires_2fa:
    import click
    print("Two-factor authentication required. Your trusted devices are:")

    devices = api.trusted_devices
    for i, device in enumerate(devices):
        print(" %s: %s" % (i, device.get('deviceName',
        "SMS to %s" % device.get('phoneNumber'))))

    device = click.prompt('Which device would you like to use?', default=0)
    device = devices[device]
    if not api.send_verification_code(device):
        print("Failed to send verification code")
        sys.exit(1)

    code = click.prompt('Please enter validation code')
    if not api.validate_verification_code(device, code):
        print("Failed to verify verification code")
        sys.exit(1)

Step 5: matching on First + Last with iCloud

Caveat: there are a number of my contacts who have appended titles, certifications etc to their lastName field in LinkedIn, such that I won’t be able to match them exactly against my cloud-based contacts.

I’m not even worried about this step, because I quickly got worried about…

Step 6: write to the iCloud contacts (?)

Here’s where I’m stumped: I don’t think the PyiCloud package has any support for non-GET operations against the iCloud Contacts service.  There appears to be support for POST in the Reminders module, but not in any of the other services modules (including Contacts).

So I sniffed the wire traffic in Chrome Dev Tools, to see what’s being done when I make an update to any iCloud.com contact.  There’s two possible operations: a POST method call for a new contact, or a a PUT method call for an update to an existing contact.

Here’s the Request Payload for a new contact:

{“contacts”:[{“contactId”:”2EC49301-671B-431B-BC8C-9DE6AE15D21D”,”firstName”:”Tony”,”lastName”:”Stank”,”companyName”:”Stark Enterprises”,”isCompany”:false}]}

Here’s the Request Payload for an update to that existing contact (I added homepage URL):

{“contacts”:[{“firstName”:”Tony”,”lastName”:”Stank”,”contactId”:”2EC49301-671B-431B-BC8C-9DE6AE15D21D”,”prefix”:””,”companyName”:”Stark Enterprises”,”etag”:”C=1432@U=afe27ad8-80ce-4ba8-985e-ec4e365bc6d3″,”middleName”:””,”isCompany”:false,”suffix”:””,”urls”:[{“label”:”HOMEPAGE”,”field”:”http://stark.com”}]}]}

There are four requests being made for either type of change to iCloud contacts (at least via the iCloud.com web interface that I am using as a model for what the code should be doing):

  1. https://p28-contactsws.icloud.com/co/contacts/card/
  2. https://webcourier.push.apple.com/aps
  3. https://p28-contactsws.icloud.com/co/changeset
  4. https://feedbackws.icloud.com/reportStats

Here’s the details for these calls when I create a new Contact:

  1. Request URL: https://p28-contactsws.icloud.com/co/contacts/card/?clientBuildNumber=16HProject79&clientId=63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA&clientMasteringNumber=16H71&clientVersion=2.1&dsid=197715384&prefToken=914266d4-387b-4e13-a814-7e1b29e001c3&syncToken=DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1426
    Request Payload: {“contacts”:[{“contactId”:”E2DDB4F8-0594-476B-AED7-C2E537AFED4C”,”urls”:[{“label”:”HOMEPAGE”,”field”:”http://apple.com”}],”phones”:[{“label”:”MOBILE”,”field”:”(212) 555-1212″}],”emailAddresses”:[{“label”:”WORK”,”field”:”johnny.appleseed@apple.com”}],”firstName”:”Johnny”,”lastName”:”Appleseed”,”companyName”:”Apple”,”notes”:”Dummy contact for iCloud automation experiments”,”isCompany”:false}]}
  2. Request URL: https://p28-contactsws.icloud.com/co/changeset?clientBuildNumber=16HProject79&clientId=63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA&clientMasteringNumber=16H71&clientVersion=2.1&dsid=197715384&prefToken=914266d4-387b-4e13-a814-7e1b29e001c3&syncToken=DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1427
  3. Request URL: https://webcourier.push.apple.com/aps?tok=bc3dd94e754fd732ade052eead87a09098d3309e5bba05ed24272ede5601ae8e&ttl=43200
  4. Request URL: https://feedbackws.icloud.com/reportStats
    Request Payload: {“stats”:[{“httpMethod”:”POST”,”statusCode”:200,”hostname”:”www.icloud.com”,”urlPath”:”/co/contacts/card/”,”clientTiming”:395,”uncompressedResponseSize”:14469,”region”:”OR”,”country”:”US”,”time”:”Wed Dec 28 2016 12:13:48 GMT-0800 (PST) (1482956028436)”,”timezone”:”PST”,”browserLocale”:”en-us”,”statName”:”contactsRequestInfo”,”sessionID”:”63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA”,”platform”:”desktop”,”appName”:”contacts”,”isLiteAccount”:false},{“httpMethod”:”POST”,”statusCode”:200,”hostname”:”www.icloud.com”,”urlPath”:”/co/changeset”,”clientTiming”:237,”uncompressedResponseSize”:2,”region”:”OR”,”country”:”US”,”time”:”Wed Dec 28 2016 12:13:48 GMT-0800 (PST) (1482956028675)”,”timezone”:”PST”,”browserLocale”:”en-us”,”statName”:”contactsRequestInfo”,”sessionID”:”63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA”,”platform”:”desktop”,”appName”:”contacts”,”isLiteAccount”:false}]}

I am 99% sure that the only request that actually changes the Contact data is the first one (https://p28-contactsws.icloud.com/co/contacts/card/), so I’ll ignore the other three calls from here on out.

Here’s the details of the first request when I edit an existing Contact:

Request URL: https://p28-contactsws.icloud.com/co/contacts/card/?clientBuildNumber=16HProject79&clientId=792EFA4A-5A0D-47E9-A1A5-2FF8FFAF603A&clientMasteringNumber=16H71&clientVersion=2.1&dsid=197715384&method=PUT&prefToken=914266d4-387b-4e13-a814-7e1b29e001c3&syncToken=DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1427
Request Payload: {“contacts”:[{“lastName”:”Appleseed”,”notes”:”Dummy contact for iCloud automation experiments”,”contactId”:”E2DDB4F8-0594-476B-AED7-C2E537AFED4C”,”prefix”:””,”companyName”:”Apple”,”phones”:[{“field”:”(212) 555-1212″,”label”:”MOBILE”}],”isCompany”:false,”suffix”:””,”firstName”:”Johnny”,”urls”:[{“field”:”http://apple.com”,”label”:”HOMEPAGE”},{“label”:”HOME”,”field”:”http://johnny.name”}],”emailAddresses”:[{“field”:”johnny.appleseed@apple.com”,”label”:”WORK”}],”etag”:”C=1427@U=afe27ad8-80ce-4ba8-985e-ec4e365bc6d3″,”middleName”:””}]}

So here’s what’s puzzling me so far: both the POST (create) and PUT (edit) operations include a contactId parameter.  Its value is the same from POST to PUT (i.e. I believe that means it’s referencing the same record).  When I create a second new Contact, the contactId is different than the contactId submitted in the Request Payload for the first new Contact (so it’s presumably not a dummy value).  And yet when I look at the request/response for the initial page load when I click “+” and “New Contact”, I don’t see a request sent from the browser to the server (so the server isn’t sending down a contactID – not at that moment at least – perhaps it’s cached earlier?).

Explained another way, this is how I believe the sequence works (based on repeated analysis of the network traffic from Chrome to the iCloud endpoint and back):

  1. User loads icloud.com, Contacts page (#contacts), clicks “+” and selects “New Contact”
    • Browser sends no request, but rather builds the New Contact form from cached code
  2. User adds data and clicks the Done button for the new Contact
    • Browser sends POST request to https://p28-contactsws.icloud.com/co/contacts/card/ with a bunch of form data on the URL, a whole raft of cookies and the JSON request payload [including contactId=x]
    • Server sends response
  3. User clicks Edit on that new contact, updates some data and clicks Done
    • Browser sends PUT request to https://p28-contactsws.icloud.com/co/contacts/card/ with form data, cookies and JSON request payload [including the same contactId=x]
    • Server sends response

So the question is: if I’m creating a net-new Contact, how does the web client get a valid contactId that iCloud will accept?  Near as I can figure, digging through the javascript-packed.js this page uses, this is the function that generates a UUID at the client:

Contacts.Contact = Contacts.Record.extend({
 primaryKey: "contactId",
 contactId: CW.Record.attr(String, {
 defaultValue: function() {
 return CW.upperCaseUUID()
 }
 })

Using this function (IIUC):

UUID: function() {
 var e = new Array(36),
 t = 0,
 n = ["8", "9", "a", "b"];
 if (window.crypto && window.crypto.getRandomValues) {
 var r = new Uint8Array(18);
 crypto.getRandomValues(r);
 for (t = 0; t < 18; t++) e[t * 2 + 1] = (r[t] >> 4).toString(16), e[t * 2] = (r[t] & 15).toString(16);
 e[19] = n[r[9] >> 6]
 } else {
 while (t < 36) e[t] = (Math.random() * 16 | 0).toString(16), t++;
 e[19] = n[Math.random() * 4 | 0]
 }
 return e[8] = e[13] = e[18] = e[23] = "-", e[14] = "4", e.join("")
 }

[Aside: I sincerely hope this is a standard library for UUID, not something Apple wrote themselves.  If I ever think that I’m going to need to generate iCloud-compatible UUIDs.]

Whoa – Pause

I need to take a step back and re-examine my goals and what I can specifically address.  I have learned a lot about both LinkedIn and iCloud, but I didn’t set out to recreate them, just find a way to make consistent use of the data I already have.

Parsing PDFs using Python

I’m part of a project that has a need to import tabular data into a structured database, from PDF files that are based on digital or analog inputs.  [Digital input = PDF generated from computer applications; analog input = PDF generated from scanned paper documents.]

These are the preliminary research notes I made for myself a while ago that I am now publishing for reference by other project members.  These are neither conclusive nor comprehensive, but they are directionally relevant.

I.E. The amount of work it takes code to parse structured data from analog input PDFs is a significant hurdle, not to be underestimated (this blog post was the single most awe-inspiring find I made).  The strongest possible recommendation based on this research is GET AS MUCH OF THE DATA FROM DIGITAL SOURCES AS YOU CAN.

Packages/libraries/guidance

Evaluation of Packages

Possible issues

  • Encryption of the file
  • Compression of the file
  • Vector images, charts, graphs, other image formats
  • Form XObjects
  • Text contained in figures
  • Does text always appear in the same place on the page, or different every page/document?

PDF examples I tried parsing, to evaluate the packages

  • IRS 1040A
  • 2015-16-prelim-doc-web.pdf (Bellingham city budget)
    • Tabular data begins on page 30 (labelled Page 28)
    • PyPDF2 Parsing result: None of the tabular data is exported
    • SCARY: some financial tables are split across two pages
  • 2016-budget-highlights.pdf (Seattle city budget summary)
    • Tabular data begins on page 15-16 (labelled 15-16)
    • PyPDF2 Parsing result: this data parses out
  • FY2017 Proposed Budget-Lowell-MA (Lowell)
    • Financial tabular data starts at page 95-104, then 129-130, 138-139
    • More interesting are the small breakouts on subsequent pages e.g. 149, 151, 152, 162; 193, 195, 197
    • PyPDF2 Parsing result: all data I sampled appears to parse out

Experiment ideas

  • Build an example PDF for myself with XLS tables, and then see what comes out when the contents are parsed using one of these libraries
  • Build a script that spits out useful metadata about the document: which app/library generated it (e.g. Producer, Creator), size, # of pages
  • Build another script to verify there’s a non-trivial amount of ASCII/Unicode text in the document (I.e. so we confirm it doesn’t have to be OCR’d)

Experiments tried

Update my Contacts with Python: thinking through the options

Alright folks, that’s the bell.  When LinkedIn stops thinking of itself as a professional contact manager, you know there’s no profit in it, and it’s time to manage this stuff yourself.

Problem To Solve

I’ve been hemming and hawing for a couple of years, ever since Evernote shut down their Hello app, about how to remember who I’ve met and where I met them.  I’m a Meetup junkie (with no rehab in sight) and I’ve developed a decent network of friends and acquaintances that make it easy for me to attend new events and conferences in town – I’ll always “know” someone there (though not always remember why/how I know them or even what their name is).

When I first discovered Evernote Hello, it seemed like the perfect tool for me – provided me a timeline view of all the people I’d met, with rich notes on all the events I’d seen them at and where those places were.  It never entirely gelled, it sporadically did and did NOT support business card import (pay for play mostly), and it was only good for those people who gave me enough info for me to link them.  Even with all those imperfections, I remember regularly scanning that list (from a quiet corner at a meetup/party/conference) before approaching someone I *knew* I’d seen before, but couldn’t remember why.  [Google Glasses briefly promised to solve this problem for me too, but that tech is off somewhere, licking its wounds and promising to come back in ten years when we’re ready for it.]

What other options do I have, before settling in to “do it myself”?

  • Pay the big players e.g. SalesForce, LinkedIn
    • Salesforce: smallest SKUs I could find @ $25/month [nope]
    • LinkedIn “Sales” SKU: $65/month [NOPE]
  • Get a cheap/trustworthy/likely-to-survive-more-than-a-year app
    • Plenty of apps I’ve evaluated that sound sketchy, or likely to steal your data, or are so under-funded that they’re likely to die off in a few months

Requirements

Do it myself then.  Now I’ve got a smaller problem set to solve:

  1. Enforce synchronization between my iPhone Contacts.app, the iCloud replica (which isn’t a perfect replica) and my Google Contacts (which are a VERY spotty replica).
    • Actually, let’s be MVP about this: all I *need* right now is a way of automating edits to Contacts on my iPhone.  I assume that the most reliable way of doing this is to make edits to the iCloud.com copy of the contact and let it replicate down to my phone.
    • the Google Contacts sync is a future-proofing move, and one that theoretically sounded free (just needed to flip a toggle on my iPhone profile), but which in practice seems to be built so badly that only about 20% of my contacts have ever sync’d with Google
  2. Add/update information to my contacts such as photos, “first met” context (who introduced, what event met at) and other random details they’ve confessed to me (other attempts to hook my memory) – *WITHOUT* linking my iPhone contacts with either LinkedIn or Facebook (who will of course forever scrape all that data up to their cloud, which I do *not* want to do – to them or me).

Test the Sync

How can I test my requirements in the cheapest way possible?

  • Make hand edits to the iCloud.com contacts and check that it syncs to the iPhone Contacts.app
    • Result: sync to iPhone within seconds
  •  Make hand edits to contacts in Contacts.app and check that it syncs to iCloud.com contact
    • Result: sync to iCloud within seconds

OK, so once I have data that I want to add to an iCloud contact, and code (Python for me please!) that can write to iCloud contacts, it should be trivial to edit/append.

Here’s all the LinkedIn Data I Want

Data that’s crucial to remembering who someone is:

  • Date we first connected on LinkedIn
  • Tags
  • Notes
  • Picture

Additional data that can help me fill in context if I want to dig further:

  • current company
  • current title
  • Twitter ID
  • Web site addresses
  • Previous companies

And metadata that can help uniquely identify people when reading or writing from other directories:

  • Email address
  • Phone number

How to Get my LinkedIn connection data?

OK, so (as of 2016-12-15 at 12:30pm PST) there’s three ways I can think of pulling down the data I’ve peppered into my LinkedIn connections:

  1. User Data Archive: request an export of your user data from LinkedIn
  2. LinkedIn API: request data for specified Connections using LinkedIn’s supported developer APIs
  3. Web Scraping: iterate over every Connection and pull fields via CSS using e.g. Beautiful Soup

User Data Archive

This *sounds* like the most efficient and straightforward way to get this data.  The “Relationship Section” announcement even implies that I’ll get everything I want:

If you want to download your existing Notes and Tags, you’ll have the option to do so through March 31, 2017…. Your notes and tags will be in the file named Contacts.

The initial data dump included everything except a Contacts.csv file.  The later Complete_LinkedInDataExport_12-16-2016 [ISO 8601 anyone?] included the data promised and nearly nothing else:

  • Connections.csv: First Name, Last Name, Email Address, Current Company, Current Position, Tags
  • Contacts.csv: First Name, Last Name, Email (mostly blank), Notes, Tags

I didn’t expect to get Picture, but I was hoping for Date First Connected, and while the rest of the data isn’t strictly necessary, it’s certainly annoying that LinkedIn is so friggin frugal.

Regardless, I have almost no other source for pictures for my professional contacts, and that is pretty essential for recalling someone I’ve met only a handful of times, so while helpful, this wasn’t sufficient.

LinkedIn API

The next most reliable way to attack this data is to programmatically request it.  However, as I would’ve expected from this “roach motel” of user-generated data, they don’t even support an API to request all Connections from your user account (merely sign-in and submit data).

Where they do make reference to user data, it’s in a highly-regulated set of Member Profile fields:

  • With the r_basicprofile permission, you can get first-name, last-name, positions, picture-url plus some other data I don’t need
  • With the r_emailaddress permission, you can get the user’s primary email address
  • For developers accepted into “Apply with LinkedIn”, and with the r_fullprofile permission, you can further get date-of-birth and member-url-resources
  • For those “Apply with LinkedIn” developers who have the r_contactinfo permssion, you can further get phone-numbers and twitter-accounts

After registering a new application, I am immediately given the ability to grant the following permissions to my app: r_basicprofile, r_emailaddress.  That’ll get me picture-url, if I can figure out a way to enumerate all the Connections for my account.

(A half-hour sorting through Chrome Dev Tools’ Network outputs later…)

Looks like there’s a handy endpoint that lets the browser enumerate pretty much all the data I want:

https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

That bears further investigation.

Web Scraping

While this approach doesn’t have the built-in restrictions with the LinkedIn APIs, there’s at least three challenges I can forsee so far:

  1. LinkedIn requires authentication, and OAuth 2.0 at that (at least for API access).  Integrating OAuth into a Beautiful Soup script isn’t something I’ve heard of before, but I’m seeing some interesting code fragments and tutorials that could be helpful, and it appears that the requests package can do OAuth 1 & 2.
  2. LinkedIn has helpfully implemented the “infinite scroll” AJAX behaviour on the Connections page.
    • There are ways to work with this behaviour, but it sure feels cumbersome – to the point I almost feel like doing this work by hand would just be faster.
  3. Navigating automatically to each linked page (each Connection) from the Connections page isn’t something I am entirely confident about
    • Though I imagine it should be as easy as “for each Connection in Connections, load the page, then find the data with this CSS attribute, and store it in an array of whatever form you like”.  The mechanize package promises to make the link navigation easy.

Am I Ready for This Much Effort?

It sure feels like there’s a lot of barriers in the way to just collecting the info I’ve accumulated in LinkedIn about my connections.  Would it take me less time to just browse each connection page and hand copy/paste the data from LinkedIn to iCloud?  Almost certainly.  To together a Beautiful Soup + requests + various github modules solution would probably take me 20-30 hours I’m guessing, from all the reading and piecing together code fragments from various sources, to debugging and troubleshooting, to making something that spits out the data and then automatically uploads it without mucking up existing data.

Kinda takes the fun out of it that way, doesn’t it?  I mean, the “glory” of writing code that’ll do something I haven’t found anyone else do, that’s a little boost of ego and all.  Still, it’s hard to believe this kind of thing hasn’t been solved elsewhere – am I the only person with this bad of a memory, and this much of a drive to keep myself from looking like Leonard Shelby at every meetup?

What’s worse though, for embarking on this thing, is that I’d bet in six months’ time, LinkedIn and/or iCloud will have ‘broken’ enough of their site(s) that I wouldn’t be able to just re-use what I wrote the first time.  Maintenance of this kind of specialized/unique code feels pretty brutal, especially if no one else is expected to use it (or at least, I don’t have any kind of following to make it likely folks will find my stuff on github).

Still, I don’t think I can leave this itch entirely unscratched.  My gut tells me I should dig into that Contacts API first before embarking on the spelunking adventure that is Beautiful Soup.

How Do I Know What Success Looks Like?

I was asked recently what I do to ensure my team knows what success looks like.  I generally start with a clear definition of done, then factor usage and satisfaction into my evaluation of success-via-customers.

Evaluation Schema

Having a clear idea of what “done” looks like means having crisp answers to questions like:

  • Who am I building for?
    • Building for “everyone” usually means it doesn’t work well for anyone
  • What problem is it fixing for them?
    • I normally evaluate problems-to-solve based on the new actions or decisions the user can take *with* the solution that they can’t take *without* it
  • Does this deliver more business value than other work we’re considering?
    • Delivering value we can believe in is great, and obviously we ought to have a sense that this has higher value than the competing items on our backlog

What About The Rest?

My backlog of “ideas” is a place where I often leave things to bake.  Until I have a clear picture in my mind who will benefit from this (and just as importantly, who will not), and until I can articulate how this makes the user’s life measurably better, I won’t pull an idea into the near-term roadmap let alone start breaking it down for iteration prioritization.

In my experience there are lots of great ideas people have that they’ll bring to whoever they believe is the authority for “getting shit into the product”.  Engineers, sales, customers – all have ideas they think should get done.  One time my Principal Engineer spent an hour talking me through a hyper-normalized data model enhancement for my product.  Another time, I heard loudly from many customers that they wanted us to support their use of MongoDB with a specific development platform.

I thanked them for their feedback, and I earnestly spent time thinking about the implications – how do I know there’s a clear value prop for this work?

  • Is there one specific user role/usage model that this obviously supports?
  • Would it make users’ lives demonstrably better in accomplishing their business goals & workflows with the product as they currently use it?
  • Would the engineering effort support/complement other changes that we were planning to make?
  • Was this a dealbreaker for any user/customer, and not merely an annoyance or a “that’s something we *should* do”?
  • Is this something that addresses a gap/need right now – not just “good engineering that should become useful in the future”?  (There’s lots of cool things that would be fun to work on – one time I sat through a day-long engineering wish list session – but we’re lucky if we can carve out a minor portion of the team’s capacity away from the things that will help right now.)

If I don’t get at least a flash of sweat and “heat” that this is worth pursuing (I didn’t with the examples mentioned), then these things go on the backlog and they wait.  Usually the important items will come back up, again and again.  (Sometimes the unimportant things too.)  When they resurface, I test them against product strategy, currently-prioritized (and sized) roadmap and our prioritization scoring model, and I look for evidence that shows me this new idea beats something we’re already planning on doing.

If I have a strong impression that I can say “yes” to some or all of these, then it also usually comes along with a number of assumptions I’m willing to test, and effort I’m willing to put in to articulate the results this needs to deliver [usually in a phased approach].

Delivery

At that point we switch into execution and refinement mode – while we’ve already had some roughing-out discussions with engineering and design, this is where backlog grooming hammers out the questions and unknowns that bring us to a state where (a) the delivery team is confident what they’re meant to create and (b) estimates fall within a narrow range of guesses [i.e. we’re not hearing “could take a day, could take a week” – that’s a code smell].

Along the way I’m always emphasizing what result the user wants to see – because shit happens, surprises arise, priorities shift, the delivery team needs a solid defender of the result we’re going to deliver for the customer.  That doesn’t mean don’t flex on the details, or don’t change priorities as market conditions change, but it does mean providing a consistent voice that shines through the clutter and confusion of all the details, questions and opinions that inevitably arise as the feature/enhancement/story gets closer to delivery.

It also means making sure that your “voice of the customer” is actually informed by the customer, so as you’re developing definition of Done, mockups, prototypes and alpha/beta versions, I’ve made a point of taking the opportunity where it exists to pull in a customer or three for a usability test, or a customer proxy (TSE, consultant, success advocate) to give me their feedback, reaction and thinking in response to whatever deliverables we have available.

The most important part of putting in this effort to listen, though, is learning and adapting to the feedback.  It doesn’t mean rip-sawing in response to any contrary input, but it does mean absorbing it and making sure you’re not being pig-headed about the up-front ideas you generated that are more than likely wrong in small or big ways.  One of my colleagues has articulated this as Presumptive Design, whereby your up-front presumptions are going to be wrong, and the best thing you can do is to put those ideas in front of customers, users, proxies as fast and frequently as possible to find out how wrong you are.

Evaluating Success

Up front and along the way, I develop a sense of what success will look like when it’s out there, and that usually takes the form of quantity and quality – useage of the feature, and satisfaction with the feature.  Getting instrumentation of the feature in place is a brilliant but low-fidelity way of understanding whether it was deemed useful – if numbers and ratios are high in the first week and then steadily drop off the longer folks use it, that’s a signal to investigate more deeply.  The user satisfaction side – post-hoc surveys, customer calls – to get a sense of NPS-like confidence and “recommendability” are higher-fidelity means of validating how it’s actually impacting real humans.

This time, success: Flask-on-AWS tutorial (with advanced use of virtualenv)

Last time I tried this, I ended up semi-deliberately choosing to use Python 3 for a tutorial (I didn’t realize quickly enough) was built around Python 2.

After cleaning up my experiment I remembered that the default python on my MacBook was still python 2.7.10, which gave me the idea I might be able to re-run that tutorial with all-Python 2 dependencies.  Or so it seemed.

Strangely, the first step both went better and no better than last time:

Mac4Mike:flask-aws-tutorial mike$ virtualenv flask-aws
Using base prefix '/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5'
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3.5
Also creating executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Installing setuptools, pip, wheel...done.

Yes it didn’t throw any errors, but no it didn’t use the base Python 2 that I’d hoped.  Somehow the fact that I’ve installed Python 3 on my system is still getting picked up by virtualenv, so I needed to dig further into how virtualenv can be used to truly insulate from Python 3.

Found a decent article here that gave me hope, and even though they punted to using the virtualenvwrapper scripts, it still clued me in to the virtualenv parameter “-p”, so this seemed to work like a charm:

Mac4Mike:flask-aws-tutorial mike$ virtualenv flask-aws -p /usr/bin/python
Running virtualenv with interpreter /usr/bin/python
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Installing setuptools, pip, wheel...done.

This time?  The requirements install worked like a charm:

Successfully installed Flask-0.10.1 Flask-SQLAlchemy-2.0 Flask-WTF-0.10.3 Jinja2-2.7.3 MarkupSafe-0.23 PyMySQL-0.6.3 SQLAlchemy-0.9.8 WTForms-2.0.1 Werkzeug-0.9.6 argparse-1.2.1 boto-2.28.0 itsdangerous-0.24 newrelic-2.74.0.54

Then (since I still had all the config in place), I ran pip install awsebcli and skipped all the way to the bottom of the tutorial and tried eb deploy:

INFO: Deploying new version to instance(s).                         
ERROR: Your requirements.txt is invalid. Snapshot your logs for details.
ERROR: [Instance: i-01b45c4d01c070555] Command failed on instance. Return code: 1 Output: (TRUNCATED)...)
  File "/usr/lib64/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/opt/python/run/venv/bin/pip install -r /opt/python/ondeck/app/requirements.txt' returned non-zero exit status 1. 
Hook /opt/elasticbeanstalk/hooks/appdeploy/pre/03deploy.py failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
INFO: Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
ERROR: Unsuccessful command execution on instance id(s) 'i-01b45c4d01c070555'. Aborting the operation.
ERROR: Failed to deploy application.

This kept barfing over and over until I remembered that the target environment was still configured for Python 3.4.  Fortunately or not, you can’t change major versions of the platform – so back to eb init I go (with the -i parameter to re-initialize).

This time around?  The command eb deploy worked like a charm.

Lesson: be *very* explicit about your Python versions when messing with someone else’s code.  [Duh.]