DevOps status report: HackOregon 2019 season

One of my colleagues on the HackOregon project this year sent around a link with the note “Nice post on infrastructure as code and what getting solid infra deploys in place can unlock”: https://www.honeycomb.io/blog/treading-in-haunted-graveyards/

I felt immediately compelled to respond, saying:

Provocative thinking, and we are well on our way I’d say.

I’ve been the DevOps lead for HackOregon for three years now, more often than not delivering 80% of the infrastructure each year – the CI/CD pipeline, the automation scripts for standardizing and migrating configuration and data into the AWS layers, and the troubleshooting and white-glove onboarding of each project’s teams where they touch the AWS infrastructure.

There are great people to work with too – on the occasions when they’ve got the bandwidth to help debug some nasty problem, or to see what I’ve been too bleary-eyed to notice is getting in our way, it’s been gratifying to pair up and work these challenges through to a workable (if not always elegant) solution.

My two most important guiding principles on this project have been:

  • Get project developers productive as soon as possible – ensure they have a Continuous Deployment pipeline that gets their project into the cloud and lets them see that it works, so they can quickly tell when a future commit breaks it
  • “working > good > fast” – get something working first, make it “good” (remove the hard-coding, the quick workarounds) second, then make it automated, reusable and documented

We’re married pretty solidly to the AWS platform, and to a CloudFormation-based orchestration model.  It’s evolved (slowly) over the years, as we’ve introspected the AWS Labs EC2 reference architecture, and as I’ve pulled apart the pieces of that stack one by one and repurposed that architecture to our needs.
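
To give a feel for the shape of that orchestration, here’s a minimal sketch of the nested-stack pattern that model is built on – the bucket URL, template names and parameters below are illustrative, not our actual values:

    AWSTemplateFormatVersion: "2010-09-09"
    Description: Minimal sketch of a nested-stack master template (names illustrative)

    Parameters:
      EnvironmentName:
        Type: String
        Default: hacko-integration

    Resources:
      # Each infrastructure layer lives in its own template, composed here as nested stacks
      VPC:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/example-bucket/infrastructure/vpc.yaml
          Parameters:
            EnvironmentName: !Ref EnvironmentName

      ECSCluster:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/example-bucket/infrastructure/ecs-cluster.yaml
          Parameters:
            EnvironmentName: !Ref EnvironmentName
            # Assumes the vpc.yaml template exports an output named VpcId
            VPC: !GetAtt VPC.Outputs.VpcId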

Getting our CloudFormation templates to a place where we can launch an entirely separate test instance of the whole stack was a huge step forward from “welp, we always gotta debug in prod”. That goal was met about a month ago, and the stack went from “mysterious and murky” to “tractably refactorable and extensible”.
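
The change that makes a parallel stack possible is treating the environment name as a parameter and threading it through every resource name and export, so nothing collides. A minimal sketch of the idea (names hypothetical):

    AWSTemplateFormatVersion: "2010-09-09"
    Description: Sketch of environment-scoped naming so a test copy can run beside the main stack

    Parameters:
      EnvironmentName:
        Type: String
        Default: hacko-test        # the main stack might use e.g. "hacko-integration"

    Resources:
      # Every named resource carries the environment prefix, so two stacks never collide
      Cluster:
        Type: AWS::ECS::Cluster
        Properties:
          ClusterName: !Sub "${EnvironmentName}-cluster"

    Outputs:
      ClusterName:
        Value: !Ref Cluster
        Export:
          Name: !Sub "${EnvironmentName}-cluster-name"

Launching the test copy is then just a second aws cloudformation deploy (or create-stack) under a different stack name with a different EnvironmentName value.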

Stage two was digging deep enough into the graveyard to understand how the ECS parts fit together, so that we could swap EC2 for Fargate on a container-by-container basis. That was a painful transition but ultimately paid off – we’re well on our way, and can now add containerised tasks without also having to juggle a whole lot of maintenance of the EC2 boxes that are a velocity-sapping drag on our progress.
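
The per-container change is small once the plumbing is understood: a Fargate task declares FARGATE compatibility, awsvpc networking and task-level CPU/memory, and the service switches its launch type to match. A rough sketch (cluster, subnets, names and sizes all illustrative):

    AWSTemplateFormatVersion: "2010-09-09"
    Description: Sketch of one containerised task moved from EC2 to Fargate (names illustrative)

    Parameters:
      Cluster:
        Type: String                      # existing ECS cluster name
      Subnets:
        Type: List<AWS::EC2::Subnet::Id>  # where the Fargate tasks get their network interfaces

    Resources:
      TaskDefinition:
        Type: AWS::ECS::TaskDefinition
        Properties:
          Family: example-api
          RequiresCompatibilities: [FARGATE]   # was [EC2]
          NetworkMode: awsvpc                  # Fargate only supports awsvpc networking
          Cpu: "256"                           # task-level sizing replaces EC2 instance sizing
          Memory: "512"
          ContainerDefinitions:
            - Name: example-api
              Image: nginx:alpine              # placeholder image
              PortMappings:
                - ContainerPort: 80

      Service:
        Type: AWS::ECS::Service
        Properties:
          Cluster: !Ref Cluster
          LaunchType: FARGATE                  # was EC2
          TaskDefinition: !Ref TaskDefinition
          DesiredCount: 1
          NetworkConfiguration:
            AwsvpcConfiguration:
              AssignPublicIp: ENABLED
              Subnets: !Ref Subnets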

Stage three has been refactoring our ECS service templates – from a spray of copypasta hard-coded replicas that (a) had to be curated by hand (much like our previous years’ containerised APIs had to be maintained one at a time), and (b) buried the lede on what unique configuration was being used in each service – into a single standardised template used by whole families of containerised tasks. Any of the goofy bits you need to know ahead of deploying the next container are now obvious and all in one place, the single master.yaml.
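
The shape of that refactor looks roughly like the excerpt below – one generic service template, instantiated once per API, with only the per-service differences passed in as parameters (every name and value here is illustrative, not our actual configuration):

    # Excerpt of a master.yaml Resources section along these lines: every API uses
    # the same hypothetical fargate-service.yaml, configured entirely by parameters
    Resources:
      TransportationService:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/example-bucket/services/fargate-service.yaml
          Parameters:
            ServiceName: transportation-api
            ContainerPort: "8000"
            ImageTag: "2019.05.18"

      HousingService:
        Type: AWS::CloudFormation::Stack
        Properties:
          TemplateURL: https://s3.amazonaws.com/example-bucket/services/fargate-service.yaml
          Parameters:
            ServiceName: housing-api
            ContainerPort: "8001"
            ImageTag: "2019.05.12"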

I can’t speak for everyone, but I’ve been pretty slavish about pushing all CF changes to the repo in branches and merging when the next round of stable/working infra has been reached. There’s always room for improvement, however:

  • smaller changes are always better
  • we could use more folks who are trained and comfortable with the complex orchestration embedded in our infrastructure-as-code
  • which would mean being able to conduct good reviews before merge-to-master
  • I’d be interested in how we can automate validation of the infrastructure changes tied to each commit (though that would require more than a single mixed-use environment) – a sketch of a cheap first step follows this list.
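
On that last point, even with only one environment, a cheap first step is to lint and validate every template on each push. A hypothetical CI config along these lines (a sketch, not our current pipeline; paths are illustrative) would catch a lot of mistakes before they ever reach CloudFormation:

    # Hypothetical .travis.yml-style sketch: lint and validate templates on each push
    language: python
    python:
      - "3.6"
    install:
      - pip install cfn-lint awscli
    script:
      # Static linting needs no AWS credentials
      - cfn-lint cloudformation/*.yaml
      # API-level validation needs read-only AWS credentials and a region configured in CI settings
      - aws cloudformation validate-template --template-body file://cloudformation/master.yaml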

Next up for us are tasks like:

  • refactoring all the containers into a separate stack (out of master.yaml)
  • parameterising the domains used for ALB routing
  • separating production assets from the development/staging environment
  • separating a core infra layer from the staging vs production side-by-side assets
  • refactoring the IAM provisions in our deployment (policies and attached roles)
  • pulling in more of the coupled resources such as DNS, certs and RDS into the orchestration source-controlled code
  • monitoring and alerting for real-time application health (not just infra-delivery health)
  • deploying *versioned* assets automatically (not just :latest, which becomes hard to trace backwards) and version-locking the known-good production configuration each time it stabilises – see the sketch after this list
  • upgrading all the 2017 and 2018 APIs to current deployment compatibility (looking for help here!)
  • assessing orchestration tech to address gaps or limitations in our current tools (e.g. YAML vs. JSON or TOML, pre-deploy validation, CloudFormation vs. Terraform vs. Kubernetes)
  • better use of tagging?
  • more use of delegated IAM permissions to certain pieces of the infra?
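
On the versioned-assets item above: the minimal change is to stop baking :latest into the task definitions and make the image tag a template parameter that each deploy pins to an explicit version. A sketch, with hypothetical names throughout:

    # Sketch: the image tag becomes a parameter instead of a hard-coded :latest
    Parameters:
      ImageTag:
        Type: String
        Default: latest            # overridden at deploy time with an explicit version or git SHA

    Resources:
      TaskDefinition:
        Type: AWS::ECS::TaskDefinition
        Properties:
          Family: example-api
          RequiresCompatibilities: [FARGATE]
          NetworkMode: awsvpc
          Cpu: "256"
          Memory: "512"
          ContainerDefinitions:
            - Name: example-api
              Image: !Sub "example-org/example-api:${ImageTag}"
              PortMappings:
                - ContainerPort: 8000

The deploy would then pass something like --parameter-overrides ImageTag=2019.06.01 (or a git SHA), and whatever value production stabilises on can be committed back as the version-locked known-good configuration.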

This snapshot of where we’re at doesn’t capture the full journey of all the late nights, painful rabbit holes and miraculous epiphanies.