The Push Train
notes date: 2017-11-03
source links:
source date: 2017-05-01
- Advised lots of companies on continuous delivery
- thought you could take tools from one place where the process works well (Etsy) and drop them into other environments
- but it turns out not to work well
- CD is not just tools
- This talk describes what it’s like to live inside of a group with working CD.
- Things that made it work
- Dedication to making every engineer effective
- Goal of shipping lots of code fast and safely
- Community Responsibility
- It’s mostly about orchestrating humans
- Built up to about 40-60 deploys per day. This started when the company had dozens of deployers and was sustained up to about 150 deployers. It probably stops working at some point above that.
- Integration tests
- They’re hard to maintain. Web pages nowadays are always stateful and session-based, so concurrency is a real issue and sporadic test failures are a reality
- “Tests are one way to gain some confidence that a change is safe. But that’s all they are. Just one way.”
- Ramping up code very gradually in production is another way.
- Deploying code in smaller and smaller pieces is another way.
- Deploying code so that users can’t see it, and then using feature flags to test it in production, is another way.
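
A minimal sketch of the "ramp up gradually behind a flag" idea from the bullets above. This is not Etsy's tooling: the flag table `FLAGS`, the flag names, and the helpers `bucket`, `is_enabled`, and `render_search` are all hypothetical, and hashing the user ID into 100 buckets is just one common way to get a stable percentage rollout.

```python
import hashlib

# Hypothetical flag table: flag name -> percentage of users who see the feature.
# 0 means the code is deployed but invisible; 100 means fully ramped up.
FLAGS = {"new_checkout": 0, "new_search": 10, "new_header": 100}


def bucket(flag, user_id):
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, flags=FLAGS):
    """True if this user falls inside the flag's current rollout percentage."""
    percent = flags.get(flag, 0)  # unknown flags stay off
    return bucket(flag, user_id) < percent


def render_search(user_id):
    # Both code paths ship in the same deploy; the flag picks one at runtime.
    if is_enabled("new_search", user_id):
        return "new search"
    return "old search"
```

Because a user's bucket never changes, raising a flag from 10 to 50 only adds users to the new path, so the ramp can be watched in monitoring and rolled back by lowering the number.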
- Prioritize monitoring and reactive capacity
- “You may have tests, but you must have monitoring. You can write an infinite number of tests and only asymptotically approach zero problems in production.”
- Manage how source control is used
- Git was created to support the Linux kernel, where many versions are supported concurrently
- “A website, on the other hand, doesn’t really have a version at all. It has a current state.”
- Branch in code, not in revision control: “We eventually came to a realization that there was no point to the rituals we were performing with revision control. It’s better to conceive of a website as a living organism where all of the versions of it are jumbled together”
- The fast path is the blessed path
- “If the deploy tooling isn’t made fast, there’s probably a faster and more dangerous way to do things and people will do that.”
- New problem once deploys are fast: deploys are no longer special occasions; people even wander away
- Fully automated deploys take this too far
- “It’s entirely possible to automate the whole deployment pipeline. In fact a lot of people think that that is what continuous deployment and/or delivery is.”
- “Let me give you the following analogy. A while ago Uber yolo’d a self-driving car trial in downtown San Francisco. It ended abruptly right after a video surfaced of one rolling right through a pedestrian intersection during a red light.”
- “It’s important to note that there was a human sitting in the driver’s seat, but that person didn’t intervene.”
- “Uber blamed that person for the incident. But that’s the wrong way to look at it. The automation was capable enough that the human’s attention very understandably lapsed, but not capable enough to replace the human.”
- “The human and the car are, together, the system. Things you do to automate the car affect the human.”
- “One must exercise good taste when it comes to automation.”
- Design systems to keep humans engaged
- Another problem that arises with fast deploys: people skip steps
- you want a web cockpit that can control deployments, not a CLI tool
- keeping the control tools separate from the awareness tools enables action without awareness
- situational awareness
- what’s in this deploy (tag or committers or commits?)
- how long since last deploy
- what’s the status of your tests (running, succeeded, or failed)
- related changes happening now/recently?
- incidents in progress
- the monitoring
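
One way to picture the situational-awareness list above is as a single status record rendered right next to the deploy control. The sketch below is an assumption, not a description of Etsy's cockpit: `DeployStatus`, its field names, and `notices` are hypothetical, and the notes do not say whether the real tool blocks a deploy or only displays this information.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeployStatus:
    """Hypothetical snapshot shown beside the deploy button."""
    commits: List[str]              # what is in this deploy
    committers: List[str]           # who owns those commits
    minutes_since_last_deploy: int  # how long since the last deploy
    test_state: str                 # "running", "succeeded", or "failed"
    open_incidents: List[str] = field(default_factory=list)


def notices(status):
    """Return the things a deployer should see before acting."""
    reasons = []
    if not status.commits:
        reasons.append("nothing to deploy")
    if status.test_state != "succeeded":
        reasons.append(f"tests are {status.test_state}")
    if status.open_incidents:
        reasons.append("incident in progress: " + ", ".join(status.open_incidents))
    return reasons
```

The point of the design is placement rather than logic: the same page that holds the control also holds the awareness, so acting without seeing the notices takes deliberate effort.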
- With fast deploys: people who might need to react to your deploys might not know they’re happening
- “Releasing code at my first job was a nightmare in every respect. You’d show up on Saturday morning, and spend several days on it. It sucked.”
- “But one criticism you couldn’t make of this is that nobody knew it was happening. Brutality isn’t a great ethos, but it is at least an ethos.”
- At some point, you can’t deploy faster but you keep getting more engineers
- Approaches:
- Split up deployables (microservices!), which is not cost-free
- Find common deployment patterns that are safest, and fast-track these
- Dark code
- code changes that are not activated immediately on deploy, but sometime later (via changes of some control flag)
- mark changesets that are entirely made of dark code
- dark changesets can safely be bundled into deploys with other changesets; in theory the operator is shipping only their own code
- Bundle multiple non-dark changesets into single deploys
- when deploying 2 or more changesets that will be active immediately on deploy, get their committers to coordinate on monitoring their respective changes
- Etsy wrote an IRCbot
- keeps a queue of changesets ready to go out
- splits the queue into deploys, each of a fixed maximum number of changesets (each of these deploys is a train car/push train)
- Coordinate the current deployment among the owners of its changesets.
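
A rough sketch of the queueing behaviour described above: changesets wait in a queue, the queue is cut into deploys of a fixed maximum size, and the authors of changes that go live on deploy are the ones who need to coordinate on monitoring. Everything here is illustrative: `Changeset`, `MAX_PER_CAR`, `split_into_cars`, and `live_authors` are hypothetical names, the limit of 3 is an arbitrary assumption, and the real bot ran over IRC, which this sketch leaves out.

```python
from dataclasses import dataclass

MAX_PER_CAR = 3  # assumed limit; the notes only say "a fixed maximum number"


@dataclass
class Changeset:
    author: str
    revision: str
    dark: bool = False  # True if nothing in it activates on deploy


def split_into_cars(queue, max_per_car=MAX_PER_CAR):
    """Cut the queue, in order, into deploys of at most max_per_car changesets."""
    return [queue[i:i + max_per_car] for i in range(0, len(queue), max_per_car)]


def live_authors(car):
    """Authors of non-dark changesets in one deploy, in queue order, no repeats."""
    seen = []
    for changeset in car:
        if not changeset.dark and changeset.author not in seen:
            seen.append(changeset.author)
    return seen


queue = [
    Changeset("alice", "r101"),
    Changeset("bob", "r102", dark=True),
    Changeset("carol", "r103"),
    Changeset("dave", "r104"),
]
cars = split_into_cars(queue)
```

With this queue, the first car carries three changesets, and `live_authors(cars[0])` names alice and carol as the two people who should watch the graphs together; bob's dark changeset rides along without adding to that group.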
- Concluding Themes
- “The tendency once you’ve programmed yourself into a serious hole is to keep programming. Maybe you should stop trying to program your way out of difficult situations.”