Every engineer at Hipmunk can deploy code to production. This hasn't always been the case, and enabling it required us to overhaul our deploy system to make it safe and easy to use. The new system has been in production for just over a year now.
A Bit of History
A year ago, the process for deploying involved poking someone who had permission to deploy. Deploying was a complex and error-prone process that involved invoking several scripts and reciting the correct black magic incantation. Because the process was so complex, we only gave permission to deploy to people who knew it well. And since these people were typically our most experienced engineers, we lost a lot of their very valuable time.
This quickly became a bottleneck in our dev process. Getting code shipped required a song and dance where a deployer would ask if anyone needed anything deployed while a bunch of engineers clamored to get their patches out. Engineers got frustrated with having to ask deployers to ship their code and deployers got tired of spending much of their day deploying.
The old process looked like this:
- Ping chat room asking for a deploy and hope someone's available
- Deployer collects list of pull requests ready to be deployed
- Deployer squashes and merges each pull request into master
- Deployer kicks off the deploy script
Building the New System
There were several problems with the old system.
- Frustration on both sides: engineers who weren't able to deploy and deployers who had to.
- A diminished sense of ownership, which could lead people to think of releasing code as someone else's problem.
- Deploying required access to production, giving us another reason to limit access.
- The master branch contained unproven code during deploys. If someone pulled from master during a deploy and a commit later had to be reverted, they would have to untangle their branch.
- Dealing with issues that came up while deploying required prior knowledge of how to fix them. This made onboarding new deployers difficult.
- New code was rolled out to all servers at once, so if something broke, it broke badly and across the board.
- Steps of the deploy process could easily be forgotten. Sometimes you would have to purge the CDN, sometimes you needed to update a job's configuration, etc.
We set out to build a new system that would address all of these shortcomings. The primary objective was that anyone should be able to deploy. To get there, we had to make the system dead simple to use and make errors easy to recover from.
How It Works
The new system was built by enhancing gyp. We were inspired by the way GitHub did it and took a similar approach with some modifications for our particular needs.
Here's the play-by-play:
- A developer queues up their pull request to be deployed via gyp (pronounced with a hard G).
- This will schedule a "deploy train" which will depart automatically in a few minutes.
- If other people queue up pull requests around the same time, they will be added to the deploy train. A risk limit is enforced to prevent deploys from getting too big.
- When the train is ready to leave, a Jenkins job is kicked off.
- Jenkins merges everything into a release branch, builds the static assets, packages up the app, and delivers the code to our servers. The code is not yet live.
- A message is automatically sent to our Slack channel to let the authors know that the code is ready to be deployed.
- The authors may approve or abort the deploy via gyp. The deploy will auto-abort after a certain amount of time.
- If the green light is given, the new code goes live on a small set of servers (our canary cluster).
- It will stay there for a short while so that devs can monitor it and verify things aren't breaking.
- If everything looks good, it's rolled out to all our servers.
- Finally, we merge the release branch into master and reset the deploy queue.
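The steps above can be read as a small state machine: a train collects pull requests until it departs, then moves through build, canary, and full rollout, with an abort possible along the way. Here is a minimal sketch of that idea in Python. It is an illustration only, not the actual gyp code: the class names, the stage names, the per-pull-request risk score, and the default risk limit of 10 are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Stage names are hypothetical; they mirror the steps in the list above.
    BOARDING = "boarding"   # still accepting pull requests
    BUILT = "built"         # packaged and delivered, but not yet live
    CANARY = "canary"       # live on the canary cluster only
    DEPLOYED = "deployed"   # live on all servers
    ABORTED = "aborted"


@dataclass
class PullRequest:
    number: int
    author: str
    risk: int  # assumed risk score; the post does not say how risk is measured


class RiskLimitExceeded(Exception):
    """Raised when adding a pull request would make the deploy too big."""


@dataclass
class DeployTrain:
    risk_limit: int = 10  # assumed value
    stage: Stage = Stage.BOARDING
    pull_requests: list = field(default_factory=list)

    @property
    def total_risk(self):
        return sum(pr.risk for pr in self.pull_requests)

    def board(self, pr):
        """Add a pull request while the train is still boarding."""
        if self.stage is not Stage.BOARDING:
            raise RuntimeError("train has already departed")
        if self.total_risk + pr.risk > self.risk_limit:
            raise RiskLimitExceeded(f"PR #{pr.number} would exceed the risk limit")
        self.pull_requests.append(pr)

    def depart(self):
        """Build and deliver the release; the code is not live yet."""
        self._advance(Stage.BOARDING, Stage.BUILT)

    def approve(self):
        """Authors give the green light; code goes live on the canary cluster."""
        self._advance(Stage.BUILT, Stage.CANARY)

    def roll_out(self):
        """Canary looks healthy; roll out to all servers."""
        self._advance(Stage.CANARY, Stage.DEPLOYED)

    def abort(self):
        """Abort at any point before the full rollout has finished."""
        if self.stage in (Stage.DEPLOYED, Stage.ABORTED):
            raise RuntimeError(f"cannot abort from stage {self.stage.value}")
        self.stage = Stage.ABORTED

    def _advance(self, expected, next_stage):
        if self.stage is not expected:
            raise RuntimeError(
                f"expected stage {expected.value}, but train is {self.stage.value}"
            )
        self.stage = next_stage
```

Making each transition check the current stage means a command issued at the wrong time fails loudly instead of skipping a step, which is the property that lets anyone run a deploy safely.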
There are some pretty neat features here that are worth highlighting.
Each stage of the deploy process sends a message to Slack so that we can easily monitor it. These Slack messages describe what's going on and call out the commands that are available at that stage. The exact command is included so you can copy it and drop it straight into a terminal. No need to guess what command to run or how to run it.
You can see the exact commands needed at each step of the process.
We use Slack @mentions to ping people when they're needed. This could be to approve the release, let people know that their changes are live on the canary cluster, or even to notify them the deploy is done so they can queue something up next.
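To make the idea of stage messages concrete, here is a rough sketch in Python of how a message with @mentions and copy-pasteable commands might be assembled. Everything in it is an assumption for illustration: the stage names, the function name, and especially the gyp subcommands, which are invented placeholders because the real command syntax isn't shown in this post. It only builds the message text and does not call the Slack API.

```python
# Hypothetical commands available at each stage. The real gyp syntax is not
# given in this post, so these strings are placeholders.
STAGE_COMMANDS = {
    "ready": ["gyp deploy approve", "gyp deploy abort"],
    "canary": ["gyp deploy rollout", "gyp deploy abort"],
    "done": [],
}


def format_stage_message(stage, description, authors):
    """Build the text of a stage notification.

    The first line @mentions the authors and describes what is happening;
    each following line is an indented command that can be copied as-is.
    """
    mentions = " ".join(f"@{name}" for name in authors)
    lines = [f"{mentions} {description}".strip()]
    for command in STAGE_COMMANDS.get(stage, []):
        lines.append(f"    {command}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_stage_message("ready", "Release is built and ready to go live.", ["alice", "bob"]))
```

Keeping the commands in one table per stage means the message can never advertise a command that isn't valid at that point in the deploy.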
We use GitHub labels to trigger different actions at the end of a deploy. For example, pull requests labeled with "Frontend" will automatically purge our CDN. Pull requests labeled "Update Rundeck" will automatically update Rundeck (our job scheduler).
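One simple way to picture the label mechanism is a lookup table from label name to post-deploy action. The Python sketch below assumes that design. The two label names come from the examples above, but the function names are hypothetical and the action bodies are stand-ins that just return a description instead of talking to a CDN or to Rundeck.

```python
def purge_cdn():
    # Stand-in for a real CDN purge call.
    return "purged CDN"


def update_rundeck():
    # Stand-in for pushing job definitions to Rundeck.
    return "updated Rundeck"


# Map GitHub label names to the action they trigger at the end of a deploy.
LABEL_ACTIONS = {
    "Frontend": purge_cdn,
    "Update Rundeck": update_rundeck,
}


def post_deploy_actions(labels_per_pull_request):
    """Return each triggered action once, in first-seen order.

    `labels_per_pull_request` is a list with one list of label names per
    pull request in the deploy. Labels with no registered action are ignored.
    """
    actions = []
    for labels in labels_per_pull_request:
        for label in labels:
            action = LABEL_ACTIONS.get(label)
            if action is not None and action not in actions:
                actions.append(action)
    return actions


if __name__ == "__main__":
    deploy_labels = [["Frontend"], ["Frontend", "Bug"], ["Update Rundeck"]]
    for action in post_deploy_actions(deploy_labels):
        print(action())
```

Deduplicating the actions matters when several pull requests ride the same train: two "Frontend" pull requests should still purge the CDN only once.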
We have several improvements we'd like to make in the future. We want to beef up our canary cluster to eliminate some current limitations regarding testing frontend changes. We want to add some automatic health checks as there's still a lot of developer monitoring required. And longer-term we'd like tighter integration with Slack so devs don't have to swap back and forth between the terminal and Slack. But overall we're very happy with the current state of things, especially when looking back and comparing it to what we had.