February 27, 2018
I understand the frustration of the
Crowley says that they've recently started to see scalability problems with the old way of operations, however, which led Slack to create a secondary on-call rotation full of developers; software and performance bugs, he says, are becoming much more common than low-level infrastructure problems - bugs that only the development teams know how to fix.
To me, it's a no-brainer: if the root cause of most incidents
Over the past decade, I've primarily worked on small, fast-moving development teams. I've always valued my time away from the office and believe our developers should too. Developers have always been a part of these on-call rotations, and I haven't hated the experience. Below are five traits I've seen from teams that have a healthy relationship with being on-call:
Your hip startup office walls may be covered with posters of the famous Facebook "Move fast and break things" slogan. What you might not know - Facebook changed that motto to "Move fast with stable infrastructure".
That's a lot less of a catchy, poster-worthy slogan, isn't it?
On my teams, we've prioritized a calm system over everything else. You can't have everything, so this means we'll prioritize things like clear rollback instructions in PRs, logging, error handling, and database query edge cases over code syntax issues.
In a conversation with my dad when our kids were in the baby stage, he remarked that the hardest part was changing diapers. I was shocked. To my wife and me, it was clearly the lack of sleep.
Our kids are just past that stage, but I've already forgotten the feeling of being constantly tired. I can see how that feeling would fade for my father.
Performance incidents on fast-moving, small teams are like this: when an incident occurs, it's important to implement a fix while the pain is still top-of-mind. Otherwise, you'll forget the pain and will be far less motivated to address the problem. Ensure developers know they can - and should - bump resolving future stability woes to the top of the list.
It cites the example of the aviation industry's approach to process, which enables remarkable creativity under stressful conditions through mental automation of routine operations.
My teams have all seen a similar pattern: you can't push forward on the product when it feels like things are constantly falling down around you and there isn't an established checklist for handling incidents. Systematizing one thing frees up the mind for creative work.
The Google SRE Book advocates for an error budget:
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
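To make the idea concrete, here is a minimal sketch of the arithmetic behind a time-based error budget. It assumes availability is measured in minutes of downtime and that a quarter is 90 days; the function names are hypothetical, and a real budget is often counted in failed requests rather than minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 90) -> float:
    """Return the minutes of downtime an availability target allows.

    `slo` is the availability target as a fraction (0.999 means 99.9%).
    `window_days` is the budget window; 90 days is an assumed quarter.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 90) -> float:
    """Return how many minutes of budget are left (negative means overspent)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # Compare a few common targets over the assumed 90-day quarter.
    for target in (0.99, 0.999, 0.99999):
        print(f"{target:.5f} -> {error_budget_minutes(target):.1f} minutes allowed")
```

Under these assumptions, a 99.9% target leaves roughly 130 minutes per quarter to spend on risky changes, while five nines leaves a little over one minute.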
Few services really require five nines of uptime. High availability stifles forward progress as many issues are triggered by changes to
All of the above falls down if developers don't have empathy for their fellow team members. At some point, everyone makes a change that requires a colleague to
If engineering teams are prioritizing stable systems, it's very likely that repeat incidents aren't occurring. Instead, you'll be paged on new and slightly different versions of problems. These issues can be difficult to identify in typical dashboard-driven monitoring systems, as the root cause gets lost on overview charts. Identifying these outliers - and the conditions that trigger them - is a focus of our APM product at Scout. I'm also a fan of Honeycomb, which is designed for solving high-cardinality problems.
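As a rough illustration of why overview charts hide these problems, the sketch below compares an overall average with a per-customer breakdown. The field names (`customer_id`, `duration_ms`), the sample data, and the `slowest_groups` helper are all hypothetical; this is not how Scout or Honeycomb work internally, just the general idea of slicing by a high-cardinality field.

```python
from collections import defaultdict
from statistics import mean


def slowest_groups(requests, key, threshold_ms):
    """Group request durations by `key` and return (group, mean) pairs
    whose mean duration exceeds `threshold_ms`, slowest first."""
    durations = defaultdict(list)
    for request in requests:
        durations[request[key]].append(request["duration_ms"])
    averages = {group: mean(values) for group, values in durations.items()}
    flagged = [(group, avg) for group, avg in averages.items() if avg > threshold_ms]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample: 98 customers with fast requests, one with slow ones.
requests = [{"customer_id": f"c{i}", "duration_ms": 100} for i in range(98)]
requests += [
    {"customer_id": "c_big", "duration_ms": 4000},
    {"customer_id": "c_big", "duration_ms": 6000},
]

# The overview number looks healthy; the breakdown exposes the outlier.
overall_mean = mean(r["duration_ms"] for r in requests)
outliers = slowest_groups(requests, key="customer_id", threshold_ms=1000)
```

Here the overall mean stays under 200 ms even though one customer is waiting several seconds per request, which is exactly the kind of signal an aggregate chart flattens.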
I've heard on-call horror stories at teams large and small. If you're looking at joining a new team, you should ask about how on-call is