Stop Counting Production Incidents
“What can be counted doesn’t always count, and not everything that counts can be counted.” — William Bruce Cameron
Aviation incidents and fatalities go down over time. The chart above tells us that much. It also tells us something else: fatalities per incident are all over the place! Just measuring the number of incident reports would obscure this very important point.
High-performing Internet service teams all have one thing in common: an effective incident management and blameless post-mortem process. What inevitably follows such a process is unfortunate — someone will propose what seems like common sense: counting incident reports as a measure of product quality.
Facebook is radically opposed to counting incident reports — or at least was when my presence graced its halls. It seemed like one of Jay Parikh’s top tasks as the head honcho of the infrastructure team was to tell people to not count incident reports — over and over and over again. When a dashboard or a report inevitably surfaced that did so, he’d somehow magically find out about it in a terrifyingly short amount of time.
If you were the culprit, your phone would reproduce Messenger’s signature “PING!” sound shortly thereafter. You’d look down at the notification. “Oh Shit.” It’s motherfuckin’ Jay Parikh himself, with a cordial-but-firm request to “please take it down.” It is only natural to read an implicit “or else” tacked on the end.
But why so radical? There are three major reasons as to why this seemingly good idea is so perilous.
1. It’s a Cop-Out
There are effective ways to measure quality of service, and counting incidents is definitely not one of them. It’s so tempting though, because the data is already there, just ripe for the picking. It’s so simple: more incidents bad, fewer incidents good!
The trouble is that it becomes an excuse to put off measuring product quality with metrics that reflect the actual customer experience. It takes a considerable amount of energy to derive these, so it’s quite tempting to take a shortcut. Choosing the right customer experience metrics depends entirely on the minutiae of a particular product or service. It takes deep understanding of a product and a keen insight into what its customers value.
For the backend service behind a web application, a quality metric which directly correlates to customer experience might be something like the frequency of unrecoverable request errors observed at the client. Customers should never encounter these kinds of errors. If they do, it signals a real experience issue. This kind of metric is particularly powerful because it synthesizes a wide range of potential causes.
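As a rough illustration of that kind of metric, here is a minimal sketch in Python. The `ClientRequest` record and its fields are hypothetical; a real client would log whatever its own telemetry supports. The assumption is that a request counts as unrecoverable when it still failed after the client exhausted its retries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientRequest:
    """One request as observed at the client (hypothetical schema)."""
    succeeded: bool  # True if the client eventually got a usable response
    retries: int     # retries attempted before succeeding or giving up


def unrecoverable_error_rate(requests: list[ClientRequest]) -> float:
    """Fraction of requests that failed even after the client's retries."""
    if not requests:
        return 0.0  # no traffic observed, so nothing failed
    failed = sum(1 for r in requests if not r.succeeded)
    return failed / len(requests)


sample = [
    ClientRequest(succeeded=True, retries=0),
    ClientRequest(succeeded=True, retries=2),   # recovered via retry: not counted
    ClientRequest(succeeded=False, retries=3),  # customer actually saw an error
    ClientRequest(succeeded=True, retries=0),
]
rate = unrecoverable_error_rate(sample)
```

Note that the request which succeeded after two retries does not count against the metric. The customer never saw a failure, which is the whole point of measuring at the client.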
The higher-ups do often want quality rolled up into an aggregate score per product area. It may seem a little lazy, but their job is to constantly make sure people are working on improvements where most needed. I’ve written up a quick sketch for doing that exact thing.
2. Perverse Incentives
An excerpt from Harvard Business Review:
It can’t be that simple, you might argue — but psychologists and economists will tell you it is. Human beings adjust behavior based on the metrics they’re held against. Anything you measure will impel a person to optimize his score on that metric. What you measure is what you’ll get. Period.
The evidence that behavior inevitably conforms to measurement is catastrophically overwhelming. Measuring the number of incident reports will cause fewer incident reports to be filed.
I can hear it already: “but, but — our company values are honesty and collaboration, so it’s different for us.” I guess there are quite a few Chemtrail believers out there, despite similarly staggering evidence to the contrary.
The fundamental goal of a well-orchestrated incident management and post-mortem process is to learn and improve. An incident report is exactly that — a report. The process is only as useful as the reports. In general, fewer reports equals a less useful process. It’s actually best to encourage people to file a report for any verifiable serious incident, as soon as possible.
And while incidents do need to be assigned a severity level, the purpose of this exercise is to clearly communicate the scope of human response that is appropriate during an incident. If a team is being measured by how many incident reports are filed of a given severity level, they will inevitably think twice about filing or escalating. A hesitation caused by the chilling effect of measurement is not what should happen inside the pressure cooker of a production incident.
3. Not Actually Useful
Incident report frequency does not correlate in any meaningful way to the success or failure of an organization. And given that it’s conventional for the severity level to represent the impact high-water mark during an incident, the severity level itself isn’t useful for accurately measuring total impact. So a ten-hour SEV2 isn’t necessarily any better or worse than a ten-minute SEV2; the severity level alone is ambiguous.
However, there are some incredibly useful metrics to gather during the incident reporting process. Here are four universal data points everyone should gather in their incident reports:
- When the impact began
- When the team became aware of the problem
- When the incident impact was mitigated
- How the team became aware of the problem (machine alert, employee report, customer report, or The Front Page of the New York Times)
These data points can be used to derive the following metrics:
- Delay between impact and awareness
- Delay between awareness and mitigation (aka MTTR)
- Proactiveness of the response
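To make the derivation concrete, here is a minimal sketch of how the four data points could roll up into those three metrics. The `IncidentReport` fields and the source labels are hypothetical names, and treating only machine alerts as “proactive” is an assumption; a team might reasonably count employee reports as well.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

# Assumption: only machine alerts count as proactive detection.
PROACTIVE_SOURCES = {"machine_alert"}


@dataclass(frozen=True)
class IncidentReport:
    """The four data points from an incident report (hypothetical schema)."""
    impact_began: datetime
    team_aware: datetime
    impact_mitigated: datetime
    detected_via: str  # "machine_alert", "employee_report", "customer_report", "press"


def detection_delay(report: IncidentReport) -> timedelta:
    """Delay between impact and awareness."""
    return report.team_aware - report.impact_began


def mitigation_delay(report: IncidentReport) -> timedelta:
    """Delay between awareness and mitigation."""
    return report.impact_mitigated - report.team_aware


def summarize(reports: list[IncidentReport]) -> dict[str, float]:
    """Average the per-incident delays and compute the proactive share."""
    if not reports:
        raise ValueError("need at least one incident report")
    return {
        "mean_detection_minutes": mean(
            detection_delay(r).total_seconds() for r in reports) / 60,
        "mean_mitigation_minutes": mean(
            mitigation_delay(r).total_seconds() for r in reports) / 60,
        "proactive_fraction": sum(
            r.detected_via in PROACTIVE_SOURCES for r in reports) / len(reports),
    }


reports = [
    IncidentReport(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
                   datetime(2024, 1, 1, 10, 35), "machine_alert"),
    IncidentReport(datetime(2024, 1, 2, 12, 0), datetime(2024, 1, 2, 12, 15),
                   datetime(2024, 1, 2, 12, 25), "customer_report"),
]
summary = summarize(reports)
```

None of these figures depends on how many reports were filed, so filing more reports only makes the averages more trustworthy.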
Now these are metrics that are useful for driving improvements. Any service team which improves these metrics will deliver higher quality… guaranteed. While an accurate damage assessment is essential to each report, that data is generally not comparable across incidents.
“But Rick, didn’t you just go on about the perverse incentives created by measurement? Won’t it incentivize people to fudge these figures?” First of all, try to avoid working with liars. There is a massive ethical difference between thinking twice about the appropriateness of filing an incident report and lying about an objective fact.
Second, an important part of the post-mortem review is verifying the report’s data points against the available data. This isn’t some kind of Kafkaesque “trust but verify” paranoia. It’s about the integrity of the process. People make mistakes. They’re probably exhausted from the incident and want to get on with their lives. Even in the most extreme cases, incident reports are relatively rare, so ensuring the correctness of each and every report is essential if improvement is genuinely the goal.
The highest incident severity level should be reserved for situations that roughly equate to “everything is fucked.” At this level of impact, nobody is going to hesitate to file.
You can probably think of a few alerts that reliably indicate this kind of situation. One of the most actionable alerts at Facebook was the dreaded egress drop. I can’t recall a single time an egress drop alert was a false positive.
The network team tracked the global aggregate throughput of every edge router. If this figure suddenly dropped substantially, and I mean something like 50%, the “everything is fucked” threshold had been reached. While a human still had to file a report, this was mostly a formality. Every single time it was a SEV1, the worst possible situation — all hands on deck.
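For a sense of how simple such an alert can be, here is a hypothetical sketch. It is not how Facebook implemented it; the function names, the median baseline, and the sampling scheme are all assumptions. The only detail taken from the description above is the rough 50% threshold on aggregate throughput.

```python
from statistics import median


def egress_drop_fraction(baseline_samples: list[float], current: float) -> float:
    """How far current aggregate egress sits below the recent baseline (0.0 to 1.0)."""
    if not baseline_samples:
        raise ValueError("need at least one baseline sample")
    baseline = median(baseline_samples)  # median resists one-off spikes
    if baseline <= 0:
        return 0.0  # no meaningful baseline to compare against
    return max(0.0, (baseline - current) / baseline)


def is_egress_drop(baseline_samples: list[float], current: float,
                   threshold: float = 0.5) -> bool:
    """True when throughput has fallen by at least `threshold` of the baseline."""
    return egress_drop_fraction(baseline_samples, current) >= threshold


# Aggregate throughput across all edge routers, in arbitrary units.
recent = [100.0, 102.0, 98.0, 101.0, 99.0]
everything_is_broken = is_egress_drop(recent, current=45.0)
normal_dip = is_egress_drop(recent, current=90.0)
```

The threshold is deliberately blunt. An alert that only fires on a drop this large trades sensitivity for near-zero false positives, which is what makes it so actionable.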
And if the highest severity level in your process doesn’t equate to “everything is fucked” then it’s probably time to adjust the levels to make it so.
These are incidents that truly should never occur. There’s nowhere to hide, so no amount of measurement is going to create a chilling effect on the reporting mechanism. Even the Chief of Police himself, Jay Parikh, once cited the frequency of SEV1s to justify a shift in company priorities.
I’ll Leave You With This
The value of incident reporting is in the data gathered.
Getting back to measuring quality though, it’s helpful to remember that incident reports are ultimately data captured during the worst of times. One can acknowledge the inevitability of incidents and the value of learning from them, while also accepting that the vast majority of quality improvements will be found outside this process. It shouldn’t take an incident to improve quality.
Incidents ought to be sacrosanct, so don’t plunder this sacred process. Don’t create an incentive to cheapen it. Stop counting production incidents!