Lean Into Those “Single Points of Failure”
“You should go all in on your single points of failure,” I say with obnoxiously casual confidence, moments before incredulous glares dart my way.
In 2015, Amazon’s DynamoDB database service suffered an outage of several hours in its US East region, with major ripple effects. Queueing (SQS), auto-scaling for compute (EC2), and metrics (CloudWatch) services within AWS were severely impacted because their core functionality depended on DynamoDB. There was much weeping and gnashing of teeth.
It seems reasonable that Amazon would then try to reduce dependency on DynamoDB in the future, but they did exactly the opposite! So why on Earth would they want to double down on what some would call a Single Point of Failure?
After all, no one wants a Single Point of Failure (SPOF) in their system. The history of the Internet is littered with stories about failures caused by SPOFs and the various technologies developed to mitigate them. You definitely don’t want to play the leading role in any one of these failure stories.
A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working.
That is Wikipedia’s definition, and it is a fairly obvious one. It really comes down to how “part of a system” is defined, though. The Wikipedia article goes on to talk about redundancy at various levels: should it be disk drives? machines? entire datacenters? This isn’t helping reduce the ambiguity!
First, I think we all agree that a single hard drive or machine shouldn’t cause systemic failure. That’s definitely worth remediating. But dwelling on it can be a distraction, because solving this category of SPOF has become mostly trivial in modern times. There are myriad solutions. Pick the ones that suit the product and architecture.
Recently though, I’ve repeatedly heard entire fault-tolerant systems called SPOFs: Kubernetes, S3, ZooKeeper, Consul, DynamoDB, Cassandra. It’s really people showing me their scar tissue. Always-on technology was the promise, but instead it failed in spectacular ways.
The common thread running through these stories is largely underinvestment: half-hearted adoption at the start, then slow abandonment after the first scare.
Them: “We tried to use Kubernetes for this one thing, but it was a disaster.”
Me: “Why?”
Them: “Oh it just stopped scheduling containers for some reason. To be honest, we didn’t really have time to look into it.”
The resulting design-by-incident approach tends to yield objectively worse systems. Technology is avoided based on one or two bad experiences. The “SPOF” label becomes a pejorative. Inevitably it spirals downward into Frankenstein systems with their various “backup” components possessing roughly identical functionality.
The best-performing services do exactly the opposite. They are full of SPOFs. The Internet? All in on DNS. Google’s service deployments? All managed by Borg. AWS? Their higher-level services are largely built with S3 and DynamoDB. All of Instagram’s user data is stored in the TAO database service; there is no alternative.
The power of leaning into these so-called SPOFs comes in two phases. First, it incentivizes investment. Systems fail in unanticipated ways. That is unavoidable. Substituting another technology as a quick fix is Magical Thinking, but integrating the learnings from each failure hardens the system for the future. The more failure, the more learnings, the more hardened the system becomes.
Second, what emerges over time is even more of an asset: a point of leverage. When a new problem appears, a solution only needs to be applied in one place. In 2018, AWS announced that all data in DynamoDB had been and would forever be encrypted at rest. All the services built on DynamoDB immediately received the benefit. This is incredibly powerful.
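To make the leverage idea concrete, here is a rough sketch in Python. Everything in it is made up for illustration: SharedStore, QueueService, MetricsService and toy_cipher are hypothetical names, not AWS APIs, and the XOR "cipher" is only a placeholder for real encryption. The point is simply that when every service goes through one store, a change made inside that store reaches all of them without touching their code.

```python
from typing import Callable, Dict


def identity(data: bytes) -> bytes:
    """Default transform: store bytes unchanged."""
    return data


def toy_cipher(data: bytes, key: int = 0x5A) -> bytes:
    """Placeholder for real encryption (XOR is NOT secure).

    Applying it twice with the same key returns the original bytes.
    """
    return bytes(b ^ key for b in data)


class SharedStore:
    """Hypothetical shared storage layer that every service depends on."""

    def __init__(
        self,
        encode: Callable[[bytes], bytes] = identity,
        decode: Callable[[bytes], bytes] = identity,
    ) -> None:
        self._disk: Dict[str, bytes] = {}
        self._encode = encode
        self._decode = decode

    def put(self, key: str, value: bytes) -> None:
        # The at-rest transform is applied here, in exactly one place.
        self._disk[key] = self._encode(value)

    def get(self, key: str) -> bytes:
        return self._decode(self._disk[key])

    def raw(self, key: str) -> bytes:
        """What is physically stored, for inspection only."""
        return self._disk[key]


class QueueService:
    """Hypothetical service built on the shared store."""

    def __init__(self, store: SharedStore) -> None:
        self._store = store

    def enqueue(self, job_id: str, payload: bytes) -> None:
        self._store.put(f"queue/{job_id}", payload)

    def peek(self, job_id: str) -> bytes:
        return self._store.get(f"queue/{job_id}")


class MetricsService:
    """A second hypothetical service on the same store."""

    def __init__(self, store: SharedStore) -> None:
        self._store = store

    def record(self, name: str, value: bytes) -> None:
        self._store.put(f"metrics/{name}", value)

    def read(self, name: str) -> bytes:
        return self._store.get(f"metrics/{name}")


if __name__ == "__main__":
    # Turning on the transform in the store covers both services at once.
    store = SharedStore(encode=toy_cipher, decode=toy_cipher)
    queue, metrics = QueueService(store), MetricsService(store)
    queue.enqueue("job-1", b"resize image")
    metrics.record("cpu", b"0.42")
    print(queue.peek("job-1"), metrics.read("cpu"))
```

Neither service class knows or cares whether the store transforms data at rest. Had each service kept its own "backup" storage, the same change would have to be made, tested and maintained once per copy.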
So even though it is tempting to put a permanent curse on the technology that wronged you, you may need to do exactly the opposite. Understand what’s driving failure and you will likely find that some additional investment will prevent future mishaps. Then push for fewer technologies in the stack — it’ll make investment easier to justify, resulting in a more battle-hardened system over time.
“All the wood behind one arrow” — Scott McNealy (CEO, Sun Microsystems)
In short, embrace your SPOFs. Put everything you’ve got behind them. You’ll build more reliable systems because of it.