5 Things SaaS Companies Should Know About Observability
Don’t fall into the trap of believing that because you’ve tied data together in a single pane of glass you now have observability.
1. Observability Is Not Observability…Is Not Observability
Observability is a trendy but confusing term these days. It seems to have been applied to just about anything to put a fresh spin on old ideas. A quick Google search will return results for Data Observability, Unified Observability, Edge Observability, Cloud-Native Observability, Security Observability…and much more.
Ignore all of that b/s. As a SaaS company, you should be obsessed with understanding the experience you offer your customers and the root cause of issues that may cause them to churn. Observing precisely this is the only thing that matters.
The term observability dates back to control systems theory in the 1960s, when a guy called Rudolf Kalman defined observability as “the ability to measure the internal states of a system by examining its outputs.”
Hence, if you can collect the digital exhaust fumes – i.e. telemetry data – from your application and its supporting infrastructure, then you should be able to determine the health of your entire ecosystem. That means you can investigate and determine the root cause of issues your customers may be experiencing. Fix those in a timely fashion and you’re golden.
2. The Mythical Single Pain of Glass
For as long as I can remember, management types have been obsessed with their single pane of glass. The dream went something like this: “Imagine if I had a single screen which showed me everything I needed to keep a watchful eye on and let me know immediately if there was an issue.”
After the executive decree, person-years were then sacrificed in the lower echelons of the organization to find the data that needed to be displayed, and then more person-years were sacrificed on colors and pretty chart types to make it as attractive as possible. After all, management only really pays attention to pretty charts.
Dashboards – no matter how pretty – typically reflect what happened in the past, not what will happen in the future. This is especially true with modern applications. With new code running in production every day you can expect to see new and unknown issues. These issues will almost certainly not be reflected on any dashboard.
As a SaaS company, Mean Time To Resolution (MTTR) on issues is critical – your reputation and your customer experience depends on it. You may have no other option but to spend time and energy on producing pretty dashboards to keep executives happy, but don’t fall into the trap of believing that because you’ve tied data together in a single pane of glass you now have observability.
3. It’s About the Investigation, Stupid
So, if observability isn’t about bringing together snippets of information from your disparate logs, metrics, and traces into a single pane of glass, then what is it about?
When a new issue is discovered, most of the time to resolution is spent investigating it and determining the root cause. With modern applications, things are even harder because new code is released into production weekly – or even daily. That means new and unpredictable issues will become commonplace in production.
The key to a speedy investigation is context. When working with a recent customer we discovered that almost half their troubleshooting time was spent correlating event data. Various engineering teams would analyze the issue at hand from their specific vantage point, using their favorite tool, then post screenshots of their findings on a Slack channel. Next, someone – and every company has one – with a big brain would come along to eyeball the screenshots and use their intimate knowledge of the application to determine exactly where the issue was.
This is crazy, but sadly ‘state of the art’. Correlating event data to provide more context in an investigation should be done in software, not in the brain of the smartest person in the room. This way of investigating is faster, can be performed by less experienced engineers, and leads to more predictable outcomes.
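In practice, “done in software” can be as simple as joining events from different tools on a shared identifier and sorting them into a single timeline. Here’s a minimal sketch in Python; the field names (trace_id, service, timestamp) are assumptions – use whatever identifier your logs and traces actually share.

```python
# A minimal sketch of correlating event data in software rather than by eyeball.
# Field names (service, trace_id, timestamp, message) are illustrative; adapt
# them to whatever your logs and traces actually emit.
from collections import defaultdict
from datetime import datetime

def correlate(events, key="trace_id"):
    """Group events from different tools by a shared identifier."""
    grouped = defaultdict(list)
    for event in events:
        if key in event:
            grouped[event[key]].append(event)
    # Order each group chronologically so the investigation reads as a timeline.
    for group in grouped.values():
        group.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
    return grouped

# Example: events exported from three different tools, stitched into one timeline.
events = [
    {"service": "api",      "trace_id": "abc123", "timestamp": "2023-05-01T10:00:01", "message": "POST /checkout 500"},
    {"service": "payments", "trace_id": "abc123", "timestamp": "2023-05-01T10:00:00", "message": "card gateway timeout"},
    {"service": "db",       "trace_id": "abc123", "timestamp": "2023-05-01T09:59:59", "message": "slow query: 4.2s"},
]
for trace_id, timeline in correlate(events).items():
    print(trace_id)
    for e in timeline:
        print(f"  {e['timestamp']}  {e['service']:<9} {e['message']}")
```

It’s a toy, but the point stands: once the correlation lives in code, any engineer can run it, not just the person with the biggest mental model of the application.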
As a SaaS company, MTTR is critical, and with new code in production each day, it will be put to the test each day. Being able to reduce MTTR by speeding up investigations – and doing so without expert help – is key to scaling your operation AND keeping your customers happy.
4. Pricing Can Suck Less
Let’s face it, no one likes paying. The question is whether what you’re paying is commensurate with the value you receive.
Most companies today have a hodge-podge of tools for monitoring, log analytics, APM, and so on. As a result, they find themselves increasingly frustrated with the cost of those tools, because as their volume of telemetry data rises 40% each year, their bill goes up (at least) 40%. But that’s okay, because at least the collection of tools they’ve amassed doesn’t deliver observability. Wait, what?!
Remedies for this situation often include adding a dashboard to pull together data from the aforementioned collection of tools to convince management that they have observability. Or perhaps buying a ‘Data Observability’ tool to filter the telemetry data sent to each tool, thereby lowering the cost of not having observability.
The problem is that legacy tools have legacy architectures, and legacy architectures can’t keep up with the investigative demands of modern apps or the sheer growth in data volumes.
5. There’s No Free Lunch
Observability is the new kid on the block, and regardless of your starting point, there is going to be work involved. That said, how you approach your journey to observability is going to determine exactly how much work you take on.
If you’re just starting, congratulations, you have what’s known as the stereotypical “green field” and can freely choose your style of instrumentation. Increasingly popular these days is OpenTelemetry, which promises to be a vendor-neutral standard that allows you to retain your instrumentation investment regardless of which observability tool you pick now, or in the future.
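To make that concrete, here’s roughly what a bare-bones OpenTelemetry tracing setup looks like in Python. It’s a sketch, not a prescription: the span names and attributes are illustrative, and you’d swap the console exporter for an OTLP exporter pointed at whichever backend you eventually choose – the instrumentation itself stays the same.

```python
# A minimal OpenTelemetry tracing sketch in Python (pip install opentelemetry-sdk).
# Span names and attributes below are illustrative, not a required convention.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Vendor-neutral setup: only the exporter changes if you switch backends later.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str):
    # Each unit of work becomes a span; attributes carry the context you'll
    # want during an investigation.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_card"):
            pass  # call the payment gateway here

handle_checkout("ord-42")
```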
However, if you have an existing code base then the chances are that you’ve already instrumented your code with logs (most likely), custom metrics (somewhat likely), and perhaps even implemented a distributed tracing library (least likely).
Either way, you have your work cut out for you. A good rule of thumb is to start with an observability platform that allows you to collect as much data as possible from as many sources as possible, and that lets you store it cheaply — easier said than done.
In terms of what data you should start ingesting, logs are a great place to start. Once you and your organization feel confident that you’re getting all the observability value you need from your logs, you can move on to metrics, and then, ideally, tracing later down the road.
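If you do start with logs, it pays to make them structured from day one so they can be correlated in software later (see point 3). Here’s a minimal sketch using Python’s standard logging module; the field names are illustrative, not a required schema.

```python
# A minimal sketch of structured (JSON) logging with the standard library.
# Field names (customer_id, request_id) are illustrative assumptions.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry structured context passed via `extra=` so logs can be joined
        # on shared identifiers during an investigation.
        for key in ("customer_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("billing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("invoice generated", extra={"customer_id": "cust-17", "request_id": "req-9f3"})
```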