Reaching Your Full Stack Observability Goal
It’s a familiar story… the system has failed somewhere, monitors are going off here and there, and the incident bridge gets started. More people get pulled in with every minute the bridge runs, and everyone starts to remember that old story about describing an elephant in the dark. Six teams are trying to troubleshoot the issue, or rule themselves out, and each of them is using a different tool and sharing screenshots on Slack or Teams. The whole process goes nowhere until the right set of subject matter experts can correlate each tool’s data and determine the root cause. Surely this isn’t the state of the art? Millions of dollars spent on software and people, and none of it talks to the rest until the sirens start wailing… surely there is a better way?
At Observe, we talk all the time about “Big O” Observability, that target state at which a user can pivot across data sets and solve problems at speed. But if we look at where a lot of folks are starting their journey, it’s back at “I’m drowning in logs and I don’t even know what traces are for!” or “my metrics tool costs too much, what can I do?” It’s not that Observability is right or wrong for this sort of organization, it’s that the immediate problems are blocking sight of the strategic goal. Let’s dive into what that goal looks like and some ideas for getting there.
What is Observability and What’s Your Ideal State?
There’s a lot of discussion on the Internet about what Observability or Security Observability is and isn’t, but maybe that’s the wrong question to start with. Sure, you need an Observability approach to handle modern applications simply because of the sheer volume of data: modern applications on modern infrastructure can produce orders of magnitude more telemetry than legacy solutions were designed for. But that’s not the only reason to pursue Observability. The big value proposition of Observability is getting operational and security insight into systems without having to design that insight in when you write the software. The goal isn’t just storing logs and metrics, it’s using them. Used properly, Observability can help answer questions that organizations have struggled with for decades: What’s broken, where, and how?
Now, let’s take a step back and break our description of Nirvana into parts. Perfect Observability looks a little different for every organization, but some items in this list should resonate with your team:
- One data lake, One console, One language. The cardinal failing of legacy Observability tools is clear: their vendors started from logs or metrics or traces, then bought or built something else to fill in the other pillars. Maybe there’s a little user experience glue to make those separate products look like fellow travelers, but at the end of the day, it’s not for real. Even if they’ve managed to put a single experience over a pile of federated data systems, they’re still dealing with a least common denominator: different compute models, different data retention rules, different assumptions about little things like unit sizes, resource identification, naming conventions, and the behavior of a NULL.
- Separated compute and storage costs. There’s an implicit assumption in some license models that all data is equally valuable, but it’s just not true. Machine exhaust data can be very valuable when you need it, but sometimes it’s just… not. So you might end up redirecting some of it to another home entirely and losing all your context, or forcing a cumbersome hot-warm-cold hierarchical storage stack on your DevOps teams and security analysts. At the end of the day, if an observability system encourages you to put your data where you can’t use it, maybe that system isn’t good enough for today’s requirements.
- Focus on User Experience. An easy trap for Observability-focused products is to reinvent the problem as a brand-new user experience, built on an unfamiliar mindset and radical new technology. The real world doesn’t work like that; by definition, an Observability product needs to support teams that are expecting a log search tool, an APM tool, a metrics monitoring tool, or a security threat hunting tool. Done right, Observability has the flexibility to meet those multiple teams, with their multiple goals, where they are now. The goal is to make their outcomes better by introducing new benefits, not to ask users to start over.
Observability doesn’t mean “spin your swivel chair between federated systems.” Even if your upstream vendor has found a way to sell log search, metrics monitoring, and APM in a one-size-fits-all trenchcoat, it’s not going to make those incident bridges go any better.
How to Achieve Full Stack Observability
Getting things done looks similar everywhere, and this won’t feel too different from other projects. You know best how to navigate your organization to help your team; this blog just offers some ideas and patterns in the hope that they might assist.
Locate the City on the Hill
Start with a compelling vision that can motivate your teammates to help you achieve the goal. Maybe it’s a six-pager or a BHAG, maybe it’s a mockup or a slide deck, or just a spreadsheet tab with a projected cost model. The how is less important than the what, which might be something like “in the future, incident bridges will conclude in under an hour.”
A goal like that is compelling, and lets people point to what is blocking them from getting there. Maybe the data isn’t where they’re looking, because it’s already rolled off to historical storage? Maybe knowledgeable people don’t have a search tool or dashboard where they can find what they are looking for? Is it possible to drill from symptom to cause, or do your teammates have to spend hours devising and executing experiments to even determine a cause?
A compelling goal helps everyone share a vision: what you all want from Observability, why it’s valuable, and what’s blocking you from achieving it.
Picking the Right Observability Data
The next step is to get the data you need to support your team’s dream tool. Your Observability journey may have started with exactly that in mind, or with a completely different goal, so you’ll need to assess whether you’re already collecting what you need. Working one application at a time, look at the services that support it and ensure that the data needed to troubleshoot is coming into Observe. For instance, you might have a stack like this:
- Kubernetes: Your service orchestrator layer could be taking care of collecting your application’s troubleshooting data and telemetry directly, or might be able to support a daemonset that will do collection for you.
- Infrastructure: Your IaaS provider (e.g. AWS, Azure, or GCP) produces huge amounts of visibility, which can be easy to collect but tough to filter.
- Jenkins: To get into the infrastructure, your changes probably need to make it through testing; maybe the fixes you’re expecting are actually hung up here?
- GitHub: What’s been changed, when, and by whom? (A minimal sketch of pulling exactly that follows this list.)
- Jira: What was the motivation for that change, and is a customer or project linked to it?
- Salesforce: Is that linked customer on your critical wins list?
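To make the “what changed, when, and by whom” question concrete, here’s a minimal sketch that pulls the commits landed during an incident window from the GitHub REST API. The `acme/checkout-service` repository is a hypothetical stand-in, and the token is assumed to live in a `GITHUB_TOKEN` environment variable; adjust both for your own setup.

```python
import os
from datetime import datetime, timedelta, timezone

import requests  # pip install requests

# Hypothetical repository; substitute the service you're troubleshooting.
OWNER, REPO = "acme", "checkout-service"

# Look back over the hour leading up to the incident.
until = datetime.now(timezone.utc)
since = until - timedelta(hours=1)

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/commits",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    params={"since": since.isoformat(), "until": until.isoformat()},
    timeout=10,
)
resp.raise_for_status()

# Each item answers "what changed, when, and by whom" for the window.
for item in resp.json():
    commit = item["commit"]
    print(item["sha"][:8], commit["author"]["date"],
          commit["author"]["name"], commit["message"].splitlines()[0])
```

In practice you wouldn’t run this by hand on a bridge call; the point is that this same change metadata is worth streaming into your Observability data lake so it can be correlated with the rest of your telemetry automatically.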
Each layer can produce many types of data and each type of data has different benefits and use cases, so it’s important to reflect on why and how you’ll use them.
- Logs: Logs range in utility from an exact explanation of internal state to a wild stack trace across a half-dozen libraries you’ve never heard of. They’re relatively simple, they’re typically expensive to store and search, and they’re explicitly called out in dozens of compliance standards. You typically have to deal with them for security and compliance anyway, so logs are a fine place to start your Observability journey.
- Traces: Traces are great: like automatically generated logs with contextual information that helps you cross service boundaries and follow transactions… but they require instrumentation that you might not already have. If you’re able to get traces, great; let’s talk about how to direct them into Observe (a minimal instrumentation sketch follows this list). If you’re not, you can always return to this. Traces can provide better visibility, but letting “perfect” be the enemy of “good enough” rarely ends well.
- Metrics: Metrics are mostly valuable in aggregate, but they can be expensive to manage; many systems will suggest sampling, but how can you tell whether that’s necessary or useful until you’ve looked at the data? Once again, separating compute from storage helps, by letting you gather data at full fidelity and use it at a level that makes economic sense for your team.
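If you don’t have traces yet, the instrumentation hurdle is usually smaller than it looks. Below is a minimal sketch using the OpenTelemetry Python SDK that exports spans over OTLP to whatever collector endpoint you point it at; the service name and endpoint here are placeholders, and auto-instrumentation packages can remove most of this boilerplate.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder service name and collector endpoint; point these at your own
# OpenTelemetry Collector, which can then forward spans wherever you need.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_order(order_id: str) -> None:
    # Each span records timing plus attributes you can pivot on later.
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_card"):
            ...  # call the payment service
        with tracer.start_as_current_span("update_inventory"):
            ...  # call the inventory service

handle_order("ord-1234")
```

The nested spans are what make traces more than “automatic logs”: they carry the parent/child relationships that let you follow one transaction across services.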
Data Observability 101
We often see people blocked from doing the simple but effective stuff by a lack of access to their data, so we recommend starting this process with the KISS principle in mind. Basic use of Observe’s Explorers and some simple dashboards might be enough to kick-start things you’d never expect.
- Log Explorer provides a search-bar-driven gateway to your data. Use spreadsheet-like filtering tools, the OPAL search language, and powerful pattern detection to see what’s there.
- Trace Explorer answers the need for transactional awareness over time, surfacing your RED metrics (rate, errors, duration) and helping you drill into exactly what’s happening with an intuitive flame chart.
- Metrics Explorer puts charts front and center: align and visualize your metrics at speed and scale with an easy to use expression builder, then move your results to dashboards and monitors.
- Resource Explorer helps you navigate from symptom to cause: find common denominators, follow links across abstraction layers, and maintain context despite irrelevant changes.
Whatever you uncover with an explorer can easily be added to a dashboard or monitor, tested out, and refined.
Sustaining and Improving your Observability Tooling
Use the tool, and regularly evaluate its performance. In fact, you can use Observe to review usage of Observe, so use our Usage Dashboard as a starting point. Look at whether your team is using the dashboards and explorers that you expected, and talk about them in your regular meetings.
- What could make it more useful?
- What could make it more efficient?
Finally… identify the next attainable goal. There’s probably another application out there that you need to reduce MTTR for, or maybe your Mean Time to Detection for security incidents needs Observability, so start again with this new goal. Every problem you solve in Observe becomes a foundation for the next one, because more data leads to more contextual awareness between resources.
Take Advantage of The Observability Cloud
Observe is uniquely able to provide “Big O” Observability, because we have built a modern platform for that purpose with better economics, features, and scalability. The Observability Cloud meets you where you are, onboarding the logs, metrics, and traces that you’re already relying on and giving you familiar interfaces to get started with. You don’t have to start over to start with Observe, we’ll help you to shift current loads and then start your journey from a better footing.
Come chat with us in person at an upcoming Observability Event, or try Observe out on your own!