A Day In The Life Of: Pete Fritchman, Infrastructure Engineer
“The only difference between an SLA and an SLO is a lawyer.”
At many orgs, site reliability engineering and observability go hand in hand. However, SREs are like rare Pokémon: everyone wants them, but they can be hard to get. In our 2023 State of Observability report, 68% of organizations said they plan to hire SREs in the next 12 months. The reality for many organizations is that the relationship between SRE, DevOps, and plain old infrastructure roles can be blurry, and responsibilities can fall anywhere. Software engineering, DevOps, SRE, and infrastructure Ops teams all use observability tools, although their needs vary.
We sat down with Pete Fritchman on the Observe infrastructure team to find out what his day-to-day is like, and what responsibilities an infrastructure engineer/SRE hat-wearer at a modern cloud-native company has.
What’s your role and how’d you get there?
Pete: We’re responsible for wearing all the infrastructure hats. So everything from development tools and build pipelines to internal and production environments, and then monitoring our SLOs and responding to incidents. I guess my career path was a long one, because I’m old, but I mostly started out on the sysadmin side of things, always with a focus on tool-writing, kind of before SRE was a thing back in the early 2000s. And then I did official SRE roles for a while, and I’ve gone back and forth between what I would consider infrastructure engineering and SRE. I had a small stint being a manager and running teams, and then went back to being an individual contributor / tech lead.
What’s your typical day like?
Pete: A lot of it depends on what I’m doing that week. Part of the job is on-call and there’s a couple of different kinds of on-call we have here, and then there’s not being on-call. So if I’m on call, it’s maybe following up on stuff that I got paged on recently. When you’re on-call, it’s definitely about keeping the lights on and that is your explicit role. But when you’re not on-call there’s definitely other goals and you can probably tie most of your goals to SLOs.
There are a lot of interrupts in our world, so we have an infra interrupts rotation. The goal is to focus those interrupts on one or two people each week to avoid everyone context switching (because that gets expensive). Working on a project and then stopping over and over again is not a great way to be productive. If I have to look at a problem someone else is having, it’s not just a glance at it to say “oh, yeah, you gotta change this line from this to that.” Some of our tools and processes are pretty complex and you really have to get in a state of mind. It’s just expensive to stop doing what you’re doing, because you lose momentum.
So when you’re on for interrupts on-call – you kind of just give up on project work for the week and go “Okay, I’m gonna have a week of, you know, 5 minute or 10 minute tasks that pop up.” So I’m fixing stuff and generally not sitting down to start a big project or anything like that, so it’s looking at Slack, email, and Jira. Figuring out what’s the low-hanging fruit I can pick up and work on until the next thing pops up that’s urgent.
On weeks where I’m not on call there’s a certain number of projects in flight at any given time. So then I’m trying to look at my day and figure out where my big blocks of time are. And figure out if I’m gonna work on this code or review this design, or write this doc.
What’s the difference between SRE/DevOps/Ops?
Pete: Ops is like the old version of SRE, right? No one really wants to be IT Ops anymore, because it’s got a bad connotation, but really operations is what we do, just with a new way of doing it. When people ask, “What’s the difference between SRE and DevOps?” I think SRE is just a very opinionated way of doing the job. It’s not to say non-SREs are doing it wrong, but SRE has a very different style of how to do things right, mostly focused around SLOs.
So everything should be measured. An obsession with metrics is part of the difference, but in a very specific way. Like, you can have lots of metrics but still have no data. So SLOs are about getting metrics that are actually meaningful, which is very challenging, and then doing meaningful things with those metrics once you have them. So I think a really mature SRE org makes all of its decisions around SLOs, and its post-mortems are SLO-based. What we focus on, what we work on, what new software comes in: everything kind of has some focus on what it means for SLOs, the promises we make to the rest of the org. I think that’s the biggest difference between an SRE team and a non-SRE team. Did we hold up our end? Is this change going to make us better?
Are SLOs really that important?
Pete: You have to come up with ways to look at the service from the user’s point of view. Old-school monitoring is like, “Is my web server running? Yes, cool.” Then you think everything is fine, but if your web server is serving errors then everything actually isn’t fine, right? So it turns out that the SLI might actually be that you’re serving requests, and that they’re successful and they’re completing within 200ms, or whatever your latency threshold is for that service. And so the SLI is just the ratio of good things to total things, thinking about it very simply. So say I served 10,000 requests yesterday and 9,900 of them were successful. My SLI for yesterday was 99%, and the SLO is the threshold you want your SLI to stay at or above. So when someone says, “I want a three nines service,” that doesn’t mean anything on its own.
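To make the "good things over total things" ratio concrete, here is a minimal Python sketch of an availability-and-latency SLI. The `Request` type, the `is_good` rule (no server error, served within the threshold), and the sample traffic are illustrative assumptions, not how Observe or any particular tool computes an SLI; the 200ms figure is just the example threshold from the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


LATENCY_THRESHOLD_MS = 200.0  # assumed per-service latency threshold


def is_good(req: Request, threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """Assumed rule: good means no server error AND fast enough."""
    return req.status < 500 and req.latency_ms <= threshold_ms


def sli(requests: list[Request]) -> float:
    """Ratio of good requests to total requests.

    With no traffic there is nothing to fail, so this sketch returns 1.0.
    """
    if not requests:
        return 1.0
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)


# 10,000 requests, of which 9,900 are good: 60 errors and 40 slow responses.
yesterday = (
    [Request(200, 50.0)] * 9_900
    + [Request(500, 50.0)] * 60
    + [Request(200, 450.0)] * 40
)
print(f"SLI for yesterday: {sli(yesterday):.2%}")  # SLI for yesterday: 99.00%
```

Note that a slow success counts against the SLI just like an error does, which is the point of looking at the service from the user's side instead of asking whether the web server is running.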
But if your SLI is about availability and latency, and you want three nines on that, that means that over some given measurement period (usually a month or 28 days), the SLI ratio has to come out at 99.9% or better for us to be meeting or exceeding our commitment. Then there’s the relation to SLAs, which are the thing you put in a contract the customer is signing. The only difference between an SLA and an SLO is a lawyer. Meet an SLA but miss an SLO, and you should have retrospectives and write postmortems. Miss an SLA, though, and you probably write a check… Your SLOs, in a perfect world, should always be more aggressive than your SLAs, because you don’t want to find out that things are wrong only after you’re writing checks and apologizing to customers.
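The relationship between the SLI, the internal SLO, and the contractual SLA can be sketched as a simple classification over one measurement window. This is a rough illustration only: the `evaluate_window` function, the `Outcome` names, and the 99.5% SLA figure are hypothetical, chosen so that the SLO (99.9%) is stricter than the SLA.

```python
from enum import Enum


class Outcome(Enum):
    MEETING_SLO = "meeting the SLO"
    MISSED_SLO = "missed the SLO, SLA intact: hold a retrospective"
    MISSED_SLA = "missed the SLA: contractual consequences"


def evaluate_window(good: int, total: int,
                    slo: float = 0.999, sla: float = 0.995) -> Outcome:
    """Classify one measurement window (e.g. 28 days) of request counts.

    The SLO must be at least as strict as the SLA, so an SLO miss
    shows up before an SLA miss does.
    """
    if slo < sla:
        raise ValueError("SLO should be at least as strict as the SLA")
    sli = good / total if total else 1.0  # no traffic: nothing failed
    if sli >= slo:
        return Outcome.MEETING_SLO
    if sli >= sla:
        return Outcome.MISSED_SLO
    return Outcome.MISSED_SLA


# One million requests in the window, with different numbers of bad ones.
for bad in (500, 3_000, 10_000):
    total = 1_000_000
    outcome = evaluate_window(total - bad, total)
    print(f"{bad:>6} bad requests -> {outcome.value}")
```

The gap between the two thresholds is the early-warning zone: the middle case (99.7%) triggers a postmortem while the contract is still being honored.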
How were you thinking about observability before joining Observe?
Pete: That’s a loaded question. There are the three pillars (metrics, logs, and traces), but in my experience Observability validates what you’re doing right. Like if I run a service in production: okay, does it work? You need Observability to answer that. You need the ability to answer whether things are working, and if they’re not working, you generally need Observability to help you understand why. And if you’re doing capacity planning, you need Observability. Pretty much everything you do in my world somehow leads back to Observability. If your Observability tools are broken, it’s very hard to do much of anything.
In this industry our needs are constantly evolving. It’s a complex topic, because things we have with Observability today that are super standard we never had 15 years ago. Not that we didn’t need them then, we just didn’t know they were a thing. No one had SLOs 15 years ago, but everyone needed SLOs 15 years ago. No one knew what they were.
What’s the difference you’ve experienced with Observe?
Pete: I think Observe does a lot of things that I’ve wanted out of Observability tools for a while. I think the biggest thing is data modeling. You know, you want to look up a certain thing about a user doing a transaction. In other tools you just kind of know the magic string to type because you’ve done it a hundred times, not because it necessarily makes sense. You develop muscle memory around searching for data.
The difference in Observe is that everything is much more explorable. You can say “I have this user and I wonder what transactions they’ve made” and there’s a linked Dataset called transactions. That’s probably it. And in a legacy tool it’s like, “Okay, what do I join on? Do I do it on request ID or user ID or transaction ID, or like two of them, or none of them?” And it’s just much less organic in other tools. People are pretty aware of the limitations in their Observability tools, and if you ask them what they don’t like, they’d be able to talk for hours about it. They may not know what they want, but they know what they don’t want. Hearing that all your data is in a single pane of glass doesn’t really mean anything to me. Being able to query all of your data with the same language and relate all of your data together does.