O11y Investigator – Agentic AI to Drive Faster Resolution

By Qi Jin | September 26, 2024

Investigating incident alerts can be repetitive and yield varying results. Each time an SRE receives an alert, they look up the associated runbook and follow the same steps repeatedly. Often, they discover that the alert was merely system noise and doesn’t yield any actionable insight about their system or surface issues to address. Other times, something really has gone wrong in the distributed system, and they have to execute the runbook step by step and fix the underlying root cause. In these cases, the time spent triaging and troubleshooting slows down the resolution.

The motivation behind building O11y Investigator was to simplify and accelerate incident investigations: how can a team of AI agents work alongside on-call engineers during investigations, identify the appropriate runbooks, and automatically execute the repetitive but crucial steps? This blog covers our unique approach to incident investigations and the design decisions we made to make AI agents as effective as possible.

To understand how the AI is used, let’s walk through an investigation scenario starting with an on-call engineer getting an alert that a business metric – the number of unit sales in a particular region – is down. They need to find the root cause of the incident and resolve it as soon as possible, but where do they start? The problem could lie in the distributed Kubernetes infrastructure or in any microservice on the execution path of an order purchase. Was there a recent code commit that could have caused this issue? Is an intervention required, or will the self-healing nature of cloud-native apps resolve it automatically? An experienced human team member, with some training, can think through all of these scenarios and investigate, but our goal is to build an AI system that aids and collaborates with human on-call engineers to reason through these scenarios and effectively drive investigations.

Many solutions in the market take a zero-shot approach, giving the LLM a description of the alert and asking what could have gone wrong. While LLMs can process language exceptionally well, without specific knowledge of the system in question or the ability to interact with it directly, their insights are often limited. As a result, they tend to generate plausible-sounding responses, but these responses may not be grounded in the actual state of the system. This lack of specificity can lead to hallucinations—where the model generates incorrect or irrelevant information—and guesses that might sound logical but don’t advance the investigation of the actual issue.

Advantages of an Agentic Workflow

A better approach is to give the AI the ability to call tools that gather the information it needs to surface useful insights about your system. OpenAI, Anthropic, and other LLM providers support this approach: you tell the LLM what tools it has at its disposal and what arguments they take, and it returns a JSON payload containing the tool’s name and arguments.

The following is an example of a tool specification for OpenAI:


```
{
    "name": "retrieve_dataset",
    "description": "Get the most relevant OPAL dataset for this incident.",
    "parameters": {
        "type": "object",
        "properties": {
            "incident_description": {
                "type": "string",
                "description": "The description of the incident."
            }
        },
        "required": ["incident_description"],
        "additionalProperties": false
    }
}
```
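
For concreteness, here is a minimal sketch of how a specification like this might be registered with the model using OpenAI’s Python SDK; the model name and user message are placeholders rather than anything specific to O11y Investigator:

```
# Minimal sketch: registering the tool above with OpenAI's chat completions API.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "retrieve_dataset",
            "description": "Get the most relevant OPAL dataset for this incident.",
            "parameters": {
                "type": "object",
                "properties": {
                    "incident_description": {
                        "type": "string",
                        "description": "The description of the incident.",
                    },
                },
                "required": ["incident_description"],
                "additionalProperties": False,
            },
        },
    }
]

# Placeholder alert text for illustration.
messages = [{"role": "user", "content": "Investigate: the APIServer was OOMKilled."}]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=messages,
    tools=tools,
)
```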

After telling the model that it can call this tool, OpenAI’s model would return a JSON response that contains a list of tool calls like this:



```
{
  …
  "tool_calls": [
    {
      "id": "call_82147235",
      "type": "function",
      "function": {
        "arguments": "{\"incident_description\": \"The APIServer was OOMKilled.\"}",
        "name": "retrieve_dataset"
      }
    }
  ]
  …
}
```
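
On the application side, the returned tool call still has to be parsed, executed, and fed back to the model. Continuing from the sketch above, that dispatch step might look roughly like this, where `retrieve_dataset` is a hypothetical stand-in for the real dataset lookup:

```
import json

def retrieve_dataset(incident_description: str) -> str:
    # Hypothetical stand-in: return the name of the most relevant OPAL dataset.
    return "kubernetes/apiserver"

# `response` and `messages` come from the earlier sketch.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = retrieve_dataset(**args)

# Append the assistant's tool call and the tool result so the model can
# continue the investigation in its next turn.
messages.append(response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": result,
})
```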

Delegating To Different Agents

The first instinct after learning about LLM tools might be to just give a single AI agent a bunch of tools and call it a day. However, to get good performance, you would need to spend a large amount of your input context length just explaining the mechanics of “how to write an OPAL query”, “how to read through code commits”, and “how to use kubectl”. In addition, if the task at hand requires writing an OPAL query, then the other instructions are at best a waste of money and added latency, and at worst a confusing distraction to the LLM. Thus, it is useful to have an individual agent for writing OPAL, another for reading code, and another for understanding your Kubernetes infrastructure. Once you have multiple agents, you then need an orchestration agent that figures out which subagent to call for different tasks. Each subagent can be described to the orchestration agent as a tool to call, where the instruction to the subagent is filled in as an argument to the tool.
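
As an illustration of that pattern, each subagent can be exposed to the orchestration agent as a tool whose only argument is the instruction to delegate; the subagent names and descriptions below are hypothetical examples, not the product’s actual interface:

```
# Hypothetical sketch: each subagent is described to the orchestration agent
# as a tool that takes the delegated instruction as its single argument.
def subagent_tool(name: str, description: str) -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    "instruction": {
                        "type": "string",
                        "description": "The task to delegate to this subagent.",
                    },
                },
                "required": ["instruction"],
                "additionalProperties": False,
            },
        },
    }

orchestrator_tools = [
    subagent_tool("opal_agent", "Writes and runs OPAL queries against datasets."),
    subagent_tool("code_agent", "Reads recent code commits and summarizes relevant changes."),
    subagent_tool("kubernetes_agent", "Inspects Kubernetes infrastructure, e.g. via kubectl."),
]
```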

Planning and Reacting

The difference between an agentic workflow and a normal LLM application is that we expect AI agents to exhibit some form of reasoning, take actions in their environment, and react to changes in their environment caused by those actions. Eliciting “reasoning” behaviors from LLMs is still a very active field of research, but one common approach to increasing reasoning quality is simply to ask the LLM to plan out the actions it wants to take, thinking step by step. This investigation plan has two purposes: it increases the agent’s ability to select the appropriate tool, and it lets human investigators looking at the shared investigation notebook understand what the AI agent is trying to do. However, simply asking the agent to plan all of its actions and then execute them one by one according to a static plan means it cannot react to changes in its environment. For example, during the course of the investigation, our AI investigator might find important information that affects the subsequent actions it should take (e.g. it might find that one service has a lot of errors, so subsequent tool calls should focus on that service specifically). To solve this issue, we ask the LLM to re-plan what it wants to do before it calls each tool.

Our AI investigation workflow thus starts with the AI agent planning what it wants to do. The orchestration agent then re-plans and calls a tool or subagent, which adds a new block to the investigation notebook; the orchestration agent reads the newly updated notebook and starts the loop again, until it finally calls the tool we’ve given it that signals the flow is complete.
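
In rough pseudocode, that loop could be sketched as follows; the `orchestrator`, `notebook`, and `finish_investigation` names are illustrative assumptions, not the actual implementation:

```
# Illustrative sketch of the plan -> act -> observe loop described above.
def run_investigation(orchestrator, notebook, alert):
    # Initial step-by-step plan, recorded in the shared notebook so human
    # investigators can follow what the agent is trying to do.
    plan = orchestrator.plan(alert, notebook)
    notebook.add_block("plan", plan)

    while True:
        # Re-plan before every tool call so the agent can react to what it
        # has learned so far (e.g. one service showing a lot of errors).
        next_step = orchestrator.replan(notebook)
        tool_call = orchestrator.choose_tool(next_step, notebook)

        if tool_call.name == "finish_investigation":
            break  # the agent signals that the flow is complete

        # Calling a tool or subagent adds a new block to the notebook,
        # and the loop starts again from the updated notebook.
        result = tool_call.execute()
        notebook.add_block(tool_call.name, result)
```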

Get Started with O11y Investigator

By using an advanced Agentic AI approach to reason, make decisions, and coordinate complex workflows, O11y Investigator drives faster, more precise incident investigations and resolution, allowing your teams to reduce MTTR and deliver better customer experiences. O11y Investigator is available in preview. If you are not already using Observe, start your journey with a free trial today.