Migrating from SPL to OPAL

By Jack Coates, January 17, 2024

It’s no secret that every organization is interacting with customers and suppliers through the Internet. Furthermore, new microservices architectures are increasing the complexity of applications. Microservices change in production every day, potentially giving rise to unpredicted behaviors. These ephemeral services are also very difficult to manage with pre-cloud tools, because the volume of telemetry and log data they generate is orders of magnitude larger. Legacy tools were not designed for the volume or complexity of investigating unknown problems in distributed applications, so their costs are soaring while their utility drops.

Organizations have been working with their legacy vendors to squeeze more life from their investment by moving the same architectures to the cloud. Unfortunately, the cost realities of that lift-and-shift approach reduce the maximum data ranges and search complexity, so everyone has to scramble for solutions. Data filtering, sampling, and rerouting with tools like Cribl or ObservIQ in the ingest pipeline, reduced retention times, and home-grown hierarchical storage rotation mechanisms are the most popular approaches; but the demand for more data at higher fidelity just keeps growing. You need an architectural solution to an architectural problem.

One way that organizations have been trying to solve this is direct data lake usage: putting everything into Snowflake, Databricks, or ClickHouse and then writing SQL whenever the data needs to be used. There are some drawbacks to that path, some of which are addressed in this article about SIEM migration from Splunk to SQL. We’d like to expand on the non-streaming issue raised there, with a path that avoids the problems of converting streaming data processing languages to non-streaming database queries. The economic motivations for switching your data backend to a system like Snowflake are clear; but how do you avoid losing the value captured in your SPL apps and searches?

At Observe, we have invested heavily in solving that exact problem, writing our own powerful streaming language to run on top of Snowflake’s backend. OPAL, the Observe Processing and Analysis Language, exists so that security analysts can perform SIEM tasks without having to shift their mental model from streaming queries to static SQL. OPAL allows you to stream and accelerate datasets, perform search-time or advanced transformations, and easily convert your desired outcomes from Search Processing Language.
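Before we dig into translations, here is a minimal, illustrative sketch of the OPAL pipeline style; the column names (status, user) are hypothetical, so treat it as a feel for the language rather than a recipe:

// keep only failed events, then count them per user (hypothetical columns)
filter status="failed"
statsby failures:count(), group_by(user)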

New languages can take some effort to learn, of course, so another critical area of investment for us is O11y GPT, our AI helper bot. The more we can do to ease friction, the better! Observe’s O11y is very popular with our users, who use it regularly as a language reference and search-editing copilot. In turn, we’ve been able to use those experiences to drive language improvements, providing new functions and verbs where they’re useful to meet real-world expectations. Our own developers and data engineers use this technology as well as our customers, and we’re continually improving its ability to translate from other languages to OPAL.

O11y GPT helps with SPL translation

Example Search Translations

Now, let’s walk through some common Splunk security search examples to see how Observe can achieve the same outcome as the SPL with minimal translation required.

Example 1

In the first example, we’ll look for a running service named WZCSVC on a critical host with a category tag containing the characters “pci”.

SPL

Name=WZCSVC State=Running host_priority=critical host_category=*pci* 

OPAL

filter Name="WZCSVC" and State="Running" 
    and host_priority="critical" and host_category~pci

There are two differences you might notice here: in OPAL we explicitly use a filter verb, which allows great flexibility in including or excluding data, and that verb uses explicit Boolean operators rather than assuming an implicit AND between terms.
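As a quick, hypothetical sketch of that flexibility (the values here are illustrative, not part of the example above), the same verb handles both inclusion and exclusion:

// keep rows that match, then drop an unwanted subset
filter host_priority="critical"
filter not host_category~test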

Example 2

The second example is a brute force detection: it looks for more than six failed authentication events grouped with at least one successful authentication, and it filters out Windows service accounts (names ending with $). This search must be run over a narrow time window, because it will generate false positives if given too many minutes of data to look at.

SPL

tag=authentication NOT (action=success user=*$)
| fillnull value=unknown action,app,src,src_user,dest,user 
| chart count over src by action 
| search failure>6 success>0

OPAL

filter tag=authentication and not (action="success" and user ~ "$")
make_col action:if_null(action, "unknown"), app:if_null(app, "unknown"), 
    src:if_null(src, "unknown"), src_user:if_null(src_user, "unknown"), 
    user:if_null(user, "unknown"), dest:if_null(dest, "unknown")
statsby failure:sum(if(action="failure", 1, 0)), 
    success:sum(if(action="success", 1, 0)), group_by(src)
filter failure>6 and success>0

In OPAL we make columns and then use functions to say what data goes in them: make column src and fill it with the existing value of src, or with the literal "unknown" if src is null.
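As a minimal, hypothetical sketch, the same pattern extends to any default value or fallback expression (the fall-back from src_user to src below is illustrative, not part of the example above):

// fill a missing column with a literal, or fall back to another column
make_col src:if_null(src, "unknown"),
    src_user:if_null(src_user, src)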

Example 3

Let’s take it up a notch: this example looks at a common SIEM technique. Because raw access logs are too voluminous and high-cardinality to apply correlation searches to directly, the SIEM extracts attributes (or facets, if you’re coming from an operational metrics model) into a separate dataset. In Splunk Enterprise Security, that’s a multi-part operation: first we structure the events into an accelerated datamodel (which is beyond the scope of this document), then we read events back out of that datamodel and write them to a lookup table, and finally we use the lookup to ask whether there’s a match against a threat list of Tor nodes.

SPL Lookup Generator – maintain a lookup tracker by reading from a datamodel

| tstats summariesonly=true min(_time) 
as firstTime,max(_time) as lastTime from datamodel=Network_Traffic where
     All_Traffic.action=allowed by
     All_Traffic.src,All_Traffic.dest 
| `drop_dm_object_name("All_Traffic")` 
| inputlookup append=T src_dest_tracker 
| stats min(firstTime) as firstTime,max(lastTime) as lastTime by src,dest
| outputlookup src_dest_tracker

SPL Correlation Search – use the lookup to ask a question
| inputlookup append=T src_dest_tracker 
| lookup local=true ip_tor_lookup src OUTPUTNEW src_ip as src_tor_ip,src_is_tor 
| lookup local=true ip_tor_lookup dest OUTPUTNEW dest_ip as dest_tor_ip,dest_is_tor 
| search dest_is_tor=true OR src_is_tor=true 
| eval tor_ip=if(dest_is_tor=="true",dest_tor_ip,tor_ip) 
| eval tor_ip=if(src_is_tor=="true",src_tor_ip,tor_ip) 
| fields + sourcetype,src,dest,tor_ip

OPAL - both operations in one step

timechart count:count_distinct(name), group_by(local_address, remote_address)
make_col src_64:int64(ipv4(remote_address)),
    remote_prefix_mask:floor(int64(ipv4(remote_address))/pow(2, 16), 0)
make_col dest_64:int64(ipv4(local_address)),
    local_prefix_mask:floor(int64(ipv4(local_address))/pow(2, 16), 0)
join on ((remote_prefix_mask = @"Threat Intel/Unified IPv4".tip_ip_prefix_mask 
    and src_64 >= @"Threat Intel/Unified IPv4".tip_ip_range_start 
    and src_64 <= @"Threat Intel/Unified IPv4".tip_ip_range_end) 
    or (local_prefix_mask = @"Threat Intel/Unified IPv4".tip_ip_prefix_mask 
    and dest_64 >= @"Threat Intel/Unified IPv4".tip_ip_range_start 
    and dest_64 <= @"Threat Intel/Unified IPv4".tip_ip_range_end)),
    tip_provider:@"Threat Intel/Unified IPv4".tip_provider

In Splunk, that multi-step work is necessary to handle scale and to work out temporal relationships. Observe was built with both scale and time-aware resources in mind, so you can express temporal relationships in a single step. What was talking with what, and did it involve Tor? It’s just a dataset definition.
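As a small, hypothetical follow-on sketch, once the join above has populated tip_provider, the connections that matched a threat intelligence entry can be kept with one more verb:

// keep only connections that matched a threat-intel entry (illustrative only)
filter not is_null(tip_provider)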

Example 4

One last example: what if you want to find out what percentage of your traffic is exchanged with Tor nodes? To keep things brief, both snippets below omit the evaluations from Example 3.

SPL

...

| eventstats count as "totalCount"
| eval withTor=coalesce(src_tor_ip,dest_tor_ip)
| eventstats count as "withTor" by withTor 
| eval percent=(withTor/totalCount)*100 
| stats values(withTor), values(percent) by totalCount

OPAL

...
statsby totalCount:count(), 
    group_by(tip_provider, local_address, remote_address, src_tor_ip, dest_tor_ip)
make_col total:window(sum(totalCount), group_by())
make_col percentage:100*totalCount/total

Meeting Security Team Requirements

Let’s take a step back from translating streaming languages, though, and ask what the security team’s needs actually are.

  1. The most important thing any security analyst needs, whether they’re in threat hunting or detection response, is data. As much data as possible: from all the sources, for all the time intervals, even if that data wasn’t considered security-relevant when it was generated. 
  2. The second most important thing? Powerful, intuitive search. This is the combination that made Splunk valuable, and it’s the combination that Observe on Snowflake provides. Using different tools for threat hunting and detection response is like using different tools for operations and security: it looks good on a spreadsheet and fails in the real world.

If the threat hunter has to switch platforms to write a detection rule that then won’t work anyway because the data is missing… that’s a real world fail.

If the SOC analyst’s escalation can’t be used directly by the incident responder because they’re working in different languages on different data sets… that’s a real world fail.

Customers need to work with security and operational data in one tool so they can pivot between contexts as needed. Our research [State of SecOBVS] indicates that half of incidents get escalated; each one of those means a human has to review data, understand context, and make decisions. How is that ever going to work well across two different data lakes with two different search languages? Migration to cloud technology and adoption of servers-as-cattle methodologies were supposed to reduce the need for security incident escalation by providing hermetically sealed tech stacks; but that migration is at least halfway done, and it hasn’t slowed security agent deployments. Security analysts should be able to work in the same datasets with the same visibility as their operational counterparts.

Why You Want To Migrate From Your SIEM

The thing is, everyone uses SIEM, but everyone also hates their SIEM; it’s the service desk of security. SIEM has been positioned as a content- and integration-rich entry point that gives access to dozens of rules and add-ons specific to the other products your organization runs. The reality is that every integration has versioning and configuration requirements, every rule only works with properly abstracted data, and every alert expects that the customer can decide whether it’s important or not. SIEMs are a costly sink of resources: security data has to be transformed before the SIEM can use it, no one knows if that transformation breaks, and no one has the time to look at most of the alerts anyway. TL;DR: SIEMs don’t work without professional services.

At Observe, we have always viewed Observability as a data problem, so we created a data company. Observe has a single, integrated product: The Observability Cloud, which eliminates the silos of logs, metrics, and traces by storing all data in a true data lake built on low-cost cloud storage. Observe also understands time and resources, which makes data understandable and navigable by humans in ways that older products struggle to achieve. Context is critical in any investigation, and Observe makes it immediately accessible across far longer time ranges and far greater complexity.

Organizations are hungry for a clearer answer to security than the traditional SIEM. Observe can be that answer. We’ll start from a clean slate, work with your data in its native form, and focus on what you need.