Why many observability products struggle with tracing data (hint: it’s the architecture)

By Rakesh Gupta, July 16, 2024

At Observe, we have tracing customers who:

  • Send us traces for complex asynchronous workloads, sometimes lasting more than 24 hours with thousands of individual spans.
  • Search for traces over a year in the past, to do historical analysis and for compliance.
  • Realize more than 50% cost savings for tracing use cases from their previous tracing tool.

Before we brought our Trace Explorer to market, all of this was unheard of in the tracing market. How did we make it possible when others couldn’t? The answers lie in our architecture, and in the fine print of other products’ pricing pages and documentation.

A trace with 8,000+ spans that models a complex, multi-stage async data pipeline in our own backend.

Trace assembly limits the types of traces that can be ingested

In many asynchronous workloads (think pub-sub architectures, data pipelines, batch jobs, etc.), each unit of work may be done minutes or hours apart. This is a problem for other tracing products. From New Relic’s docs on their tracing architecture:

“Once the first span in a trace arrives, a session is opened and maintained for 90 seconds. With each subsequent arrival of a new span for that trace, the expiration time is reset to 90 seconds. Traces that have not received a span within the last 90 seconds will automatically close. The trace summary and span data are only written when a trace is closed.”

That’s for their standard tracing architecture. Their “Infinite Tracing” architecture is even more limited than this. From the same docs page:

“The trace observer holds traces open while spans for that trace arrive. Once the first span in a trace arrives, a session is kept open for 10 seconds. Each time a new span for that trace arrives, the expiration time is reset to 10 seconds. Traces that haven’t seen a span arrive within the last 10 seconds will automatically expire.”

This process is known as trace assembly, and even tracing-centric products such as Lightstep do it, as their docs describe. In those products’ architectures, trace assembly at ingest time is required for traces to be efficiently searchable and retrievable. This places limits on trace structure and duration, which in turn limits the use cases for tracing in those products.

If you were using a product that does trace assembly at ingest to monitor and inspect highly asynchronous, long-lived workloads, each transaction would be split into many traces, making it much harder to visualize and troubleshoot.
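To make that failure mode concrete, here is a minimal Python sketch (not any vendor’s actual implementation) of an ingest-time assembly session with a 90-second idle timeout like the one described above. Spans from a single logical transaction that arrive more than 90 seconds apart get flushed as separate partial traces:

```python
# Minimal model of ingest-time trace assembly with an idle timeout.
# Illustrative only; real ingest pipelines are far more complex.

IDLE_TIMEOUT_S = 90  # session closes 90s after the last span arrives

def assemble(spans):
    """spans: list of (arrival_time_s, span_id) for ONE logical trace,
    sorted by arrival time. Returns the partial traces the backend
    would write out each time an assembly session expires."""
    partial_traces, current = [], []
    last_arrival = None
    for arrival, span_id in spans:
        if last_arrival is not None and arrival - last_arrival > IDLE_TIMEOUT_S:
            partial_traces.append(current)   # session expired: flush what we have
            current = []
        current.append(span_id)
        last_arrival = arrival
    if current:
        partial_traces.append(current)
    return partial_traces

# An async pipeline whose stages run minutes apart:
spans = [(0, "enqueue"), (5, "validate"), (600, "batch-job"), (4200, "publish")]
print(assemble(spans))
# [['enqueue', 'validate'], ['batch-job'], ['publish']] -- one transaction, three "traces"
```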

Your vendor might tell you that distributed tracing is not designed for highly asynchronous or long-lived workloads, but that’s usually because their architecture can’t handle them. They may suggest a workaround: use OpenTelemetry span links to stitch multi-trace workloads into a single logical unit. That places a considerable additional burden on customers, who must manually instrument span links, and it still depends on vendors improving their minimal support for them. Most vendors merely list the span links attached to a given span, rather than supporting what customers really want, which is to visualize the entire transaction as a single trace. Good luck troubleshooting your asynchronous workloads one sliver at a time.
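For reference, the workaround looks roughly like this with the OpenTelemetry Python API. The producer and consumer span names here are hypothetical; the point is that the consumer side has to explicitly carry and attach the producer’s span context as a link every time work is handed off:

```python
# Sketch of the span-link workaround for async hand-offs (OpenTelemetry Python API).
from opentelemetry import trace

tracer = trace.get_tracer("example.async.pipeline")  # hypothetical instrumentation name

# Producer: publish a message and pass its span context along with the payload.
with tracer.start_as_current_span("publish-job") as producer_span:
    producer_ctx = producer_span.get_span_context()
    # ... enqueue the message, serializing producer_ctx alongside it ...

# Consumer: runs minutes or hours later, in a NEW trace, linked back to the producer.
with tracer.start_as_current_span(
    "process-job",
    links=[trace.Link(producer_ctx)],   # manual stitching, span by span
) as consumer_span:
    pass  # ... do the work ...
```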

Tightly-coupled compute and storage make it hard to retain data for long periods

Datadog calls their tracing architecture Tracing Without Limits™. Let’s put it under the microscope.

From Datadog’s pricing page, one APM license gives you:

  • 150 GB / month of span ingest
  • 1,000,000 spans / month of span retention (equivalent to 800 MB / month*)

* In our customer base, the average size of a compressed span in protobuf format at the time of ingest is about 800 bytes.

They clarify the meaning of ingestion and retention in their pricing FAQ:

“Ingestion means sending your traces to Datadog and having all of them available for Live Search and Analytics for 15 minutes … Retention means storing your most important traces (e.g. error, high latency, or business-critical ones) and making them available for search and analysis during a retention period of your choice (15 days by default).”

In practice this means that less than 1% of your traces are available for longer than 15 minutes. Datadog customers have told us that they’ve had to continually adjust their retention indexing rules because the traces they needed to troubleshoot an incident weren’t being retained. Imagine discovering which trace data you actually needed, but didn’t keep, one incident at a time.

If you need to retain more trace data in Datadog, you can do so for an astronomical price:

  • “7-day retention at $1.27 per million spans per month (billed annually)”
  • “15-day retention at $1.70 per million spans per month (billed annually)”
  • “30-day retention at $2.50 per million spans per month (billed annually)”

And since a compressed protobuf-encoded span at the time of ingest is, again, roughly 800 bytes on average, the effective prices are roughly $1.59/GB for 7-day retention, $2.12/GB for 15-day retention, and $3.12/GB for 30-day retention.
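Here is the back-of-the-envelope conversion behind those per-GB figures, assuming the ~800-byte average compressed span size noted above:

```python
# Convert Datadog's per-million-span retention prices into effective per-GB prices,
# using the ~800 bytes per compressed span average from our customer base.
AVG_SPAN_BYTES = 800
GB_PER_MILLION_SPANS = 1_000_000 * AVG_SPAN_BYTES / 1e9   # 0.8 GB per million spans

for days, price_per_million in [(7, 1.27), (15, 1.70), (30, 2.50)]:
    per_gb = price_per_million / GB_PER_MILLION_SPANS
    print(f"{days:>2}-day retention: ${per_gb:.2f} / GB")

#  7-day retention: $1.59 / GB
# 15-day retention: $2.12 / GB
# 30-day retention: $3.12 / GB
```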

Other products don’t fare much better.

In short, none of these products offers a trace retention period long enough for compliance or historical-analysis use cases, and their tiered pricing suggests that even if they could offer longer retention, it would likely be prohibitively expensive or involve trade-offs. Their hard retention limits, and the way their pricing scales with the retention period, suggest that they are leaning on an expensive hot storage system, perhaps SSDs, to make data searchable with the latencies users expect in observability use cases (i.e., near instantly). Some of these products offer cold storage for longer periods at lower cost, but that shifts a data management burden onto your teams, and it means the data you need is no longer always available.

How Observe’s architecture enables fast, flexible, and economical tracing workflows

The differences between Observe’s architecture and those of legacy vendors begin at ingest. Our streaming ingest pipeline is data-agnostic; it accepts logs, metrics, and traces, as well as any other semi-structured or unstructured data, and leverages NGINX and Kafka clusters to efficiently load data into Snowflake within seconds. This architecture also ensures we can handle sudden spikes in data streams and prevent data loss, while scaling to over a petabyte of ingest volume per day for a single tenant.

The ingest pipeline is so efficient that we offer a usage-based pricing plan where ingest is free and you pay only for what you query (we also offer a more traditional ingest-based pricing model).

Snowflake compresses and stores all data in S3, whether you need it for 15 minutes or 13 months, and it never becomes “cold” – all data is always “hot” and ready for querying regardless of timestamp. Snowflake’s architecture employs a variety of strategies to achieve the near-instant query latencies that observability users demand while using S3 as the data store. Moreover, queries in Observe are compiled down into highly-optimized SQL that extracts the maximum query performance from Snowflake.

When you open a trace in Observe, under the hood we simply issue a query in OPAL, our powerful yet easy-to-use data processing language, that searches for spans matching the given trace ID. We make use of a variety of indexing strategies for data in Snowflake; in this case, an equality index makes these queries fast and efficient, combined with a materialized view that continually maintains the start and end time of each trace as new spans arrive. Our use of Snowflake also takes advantage of the fact that, for observability use cases, data tends to arrive roughly in the order it was generated, which means that micro-partitions in Snowflake will be largely clustered by time.

This means that we don’t need to do trace assembly at ingest time for tracing use cases – we store raw span data in our data tables and only fetch all the spans for a trace at query time. That in turn means we don’t need to place any limits on trace structure or duration.
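As a stripped-down illustration of the idea (in Python rather than OPAL or SQL, with an in-memory list standing in for Snowflake tables, and emphatically not our actual implementation): raw spans are appended as they arrive, a small per-trace summary of start and end times is kept up to date incrementally, and the full trace is assembled only when someone asks for it. There is no session window, so a 24-hour trace is handled the same way as a 24-millisecond one.

```python
# Query-time trace assembly: store raw spans, reconstruct the trace only when queried.
# An in-memory list and dict stand in for the span table and trace-summary view;
# this sketches the concept, not Observe's actual implementation.

span_table = []        # raw spans, appended in arrival order
trace_summary = {}     # trace_id -> (min_start, max_end), updated on ingest

def ingest(span):
    """span: dict with trace_id, span_id, start, end. No assembly window, no timeout."""
    span_table.append(span)
    start, end = trace_summary.get(span["trace_id"], (span["start"], span["end"]))
    trace_summary[span["trace_id"]] = (min(start, span["start"]), max(end, span["end"]))

def get_trace(trace_id):
    """Fetch every span for a trace at query time, however long the trace ran."""
    spans = [s for s in span_table if s["trace_id"] == trace_id]
    return {"summary": trace_summary[trace_id],
            "spans": sorted(spans, key=lambda s: s["start"])}

# Spans for one trace can arrive hours apart -- it makes no difference here.
ingest({"trace_id": "t1", "span_id": "a", "start": 0, "end": 2})
ingest({"trace_id": "t1", "span_id": "b", "start": 3600, "end": 3605})
print(get_trace("t1")["summary"])   # (0, 3605): a one-hour trace, assembled on demand
```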

Economical is a key word here. Take Honeycomb – they have a very compelling tracing-centric observability product with the best out-of-the-box retention and the fewest limits on trace structure of any of the aforementioned vendors. But let’s do the math and convert their pricing (which is event-based) into an equivalent per-GB ingest price:

  • Understanding event costs: ingesting 100 million events would cost $130 using their Pro plan pricing.
  • Defining an event: An event is a span, span link, or a span event, per the FAQ in their pricing page.
  • Average event size: as previously mentioned, the average size of a compressed, protobuf-encoded span at ingest time in our customer base is 800 bytes, which includes span link and span event data for that span.
  • Event composition: In our customer base, there are typically between 5 and 20 span links or span events for every 100 spans. To be generous to Honeycomb, we’ll take the lower value and assume that 95% of events are spans.

Putting those numbers together, we find that Honeycomb effectively charges about $1.71 per GB of span ingest:

$130 / (100 million events × 95% spans per event × 800 bytes per span × 1 GB / 10⁹ bytes) = $130 / 76 GB ≈ $1.71 / GB
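Or, as a quick check in code (same assumptions as above):

```python
# Back-of-the-envelope conversion of Honeycomb's event pricing to a per-GB span price.
PRICE_PER_100M_EVENTS = 130.0   # $ for 100 million events on the Pro plan
SPAN_FRACTION = 0.95            # assume 95% of events are spans (generous, per above)
AVG_SPAN_BYTES = 800            # avg compressed protobuf span size in our customer base

span_gb = 100e6 * SPAN_FRACTION * AVG_SPAN_BYTES / 1e9   # 76 GB of span data
print(f"${PRICE_PER_100M_EVENTS / span_gb:.2f} / GB")    # $1.71 / GB
```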

This is many times our price for span data ingest, if you use our ingest-based pricing plan (our usage-based plan often has even better economics). Ultimately this means that if you use Honeycomb, you still need to make a series of trade-offs about how much to store and for how long. We’ve heard from customers migrating from Honeycomb that they had to sample so aggressively (sometimes more than 1,000 to 1) to make the economics work that they missed critical data. In other words, they faced the same problems that Datadog customers told us they encountered, namely that the data they really needed to get to the root cause of critical incidents was not available during troubleshooting.

Our unparalleled tracing capabilities are made possible by a modern architecture that separates storage from compute and eliminates the need for trace assembly at ingest, with economics that mean you’ll always have the data you need for incident troubleshooting, historical analysis, or compliance use cases.

Check out our whitepaper if you’d like to learn more about the advantages of our architecture over that of legacy products. Come take Observe for a spin yourself with a free trial!