Observability at Observe: Architecture And Monitoring Cases

By Daniel Odievich | March 5, 2024

Introduction

Observe provides our highly scalable and available Observability Cloud to customers worldwide, allowing teams to build, deploy, run, and monitor their software at Internet scale. Observe’s charter is to give our users useful insights from their seemingly disparate and never-ending mounds of machine and user data.

To run this platform at scale, Observe uses its own offering to monitor itself. We call that “Observe on Observe” or O2 for short. Self-hosting our stack and validating everything makes a big difference to the quality of our customer offering. We last wrote about O2 in 2022, parts one, two, and three; I’m excited to write about what’s changed!

This series of articles will describe how our O2 tenant is architected, how our engineers use it for self-observation, what product managers do to gain insights into customer behavior, and how we optimize it for speed and efficiency.

In this first part of the series, we describe the Observe architecture and walk through the observability use cases we rely on the O2 tenant for.

In the second part of the series, we’ll explain how Observe combines performance and business data into an interactive, connected map that we call the Data Graph, and how our engineers and product managers use it to manage performance, features, and costs.

In the third part of the series, we will explore some of the more technical aspects of keeping Observe running smoothly, including tracking user activity, monitoring our service provider Snowflake, ensuring the health of data transformations, and other aspects of engineering and business observability.

In the fourth and final part, we’ll describe recent optimizations to the Observe on Observe environment that improve platform performance and reduce operational expenditures.

Observe Architecture in Brief

Observe consolidates siloed data into a single, low-cost Data Lake. It then transforms that data into a graph of connected datasets so users can easily find relevant context during an investigation. For a high-level introduction, take a look at our architecture.

Observe offers a variety of endpoints covering the most common formats for sending in observability data. The data is routed through a data buffering system (Kafka) before being micro-batched and loaded into Snowflake. This scalable component is described in detail in Observability Scale: Scaling Ingest to One Petabyte Per Day.
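To give a feel for the sending side, here is a minimal, hypothetical sketch of pushing a single JSON event to an HTTP ingest endpoint; the hostname, path, and token format are placeholders rather than the exact Observe API contract, which is documented separately:

```python
# A minimal, hypothetical sketch of pushing one JSON event to an HTTP
# ingest endpoint. The hostname, path, and token below are placeholders,
# not the exact Observe API contract.
import json
import urllib.request

INGEST_URL = "https://example.collect.observeinc.com/v1/http/myapp"  # placeholder
INGEST_TOKEN = "YOUR_INGEST_TOKEN"  # placeholder

event = {"service": "checkout", "level": "info", "message": "order placed"}

request = urllib.request.Request(
    INGEST_URL,
    data=json.dumps(event).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {INGEST_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status)  # expect 200 on success
```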

Once the data is in Snowflake, the “schema-on-demand” data acceleration process transforms the stream of incoming events into the desired format, as defined by prebuilt Observe-supplied apps or by the customers themselves.

Observe provides prebuilt apps for AWS, GCP, Azure, Kubernetes, and many other sources, and offers custom configuration-as-code using Terraform.

Observe Data Acceleration Process

Monitors continuously watch these streams of incoming data and alert customers to any issues via email or webhooks to external channels such as Slack and PagerDuty.
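As a sketch of what such a webhook delivery might look like when a monitor fires, here is a minimal example that posts an alert summary to a Slack incoming webhook; the alert fields and URL are hypothetical, while the Slack webhook contract itself (a JSON POST with a "text" field) is standard:

```python
# A minimal sketch of the kind of webhook delivery a monitor performs when
# it fires. The alert fields and webhook URL are hypothetical; the Slack
# incoming-webhook contract (POST JSON with a "text" field) is real.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

alert = {  # hypothetical monitor state
    "monitor": "ingest-latency-p99",
    "state": "triggered",
    "value": "2.3s",
}

payload = {"text": f"Alert: {alert['monitor']} is {alert['state']} at {alert['value']}"}

request = urllib.request.Request(
    SLACK_WEBHOOK_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(request)
```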

A rich browser-based user interface lets users search through common observability artifacts like logs, metrics, and traces. On top of that, it supports displaying and exploring highly connected business data in myriad ways, and makes it easy to build rich dashboards and sophisticated monitors to suit customers’ needs.

For integrations both into and out of the platform, Observe offers pollers, data sharing, and a rich API, including data export and metadata access.

All Observe components are containerized and hosted in a Kubernetes container orchestration platform on the public cloud.

The Observe storage layer is Snowflake, a highly reliable and scalable cloud data warehouse. 

Observe maintains multiple deployments worldwide, co-locating with the most common customer Snowflake deployment locations. 

Observe offers both a managed service and a Snowflake Connected App deployment model.

For more details on the Observe architecture, including deeper dives into the application components, more details on acceleration contents and how we use Snowflake for data management, check out our “How Observe Uses Snowflake to Deliver Observability Cloud” series of articles:

  • Part 1 for overview and ingestion descriptions
  • Part 2 for data modeling and data acceleration
  • Part 3 for resource management and Snowflake

Observe on Observe Use Cases

Observe uses its own services to monitor and manage the platform that we provide to customers (AKA “drinking your own champagne”). All telemetry emitted by Observe components goes into this Observe on Observe/O2 platform.

O2 is then used for many things, including these key aspects:

  • DevOps-style infrastructure and application monitoring and analytics, providing a top-to-bottom view into the health and performance of our global deployments
  • A data lake of both business and infrastructure data, including customer usage data, allowing for real-time analytics
  • A development playground at scale, where new features are built, tested, and validated before heading to wider customer adoption

Infrastructure, Application and DevOps Monitoring

All Observe components incorporate rich structured logging, emit infrastructure and business metrics, and create detailed distributed traces in the OpenTelemetry standard, including real user monitoring from the Observe web-based UI.
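To illustrate what such instrumentation looks like in code (Observe’s services are not necessarily written in Python; this is just a generic sketch using the standard OpenTelemetry Python SDK), a component might wrap a unit of work in a span and attach attributes to it:

```python
# An illustrative OpenTelemetry instrumentation sketch. The component and
# attribute names are hypothetical; the SDK calls are the standard
# opentelemetry-sdk API.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire up a tracer that prints finished spans to stdout; a real service
# would use an OTLP exporter pointed at its collector instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("ingest-pipeline")  # hypothetical component name

with tracer.start_as_current_span("load_micro_batch") as span:
    span.set_attribute("batch.records", 1024)  # hypothetical attribute
    # ... do the actual work here ...
```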

Observe has custom pollers for our service providers, such as the AWS cloud and the Snowflake database, so we can stay ahead of any performance degradation.

Observe is containerized and run by Kubernetes in the AWS cloud. The Observe Kubernetes monitoring stack sends all relevant metrics, events, and logs to O2, and Observe AWS monitoring sends useful metrics and logs for the public cloud components we use, such as AWS EKS, S3, and RDS.

For Snowflake, we constantly poll the INFORMATION_SCHEMA QUERY_HISTORY table function to monitor queries that are still running. Observe also queries Snowflake for the profiles of a carefully selected portion of completed queries, using the returned data to dynamically adjust ongoing data acceleration and ad-hoc query activities. In this way we stay ahead of the lag in historical ACCOUNT_USAGE views such as QUERY_HISTORY and WAREHOUSE_METERING_HISTORY, although we use those too for both monitoring and historical investigations.
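A minimal sketch of that kind of running-query poll, using the snowflake-connector-python package, might look like the following; the connection parameters are placeholders, and the real O2 poller is certainly more elaborate:

```python
# A minimal sketch of polling Snowflake's INFORMATION_SCHEMA for queries
# that are still running. Connection parameters are placeholders; a real
# poller adds scheduling, retries, and result routing.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",    # placeholder
    user="o2_poller",        # placeholder
    password="...",          # placeholder
    warehouse="MONITOR_WH",  # placeholder
    database="MY_DB",        # placeholder; INFORMATION_SCHEMA is per-database
)

RUNNING_QUERIES_SQL = """
SELECT query_id, warehouse_name, execution_status, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
WHERE execution_status = 'RUNNING'
"""

cur = conn.cursor()
try:
    cur.execute(RUNNING_QUERIES_SQL)
    for query_id, warehouse, status, elapsed_ms in cur:
        print(f"{query_id} on {warehouse}: {status} for {elapsed_ms} ms")
finally:
    cur.close()
    conn.close()
```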

Once in O2, this data provides rich analytics over various components in the system for our DevOps engineers, including views such as:

  • Understanding machine and process performance (CPU, memory usage, network throughput)
  • Performance of Kubernetes clusters, nodes, pods and their interplay
  • Kafka configuration and performance (memory consumption, CPU usage, bytes in/out, topic throughput)
  • Observe component performance (memory, app metrics, transactions throughput)
  • Snowflake performance (virtual warehouse utilization, free pool management, workload packing, clustering jobs, index maintenance)
  • Amazon cloud performance (databases, storage, networking, security)
  • Software deployment events (source control activities, binary builds, image packaging, code signing, releases)

For example, the “kubernetes/Go Collector Metrics” dashboard provides infrastructure and memory metrics crucial to the performance of a key ingestion pipeline component:

Observe Collector Metrics

Data Lake for Business Data

The three well-known pillars of observability (metrics, traces, and logs) provide the raw materials that we send to O2 to enrich and connect. Snowflake management views and data polled from our own metadata provide more building materials. Even our public documentation website sends information on its use, such as page views and search terms.

From all of this, O2 builds the data model that describes the major features of our platform, tracks customers, and derives many business metrics related to the uptime and usage of our platform. We send the streams of various events to O2, clean and enrich the data, then link it all together into a Data Graph, which consists of Datasets curated from the Data Lake to make it easier to navigate and faster to query. Datasets represent “things” that users want to ask questions about. They can be business-related, such as customers and shopping carts, or infrastructure-related, such as pods, containers, and S3 buckets.
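To make the idea concrete, here is a purely conceptual sketch, not Observe’s actual implementation, of datasets as nodes joined by link definitions; all names in it are hypothetical:

```python
# A conceptual sketch (not Observe's actual implementation) of the Data
# Graph idea: curated datasets are nodes, and link definitions relate a
# field in one dataset to the key of another, so an investigation can hop
# from "thing" to "thing". All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str                 # e.g. "observe/Observe User"
    key: str                  # primary key field, e.g. "user_id"
    links: dict[str, "Dataset"] = field(default_factory=dict)

users = Dataset("observe/Observe User", key="user_id")
dashboards = Dataset("observe/Dashboard", key="dashboard_id")

# A dashboard row carries the id of the user who created it; linking that
# field to the user dataset lets navigation pivot from any dashboard
# straight to its owner, and from there to everything else linked to users.
dashboards.links["created_by"] = users
```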

Here is an example of the relationships we build around the observe/Observe User dataset, representing a single user in our platform, and linked to hundreds of other logical entities in the platform:

Observe User Resource Lineage

This business-focused data lake allows rich exploration of customer behavior, helps us keep customer performance at optimal levels, ensures the most efficient use of resources, and generally informs the development of our platform.

For example, the “Metric Usage Analysis” dashboard brings together metric time series, dataset, and usage data for a single customer:

Observe Metric Usage Analysis

Development Playground at Scale

O2 also represents a rich development playground where all new features are evaluated for fit and scale. 

The features begin their life in individual engineering environments and get validated in various test and canary deployments before being exposed to the wide world. 

They all get large-scale exercise in O2, where test drives by test engineers, product managers, field-facing architects, and sales engineers provide real-world validation of the features under development. Many features go through multiple rounds of feedback and refinement before they are allowed to proceed to customer acceptance.

For example, our recent work on reimagining how to display OpenTelemetry traces, which culminated in Trace Explorer, was deployed to O2 as its first customer and rapidly iterated on with input from backend engineers who used it to keep the platform running. Their honest and demanding feedback was quickly incorporated to deliver the best product to customers.

In another example, the OPAL Copilot feature is used by our customers to help build datasets and dashboards and to understand their data. The “O11Y GPT and OPAL Complete” dashboards were built by the feature’s engineers to track current activity and improve the models:

Observe O11yGPT Dashboard

Conclusion

Managed solutions live and die by their quality and uptime. Quality and uptime are directly related to what you can measure and how you manage it. To measure something you have to observe it. 

Observe provides a highly available, immensely configurable, easy-to-use observability solution to our customers. Self-usage of the product gives product engineering and product management great visibility into technical and product usage details. At the same time, it contains a wealth of useful data on customer usage for our field engineering and sales teams.

The Observe O2 environment acts as a test bed for our innovation, resulting in direct and rapid improvements of our service to our swiftly growing customer base.

Up Next

Soon we’ll post the second part of this series, where we will explain how Observe combines performance and business data into an interactive, connected map that we call the Data Graph, and how our engineers and product managers use it to manage performance, features, and costs.

Want to try Observe yourself? Come on down!