DevOps · 11 min read

Observability for SaaS: metrics, logs, traces, and the tools that matter

A working guide to the three pillars of observability for SaaS — metrics, logs, and traces — covering Prometheus, Datadog, Loki, Honeycomb, OpenTelemetry, sampling, cardinality, and SLOs.

Observability has quietly become one of the largest line items on a SaaS infrastructure bill. Teams start with a generous free tier, add custom metrics to debug one production issue, and eighteen months later they're paying a $12,000 monthly invoice to a vendor they chose in an afternoon. The three pillars — metrics, logs, and traces — each have their own failure modes, their own mature tooling, and their own cost levers. This post walks through the stack worth running in 2026, the decisions that actually control spend, and the instrumentation patterns that make debugging production issues tractable.

The three pillars, briefly

Metrics are numeric time series: request counts, latencies, error rates, queue depths. They're cheap to store, fast to query, and best for alerting and dashboards. Logs are unstructured or semi-structured events: one line per thing that happened, usually with context. They're expensive at scale and best for forensic work — answering 'what did user 471 do between 2:03 and 2:07?'. Traces are causally linked spans across services: a request enters the API, fans out to a cache check, a database query, two downstream services, and returns. They're the right answer to 'why was this single request slow?' and they're where distributed systems debugging actually lives.

A healthy SaaS runs all three. Metrics answer 'is something broken?', traces answer 'what's broken?', and logs answer 'what exactly happened?'. Teams that skip any of the three eventually pay for it in an incident they can't diagnose.

The tools that matter in 2026

| Pillar  | Open source / self-host          | Managed mid-market            | Managed enterprise     |
|---------|----------------------------------|-------------------------------|------------------------|
| Metrics | Prometheus + Grafana             | Grafana Cloud, New Relic      | Datadog APM            |
| Logs    | Loki, Elasticsearch, Vector      | Axiom, Logtail (Better Stack) | Datadog Logs, Splunk   |
| Traces  | Tempo, Jaeger (via OTEL)         | Honeycomb, Grafana Cloud      | Datadog APM, Lightstep |
| Unified | Grafana stack (Mimir+Loki+Tempo) | Grafana Cloud, SigNoz         | Datadog, New Relic One |

The honest short take: Datadog is the best single-pane experience money can buy, and the bill scales with that quality. Grafana Cloud is the strongest mid-market choice and its consumption pricing is easier to model. Honeycomb remains the best-in-class option for traces and event-based debugging. Self-hosted Prometheus, Loki, and Tempo are genuinely viable for teams with SRE capacity and a serious cost ceiling — they stop being viable when the team has two backend engineers and no one who wants to own the observability stack.

Tool choice matters less than instrumentation quality. A well-instrumented app on Prometheus + Grafana debugs incidents faster than a poorly instrumented one on Datadog. Start with OpenTelemetry as the collection layer and keep tools swappable.

OpenTelemetry — the collection layer to commit to

OpenTelemetry (OTEL) is the CNCF standard for generating metrics, logs, and traces. In 2026 every major observability vendor consumes OTEL natively. That means instrumenting once and swapping backends later is tractable in a way it wasn't five years ago. For a Node.js service, the auto-instrumentation package covers HTTP, Express/Fastify/Koa, pg, mysql, Redis, and most common libraries without code changes. A minimal setup in a Node.js app looks like this:

// otel.ts — load this before anything else: node --import ./otel.js src/server.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { resourceFromAttributes } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "api",
    [ATTR_SERVICE_VERSION]: process.env.GIT_SHA ?? "dev",
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + "/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + "/v1/metrics",
    }),
    exportIntervalMillis: 15_000,
  }),
  instrumentations: [getNodeAutoInstrumentations({
    // Noisy — disable unless you care about filesystem spans
    "@opentelemetry/instrumentation-fs": { enabled: false },
  })],
});

sdk.start();
process.on("SIGTERM", () => sdk.shutdown().catch(console.error));

Auto-instrumentation gets a team most of the way there in an afternoon. Custom spans for business logic — 'bill customer', 'run background job', 'send webhook' — are worth adding manually. A span per meaningful unit of work makes traces readable and pairs naturally with SLO targets on those same operations.

Cardinality — the metric bill killer

The single biggest mistake teams make with metrics is high-cardinality labels. Every unique combination of label values creates a new time series. Labeling requests with user_id or request_id turns a single metric into millions of series, and the bill scales with series count, not request count.

Never put user IDs, request IDs, full URLs, session tokens, or anything else unbounded in metric labels. Those belong in traces and logs. A single high-cardinality label can 100x a Prometheus bill or crash the scrape target outright. The pattern is: bounded labels on metrics, unbounded context on traces and logs.

Healthy label cardinality looks like: http.method (7 values), http.route (~100 values, templated like /users/:id), http.status_code (~50 values), region (~5 values). That's ~175,000 theoretical combinations, of which maybe 2,000 are hit in practice. Adding user_id to the same metric turns 175,000 into hundreds of millions — and that's the difference between a $200/month bill and a $20,000 one.
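
One way to enforce the bounded-label rule in code is to template raw paths before they ever reach a metric label, so `/users/4821` and `/users/9377` collapse into one series. A hedged sketch — the segment patterns here (numeric IDs, UUIDs) are assumptions; most HTTP frameworks expose the matched route template directly, which is the better source when available:

```typescript
// Collapse unbounded path segments into bounded templates before using
// the path as a metric label. The raw IDs belong in traces and logs.
function templateRoute(path: string): string {
  return path
    .split("/")
    .map((seg) => {
      if (/^\d+$/.test(seg)) return ":id"; // numeric IDs
      if (/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(seg))
        return ":uuid"; // UUIDs
      return seg; // static segments stay as-is
    })
    .join("/");
}

// "/users/4821/orders" -> "/users/:id/orders": one series, not one per user
```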

Sampling for traces, not for metrics

Tracing every request at scale is often unaffordable and usually unnecessary. Done right, sampling cuts trace cost 10–100x while preserving debuggability. Head-based sampling (decide at the start of a trace) is cheap and simple, but can't guarantee errors are kept because the outcome isn't known yet. Tail-based sampling (decide after the trace completes) is smarter — it keeps 100% of errors and slow requests plus a percentage of normal ones — but requires an OTEL Collector in front to buffer spans and make the decision.

  • Always keep: all traces with an error, all traces above the p95 latency threshold, all traces for a handful of critical business flows (payment, signup, checkout).
  • Sample: 1% or 0.1% of normal successful traffic. Enough to see patterns; not enough to hurt the bill.
  • Never sample metrics. They're aggregate by construction; sampling breaks the math. Sample traces and logs; aggregate metrics.
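
The tail-sampling policy above can be sketched as a single decision function over a completed trace. In production this logic lives in the OTEL Collector's tail-sampling processor, not application code; the threshold, rate, and flow names below are illustrative assumptions:

```typescript
// Tail-sampling decision over a completed trace. In production this
// policy runs in an OTEL Collector; values here are illustrative.
type CompletedTrace = {
  hasError: boolean;
  durationMs: number;
  rootOperation: string;
};

const CRITICAL_FLOWS = new Set(["payment", "signup", "checkout"]); // assumed flow names
const P95_THRESHOLD_MS = 800; // assumed latency cutoff
const BASELINE_RATE = 0.01;   // keep 1% of normal successful traffic

function shouldKeep(
  trace: CompletedTrace,
  random: () => number = Math.random, // injectable for testing
): boolean {
  if (trace.hasError) return true;                          // 100% of errors
  if (trace.durationMs > P95_THRESHOLD_MS) return true;     // all slow requests
  if (CRITICAL_FLOWS.has(trace.rootOperation)) return true; // critical business flows
  return random() < BASELINE_RATE;                          // sample the rest
}
```

Making the random source injectable keeps the policy deterministic under test while staying probabilistic in production.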

SLOs — the one dashboard the team actually uses

Service level objectives (SLOs) are the discipline that makes observability useful to the business. An SLO expresses a target — 'p99 checkout latency under 800ms over rolling 30 days' — and an error budget derived from it. Every team that's instituted SLOs well reports the same thing: priority conversations stop being anecdote-driven. The error budget either has room for a risky deploy or it doesn't.

Three SLOs cover most SaaS products: availability (successful request rate), latency (p95 or p99 response time for critical endpoints), and correctness (background jobs succeeding, webhooks delivered within a window). Anything more granular is usually debugging, not a business target. Pyrra and Sloth are solid open-source generators if running Prometheus; Datadog, Honeycomb, and Grafana Cloud all have first-class SLO UIs.
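
The error-budget arithmetic behind an SLO is simple enough to sketch directly — the budget is the fraction of the window the target allows to fail. A minimal example (function name and targets are illustrative, not from any particular SLO tool):

```typescript
// Error budget: the slice of the rolling window a service may fail
// while still meeting its SLO target.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// 99.9% over 30 days -> 43.2 minutes of allowed failure
// 99%   over 30 days -> 7.2 hours — a very different on-call posture
```

That gap between two and three nines is why the target itself is a business decision, not an engineering default.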

Log cost control in two patterns

Log spend is where the bill creeps in unnoticed. A chatty debug log left on in production, an error that fires 10,000 times an hour, a third-party SDK that writes a line per HTTP call — any of those can double a monthly bill quietly. Two patterns keep it under control.

  1. Leveled logging with environment-aware defaults. Production runs at info, staging at debug, local at trace. Never ship debug-level logs to production for more than the length of an incident investigation.
  2. Sampled error logs. When the same error fires repeatedly, log the first occurrence with full context and increment a counter metric for the rest. The metric captures the pattern; the log line captures the shape. This alone often cuts log volume 5–10x on noisy services.
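
The second pattern fits in a few lines. A hedged sketch — `emitLog` and `incrementCounter` are hypothetical stand-ins for the real logger and metrics client:

```typescript
// Sampled error logging: full log line for the first occurrence of each
// error key, a cheap counter increment for every occurrence after that.
const seenErrors = new Map<string, number>();
const emittedLogs: string[] = [];
const counters = new Map<string, number>();

// Stand-ins for a real logger and metrics client.
function emitLog(line: string): void { emittedLogs.push(line); }
function incrementCounter(name: string): void {
  counters.set(name, (counters.get(name) ?? 0) + 1);
}

function logErrorSampled(key: string, detail: string): void {
  const count = (seenErrors.get(key) ?? 0) + 1;
  seenErrors.set(key, count);
  if (count === 1) {
    emitLog(`[error] ${key}: ${detail}`); // first occurrence: full context
  }
  incrementCounter(`errors.${key}`);      // every occurrence: the pattern
}

// 10,000 firings of the same error -> one log line, counter at 10,000.
for (let i = 0; i < 10_000; i++) {
  logErrorSampled("db-timeout", "query exceeded 5s on orders table");
}
```

A production version would also reset keys periodically (say, hourly) so a recurring error resurfaces in the logs instead of being counted forever.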

A concrete stack for different stages

Rough guidance based on what's served our client projects well:

  • Early stage (0–$50K MRR): Grafana Cloud free tier, OTEL auto-instrumentation, Better Stack or Axiom for logs. Monthly bill under $50.
  • Growth (up to $500K MRR): Grafana Cloud paid tier or Honeycomb for traces, Axiom or Datadog Logs, full OTEL setup with an OTEL Collector. Budget $500–2,000/month.
  • Scale ($1M+ MRR): Datadog or mature self-hosted Grafana stack, dedicated SRE or platform engineer, formal SLOs with error budget policies. Budget scales with infrastructure, usually 5–10% of infra cost.

Pick the tool the on-call engineer can actually use at 3am. Fast query performance, readable dashboards, and a UI that loads quickly matter more than feature checklists. Every observability platform feels fine on a good day; the differences show up during incidents.

Key takeaways

  • Run all three pillars. Metrics for alerts, traces for debugging, logs for forensics — each has gaps the others fill.
  • Commit to OpenTelemetry as the collection layer. It's the only realistic path to tool portability in 2026.
  • Metric cardinality is the number one cost lever. Never label metrics with user IDs, request IDs, or anything unbounded.
  • Sample traces and logs aggressively; never sample metrics. Keep 100% of errors and slow traces, a small percentage of the rest.
  • SLOs convert observability into business signal. Three SLOs (availability, latency, correctness) cover most SaaS products.
  • The tool that works at 3am for the on-call engineer is the right tool. Nothing else in the checklist matters if that's not true.
#observability #opentelemetry #prometheus #datadog #honeycomb #slos #devops