The observability tax: why your Datadog-class bill is often an order of magnitude above self-host, with receipts

This is an opinionated cost teardown, not a vendor hit piece. We name the mechanisms that inflate hosted observability bills: custom metrics, indexed logs, APM cardinality, and seat-adjacent pricing. Then we show what moves when you own the stack on AWS.

A Datadog-class bill is rarely mysterious once you map it to units: indexed log gigabytes, custom metric time series, APM spans and trace retention, RUM sessions, and seats for anyone who needs more than a screenshot. Hosted vendors price those units at retail. On AWS with Grafana Mimir, Loki, Tempo, and OpenTelemetry, you still pay for bytes and CPU, but you are not buying margin-stacked SKUs per dimension. In assessments we run for enterprises spending $500K–$5M/year on observability SaaS, the same decision-quality signals (alerts, SLO burn, incident timelines) often land around 5×–15× lower annual cash outlay once ingestion discipline and retention match what Finance thought they were buying. The spread is not "open source is free." It is that SaaS bills compound on dimensions teams optimize for velocity, not for unit economics.

The 12× mental model is not a guarantee

We are not pasting a vendor invoice. We are naming the line items that show up on every enterprise assessment we do: log indexing volume, metric cardinality after "one dashboard per team," trace ingestion without sampling discipline, and cloud integrations that create silent cardinality multipliers. When we model a governed OSS stack on EKS for the same operational questions, we repeatedly land near an order of magnitude lower annual cost, with engineering time counted honestly in the model. Your multiplier will differ if you ship high-cardinality labels to APM, if Legal insists on multi-year hot retention for everything, or if you need a vendor SOC attestation on day one.

Four bill multipliers (and what moves on AWS)

1. Indexed logs. SaaS log products often charge heavily for ingest and indexing. Teams that log structured JSON at info level in hot paths can burn terabytes per month without noticing. Loki on S3 with sane retention tiers and query patterns that use labels (not full-text on everything) moves the cost curve toward object storage plus query compute, not per-gigabyte indexing rent.

2. Custom metrics and tag cardinality. Every user_id, request_id, or tenant_id on a metric that was meant to be a counter becomes thousands or millions of active series. Mimir and Prometheus make that pain visible in scrape configs and TSDB head size. SaaS makes it visible on the invoice, often after the fact. The fix is the same in both worlds: drop high-cardinality labels from metrics, aggregate at the edge, or push those dimensions to traces/logs with sampling.

3. APM and traces. Unsampled tail traffic at tens of thousands of spans per second is a money fire in any backend. Hosted APM prices that fire per ingested span. Self-hosted Tempo still needs object storage, compaction, and query nodes; the difference is you can cap blast radius with collector-side tail sampling and retention without renegotiating a contract.

4. Seats and read paths. "Everyone gets a login" is a political cost, not a technical one. Grafana OSS/Enterprise in your VPC can still need license planning, but you are not paying per seat for a senior engineer to run a read-only PromQL query during an outage.

Receipts in spirit: a tiered comparison table

Numbers below are illustrative, rounded from several anonymized migrations (EU and US SaaS spend, USD, annual). They are useful for order-of-magnitude sanity checks, not for your CFO's budget lock without your own inventory.

Cost driver	Typical SaaS-heavy posture	Governed self-host on AWS (same rough signal quality)	What actually changed
Logs (2–5 TB/day hot-ish query)	High (indexed volume + retention)	~5×–10× lower storage+query cash	Shorter hot retention, Loki + S3, aggressive drop rules, fewer duplicate pipelines
Metrics (10M–40M active series)	Very high (custom metrics + tags)	~3×–8× lower	Cardinality budgets, dropped labels, fewer "metric as log line" patterns
APM (~20k–80k spans/s sustained)	High per span	~4×–12× lower	OTel tail sampling, shorter trace retention, fewer auto-instrumented chatty libraries
People (platform + on-call)	Bundled into "managed" story	+0.3–1.0 FTE equivalent Year 1	Honest: you pay engineers instead of margin

The last row matters. If you model self-host as "only AWS bills," you will lie to yourself and lose the migration in month nine when upgrades and cardinality incidents land on a team that does not exist.

One concrete control: tail sampling in the OpenTelemetry Collector

This is not the only knob, but it is the one that prevents trace spend from tracking every health check and static asset request.

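processors:
  tail_sampling:
    # Time to wait after a trace's first span before making the sampling decision.
    decision_wait: 10s
    policies:
      # Keep every trace that contains an error span.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep every trace slower than 800 ms; align this threshold with your latency SLO.
      - name: slow
        type: latency
        latency:
          threshold_ms: 800
      # Keep a random 5% of traces regardless of outcome.
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
service:
  pipelines:
    traces:
      # The otlp receiver, batch processor, and otlp/tempo exporter are assumed to be defined elsewhere in this config.
      receivers: [otlp]
      # Sample first, then batch what survives. Tail sampling also needs every span of a trace on the same collector instance.
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]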

If you ship this without ERROR and latency policies that match your SLOs, you will sample away the evidence you need for the next postmortem. The tax you avoided becomes an availability tax.
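To sanity-check what a policy set like this does to trace volume, a rough estimate helps before anything reaches production. The sketch below assumes a trace is kept when any one policy matches, and that the three conditions are statistically independent. Real traffic usually violates that (errors are often slow), so read the result as an approximation. The function name and the traffic mix are hypothetical.

```python
def kept_fraction(error_rate: float, slow_rate: float, baseline_pct: float) -> float:
    """Estimate the fraction of traces kept when any of three policies can keep a trace.

    Assumption: the error, latency, and probabilistic conditions are independent.
    """
    p_base = baseline_pct / 100.0
    # A trace is dropped only if every policy declines it.
    dropped = (1.0 - error_rate) * (1.0 - slow_rate) * (1.0 - p_base)
    return 1.0 - dropped


# Hypothetical traffic mix: 1% of traces contain an error, 3% exceed the latency threshold.
fraction = kept_fraction(error_rate=0.01, slow_rate=0.03, baseline_pct=5)
print(f"estimated kept fraction: {fraction:.1%}")
```

With these made-up inputs, a bit under 9% of traces survive. If your real error or slow-trace rate is much higher, the probabilistic baseline stops being the dominant term and the threshold choices drive the bill.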

Architecture shape (one paragraph, no fairy dust)

In the stacks we ship, OpenTelemetry Collector (often DaemonSet + Gateway) fans out: metrics to Mimir (or Prometheus feeding remote write), logs to Loki, traces to Tempo, with S3 lifecycle policies as the economic spine. Grafana is the read path. AWS EKS, ALB, and KMS show up on the invoice too. None of that is free. It is usually predictable, which is what Finance actually wants after three years of "why did March spike?"

The silent multiplier: integrations you forgot you turned on

Cloud and Kubernetes integrations look like "free visibility" until they attach extra dimensions to every metric: pod name, container id, node id, availability zone, revision hash. Some of these are high-cardinality on their own; others, like availability zone, have only a handful of values but still multiply every series they touch. Each dimension is defensible in isolation. Together they are how a medium-sized cluster walks into tens of millions of active series without anyone owning a budget line for "labels."

SaaS products monetize that explosion directly. Self-hosted Mimir still has to compact and query those series; the difference is you see the blast radius in TSDB growth and S3 block churn before the vendor sales cycle starts. The operational fix is boring: allow lists on labels at the collector, relabel_configs on scrape jobs, and a rule that new labels require a platform review the same way a new RDS instance would.
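The multiplication is easy to underestimate, so a back-of-envelope bound is worth having in a label review. This sketch computes a worst-case series count for a single metric name as the product of distinct values per label. It is an upper bound, since not every combination occurs in practice, and every count below is hypothetical.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical application-level labels on one request counter.
app_labels = {"service": 40, "endpoint": 25, "status_code": 5}
base = worst_case_series(app_labels)

# The same metric after an integration attaches infrastructure labels.
# "pod" here means pods per service, an assumption made to keep the bound meaningful.
with_infra = worst_case_series({**app_labels, "availability_zone": 3, "pod": 12})

print(base, with_infra, with_infra // base)
```

Two innocuous-looking labels turn a 5,000-series bound into 180,000 for one metric name. Repeat that across a few hundred metric names and the tens-of-millions figure stops being surprising.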

What actually lands on the AWS bill (so nobody confuses capex with opex)

Rough buckets we model in migrations, not as prices (those move by region and commitment) but as line items you cannot hand-wave away:

EKS control plane and worker EC2 (or Fargate if you chose that tradeoff): steady-state CPU for ingesters, distributors, compactors, queriers, and the collector fleet.
S3 storage and requests: the real long-term cost for Mimir blocks, Loki chunks, Tempo blocks once trace volume is real.
EBS for anything you keep hot on disk: WALs, local TSDB head, compactor scratch; this is where under-provisioned IOPS shows up as pager noise.
Data transfer across AZs and out to the internet if you mirror telemetry to a security tool or a second region.
KMS calls if you encrypt everything per object and forget request rates.

None of that is a moral victory over SaaS. It is a different contract: you trade unit-price opacity for capacity planning. Teams that win the cost argument bring the same rigor they use for data warehouses: growth curves, headroom, and quarterly reviews of the top ten cardinality offenders.
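A growth curve with headroom does not need a planning tool to start. The sketch below compounds a starting demand figure month over month and adds a fixed headroom fraction. The starting series count, growth rate, and headroom are placeholder assumptions, not benchmarks.

```python
def project_capacity(start: float, monthly_growth: float, months: int, headroom: float) -> list[float]:
    """Capacity to provision at month 0..months: compounded demand plus fixed headroom."""
    return [
        start * (1 + monthly_growth) ** month * (1 + headroom)
        for month in range(months + 1)
    ]


# Hypothetical: 20M active series today, 5% monthly growth, 30% headroom.
plan = project_capacity(start=20_000_000, monthly_growth=0.05, months=12, headroom=0.30)
print(f"now: {plan[0]:,.0f} series, in 12 months: {plan[-1]:,.0f} series")
```

The same function works for ingest GB/day or spans per second. The point of the quarterly review is to replace the assumed growth rate with the measured one.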

One more concrete artifact: drop a label before it hits the TSDB

Prometheus-style scraping can strip labels at ingest. The pattern is repetitive, which is why we centralize it in OpenTelemetry when we can, but many shops still have legacy ServiceMonitor flows.

metric_relabel_configs:
  # labeldrop removes the label itself; "drop" with source_labels would discard every series carrying it.
  - regex: pod_template_hash
    action: labeldrop

If you drop the wrong label, you lose drill-down from service to pod during an incident. If you drop nothing, you fund a cardinality museum. The adult version of this snippet is a list agreed with service owners, not a platform team playing whack-a-mole in prod on Friday.
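Before agreeing that list with service owners, it helps to measure what stripping a given label would actually save. This sketch counts distinct label sets in an exported series list, before and after removing a key. The series data and the function name are made up for illustration; in practice the input would come from your own series export.

```python
def distinct_series(series: list[dict[str, str]], drop: tuple[str, ...] = ()) -> int:
    """Count distinct label sets after removing the label keys listed in `drop`."""
    return len({
        frozenset((key, value) for key, value in labels.items() if key not in drop)
        for labels in series
    })


# Hypothetical export: one counter, one service, three rollouts, two status codes.
series = [
    {"__name__": "http_requests_total", "service": "checkout",
     "pod_template_hash": template_hash, "status": status}
    for template_hash in ("6d4f", "7a91", "8c02")
    for status in ("200", "500")
]

before = distinct_series(series)
after = distinct_series(series, drop=("pod_template_hash",))
print(f"{before} series before, {after} after")
```

A large collapse is the saving, but it is also a warning: if the collapsed series are live in the same scrape at the same time, they are no longer uniquely labeled, which is exactly the case to discuss with the owning team before the rule ships.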

How we talk to Finance without sounding like we are selling religion

The steering committee does not care about LGTM as an acronym. They care whether incident MTTR and audit retention survive the move. We translate the observability tax into three slides max: current annual vendor spend by line item, projected AWS run rate with pessimistic growth on bytes and series, and Year-1 engineering cost for build, cutover, and on-call coverage. If slide three is missing, we stop the meeting. A 12× headline without labor is how credibility dies.
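The gap between a headline multiplier and a labor-loaded one is simple arithmetic, and it is worth putting on the slide. A minimal sketch with entirely hypothetical figures follows. The class and field names are illustrative and map to the three slides described above.

```python
from dataclasses import dataclass


@dataclass
class Year1Model:
    vendor_annual: float      # slide 1: current annual vendor spend
    aws_annual: float         # slide 2: projected AWS run rate
    engineering_year1: float  # slide 3: build, cutover, and on-call cost

    def headline_multiplier(self) -> float:
        """Vendor spend divided by infrastructure alone (the number that kills credibility)."""
        return self.vendor_annual / self.aws_annual

    def loaded_multiplier(self) -> float:
        """Vendor spend divided by infrastructure plus Year-1 engineering."""
        return self.vendor_annual / (self.aws_annual + self.engineering_year1)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
model = Year1Model(vendor_annual=2_400_000, aws_annual=200_000, engineering_year1=200_000)
print(f"headline: {model.headline_multiplier():.0f}x, with labor: {model.loaded_multiplier():.0f}x")
```

In this made-up case the 12× headline becomes 6× once labor is counted. That is still a strong case, and it is one that survives the first question from Finance.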

When this framing is wrong

If you have no platform team, no appetite for upgrades, and no governance on instrumentation, you will spend the savings on incidents. The comparison is unfair if you pit "fully managed SaaS with a vendor NOC narrative" against "three YAML files and a dream."

Self-host also loses when regulatory or procurement demands a specific SaaS control attestation on a timeline you cannot meet in-house, or when your workload is tiny and the fixed cost of even a small EKS footprint dominates.

We have also seen migrations stall when query language lock-in was deeper than expected: saved searches, monitors, and embedded dashboards that were someone's entire job. The bill drops only after that work is scheduled, not because Grafana opens a portal labeled "savings."

Where to go next

If you want defensible numbers for your own shop, start from invoice line items and cardinality reports, not from a preferred dashboard vendor. Export active series counts, top label keys, ingest GB/day by service, and spans per second before you argue about 12× in a steering committee.

For a blunt review of observability spend against an AWS + LGTM + OTel target architecture, Etalon starts from invoices and cardinality, not from a slide template.