How Skyscanner Runs OpenTelemetry Across 24 Clusters — And What It Means for Your Self-Hosted Stack

Skyscanner published a detailed post this week on how they manage OpenTelemetry Collectors across 24 production Kubernetes clusters. The architecture decisions they made — and the ones they avoided — are a useful stress test for any team planning a self-hosted observability stack on AWS. Here is what the design actually implies, where it breaks down at scale, and how we would adapt it for a Grafana/Loki/Tempo/Mimir backend.

What Skyscanner Actually Built

The short version: one OTel Collector DaemonSet per cluster, a second tier of gateway Collectors that aggregate and route, and a centralized configuration management layer that pushes Collector configs without requiring a redeploy. They are running this across 24 clusters, which means the configuration drift problem is real — a single misconfigured pipeline processor can silently drop spans or duplicate metrics across a significant fraction of their fleet.

The detail worth paying attention to is their approach to config distribution. Rather than baking Collector config into a Helm values file and letting GitOps handle it, they built a control plane that can update pipeline configuration at runtime. This is not exotic (the OTel Operator for Kubernetes supports it through its OpAMP Bridge), but most teams do not actually implement it, and the gap between "we have GitOps" and "we can push a config change to 24 clusters in under 60 seconds without a rolling restart" is significant during an incident.

The second detail: they are explicit about the DaemonSet-versus-sidecar tradeoff. DaemonSets mean one Collector process per node, shared across all pods on that node. Sidecars mean one Collector per pod. DaemonSets are cheaper (one process, not N processes) but noisier (a single tenant's high-cardinality metric burst can starve other pods sharing the same node-level Collector). For a company like Skyscanner with relatively homogeneous workloads, DaemonSet is the right call. For a platform team running multi-tenant Kubernetes where one team's service emits 10x the telemetry of everyone else, you will want per-namespace or per-deployment Collectors with resource limits, which means sidecars or a dedicated Collector Deployment per tenant namespace.

Translating This to a Self-Hosted AWS Stack

Let's make this concrete. The Skyscanner post describes the collection and routing layer. What it does not describe is the backend — which is where the real operational cost lives.

For a self-hosted stack on AWS, the full pipeline, from collection through to the backend tier, looks like this:

[App Pods]
    │
    ▼
[OTel Collector DaemonSet] — per EKS node
    │
    ▼
[OTel Collector Gateway] — Deployment, 3-6 replicas, per cluster
    │
    ├──► Mimir (metrics, remote_write / OTLP)
    ├──► Loki (logs, via Loki exporter or native OTLP)
    └──► Tempo (traces, OTLP gRPC)

[Grafana] ── queries ──► Mimir / Loki / Tempo

The gateway tier is where you make routing decisions: which tenant's data goes to which Mimir tenant, how you sample traces before they hit Tempo, whether you drop debug-level logs before they inflate your Loki storage bill.

Here is a representative gateway Collector config for this topology:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 400
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors-and-slow
        type: composite
        composite:
          max_total_spans_per_second: 500
          policy_order: [errors, slow-traces]
          composite_sub_policy:
            - name: errors
              type: status_code
              status_code: {status_codes: [ERROR]}
            - name: slow-traces
              type: latency
              latency: {threshold_ms: 500}
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  prometheusremotewrite:
    endpoint: "http://mimir-distributor.monitoring.svc:9009/api/v1/push"
    headers:
      X-Scope-OrgID: "${env:TENANT_ID}"
  loki:
    endpoint: "http://loki-gateway.monitoring.svc/loki/api/v1/push"
    default_labels_enabled:
      exporter: false
      job: true
  otlp/tempo:
    endpoint: "tempo-distributor.monitoring.svc:4317"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]

A few things worth calling out in this config:

memory_limiter before batch, always. If you put batch first, a slow downstream (Mimir under write pressure, Loki gateway backpressure) will cause the batch processor to hold spans and metrics in memory until the Collector OOMs. With memory_limiter first, the Collector starts refusing new data before it dies, which pushes backpressure up to the DaemonSet tier instead of causing a crash.

tail_sampling on the gateway, not the DaemonSet. Tail sampling requires seeing all spans for a trace before making a keep/drop decision. If you run it on DaemonSets, a single trace can have spans on multiple nodes, so no single DaemonSet sees the full trace. The gateway tier aggregates across nodes, so it can make a correct sampling decision, with one caveat: with 3-6 gateway replicas, all spans of a trace must reach the same replica. A plain Kubernetes Service spreads them arbitrarily, so the DaemonSet tier should route by trace ID, for example with the contrib loadbalancing exporter and routing_key: traceID. This is the main architectural reason the two-tier design exists.

X-Scope-OrgID on the Mimir exporter. Mimir is multi-tenant by default. If you do not set this header, Mimir will either reject the write (if multitenancy_enabled: true) or, with multi-tenancy disabled, store everything under the single built-in anonymous tenant, which makes per-team cost attribution impossible later. Set it at the gateway tier using an environment variable injected from a Kubernetes Secret, not hardcoded.
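
The tail-sampling point has a practical consequence for a multi-replica gateway: spans from one trace have to land on the same replica. The sketch below shows that property with a plain hash-and-modulo mapping. The trace IDs, node names, and replica count are made up for illustration, and hash-modulo is a simplification, not how the Collector's load-balancing exporter is implemented.

```python
import hashlib

def replica_for(trace_id: str, replicas: int) -> int:
    """Map a trace ID to a gateway replica index (stable for a fixed replica count)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % replicas

# Hypothetical spans: (trace_id, node that emitted the span).
spans = [
    ("4bf92f3577b34da6a3ce929d0e0e4736", "node-a"),
    ("4bf92f3577b34da6a3ce929d0e0e4736", "node-b"),
    ("00f067aa0ba902b7a3ce929d0e0e4736", "node-a"),
]

REPLICAS = 4  # assumed gateway replica count
assignments = {}
for trace_id, _node in spans:
    assignments.setdefault(trace_id, set()).add(replica_for(trace_id, REPLICAS))

# True when every trace maps to exactly one replica, whatever node emitted its spans.
single_home = all(len(targets) == 1 for targets in assignments.values())
print(single_home)
```

Because the mapping depends only on the trace ID, the emitting node does not matter. Hash-modulo reshuffles most traces whenever the replica count changes, which is why production routers tend to use consistent hashing instead.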

The Config Management Problem at 24 Clusters

Skyscanner's key operational insight is that config drift across many clusters is a silent failure mode. A processor you added to suppress a noisy metric in cluster 3 gets forgotten, and six months later you are debugging why cluster 3's dashboards look different from cluster 7's.

There are three realistic approaches here, in order of operational complexity:

Approach	Consistency guarantee	Rollout speed	Operational cost
GitOps (ArgoCD / Flux) + Helm	Eventually consistent, depends on sync interval	2–10 min per cluster, sequential or parallel	Low — you already have this
OTel Operator + OpAMP Bridge	Near-real-time push, Operator manages CRD reconciliation	30–90 seconds across all clusters	Medium — Operator per cluster, central OpAMP server
Custom control plane	Whatever you build	Whatever you build	High — you are now maintaining software

For most teams with fewer than 10 clusters, GitOps is the right answer. The sync lag is acceptable, and you get free audit history via Git. For 24 clusters, Skyscanner's investment in a more dynamic config layer starts to make sense — but only if you have already hit the pain of a bad config propagating slowly and causing an incident window that lasted longer than it should have.

We would not recommend jumping to OpAMP before you have felt that pain. Build the GitOps path first, instrument your Collectors to expose otelcol_processor_dropped_metric_points and otelcol_exporter_send_failed_metric_points as Prometheus metrics (they expose these by default on port 8888), and alert on them. That gives you visibility into config-induced data loss without building a control plane.
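
As a rough illustration of what "alert on them" means, here is a minimal sketch that scans Prometheus text-format output, as served by a Collector's metrics endpoint, for non-zero drop and failure counters. The sample payload and its values are invented, and matching by name prefix is an assumption made so the check tolerates naming differences between Collector versions.

```python
def nonzero_counters(metrics_text: str, prefixes: tuple) -> dict:
    """Sum sample values per metric name for names starting with one of the prefixes."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (no labels).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if not name.startswith(prefixes):
            continue
        # The sample value is the first token after the label set, if there is one.
        rest = line[line.rfind("}") + 1:] if "}" in line else line[len(name):]
        totals[name] = totals.get(name, 0.0) + float(rest.split()[0])
    return {name: total for name, total in totals.items() if total > 0}

# Invented sample of what the metrics endpoint might return.
SAMPLE = """\
# TYPE otelcol_exporter_send_failed_metric_points counter
otelcol_exporter_send_failed_metric_points{exporter="prometheusremotewrite"} 42
otelcol_processor_dropped_metric_points{processor="memory_limiter"} 0
otelcol_exporter_sent_metric_points{exporter="prometheusremotewrite"} 90210
"""

WATCHED = (
    "otelcol_processor_dropped_metric_points",
    "otelcol_exporter_send_failed_metric_points",
)
alerts = nonzero_counters(SAMPLE, WATCHED)
print(alerts)
```

These counters are cumulative, so a non-zero total only says that loss happened at some point since the Collector started. A real alert would normally be a Prometheus rule on the rate over a window rather than a one-off scan like this.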

Where This Design Breaks

We want to be direct about the failure modes, because the Skyscanner post is necessarily a success story and does not dwell on them.

Tail sampling memory pressure. The tail_sampling processor holds incomplete traces in memory for decision_wait seconds (10 seconds in the config above). At high trace volume, say 50,000 in-flight traces (the num_traces cap in that config), each with 20 spans averaging 2KB, you are holding ~2GB of trace data in the gateway Collector's heap waiting for a decision. That is already above the 1,500 MiB memory_limiter ceiling in the same config, so the Collector would start refusing data before the sampler's buffer ever filled. If your services have high tail latency (p99 > 10s), you either increase decision_wait and blow your memory budget, or you miss sampling the slow traces you most care about. Tempo does not make tail-sampling decisions itself, so this work cannot be pushed to the backend. The practical options are to scale the gateway tier horizontally with trace-ID-aware routing, or to lower num_traces and accept that some traces are evicted before a decision is made.
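
The memory estimate is plain multiplication, sketched here so you can plug in your own numbers. It uses the figures from the paragraph above and compares the result with limit_mib: 1500 from the gateway config. It counts span payload only, which is an assumption; per-trace bookkeeping and Go runtime overhead would come on top.

```python
def tail_sampling_buffer_bytes(in_flight_traces: int, spans_per_trace: int, bytes_per_span: int) -> int:
    """Payload bytes held while traces wait for a sampling decision."""
    return in_flight_traces * spans_per_trace * bytes_per_span

MIB = 1024 * 1024

# Figures from the text: 50,000 traces, 20 spans each, about 2 KB per span.
buffered = tail_sampling_buffer_bytes(50_000, 20, 2_000)

# memory_limiter ceiling from the gateway config (limit_mib: 1500).
limit_bytes = 1_500 * MIB

exceeds_limit = buffered > limit_bytes
print(f"{buffered / 1e9:.1f} GB buffered vs {limit_bytes / 1e9:.2f} GB limit")
print("exceeds memory_limiter:", exceeds_limit)
```

With these inputs the buffer alone is larger than the limiter allows, so either num_traces has to come down or the gateway's memory limit has to go up before this config can hold a full sampling window.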

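Loki label cardinality from OTLP. Every OTLP resource attribute that ends up as a Loki label multiplies the number of streams. The OTel Collector's Loki exporter promotes only the attributes named in its hint attributes (loki.resource.labels and loki.attribute.labels), so the usual way this goes wrong is a hint list, or a Loki-side OTLP mapping, that includes high-cardinality attributes: k8s.pod.uid, host.id, anything that changes per-pod or per-request. The result is a Loki label cardinality explosion that degrades query performance and inflates your index size. The fix is an explicit, low-cardinality label allowlist at the gateway:

processors:
  resource/loki-labels:
    attributes:
      # Only these resource attributes become Loki labels.
      - action: upsert
        key: loki.resource.labels
        value: service.name, k8s.namespace.name

exporters:
  loki:
    endpoint: "http://loki-gateway.monitoring.svc/loki/api/v1/push"
    default_labels_enabled:
      exporter: false
      job: true

Add resource/loki-labels to the logs pipeline's processors for the allowlist to take effect. For high-cardinality attributes you still want to filter on, use structured metadata instead of labels. This requires Loki 3.0+, and the most direct route is Loki's native OTLP endpoint fed by the Collector's otlphttp exporter, which stores attributes that are not promoted to labels as structured metadata.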

Mimir write path saturation during metric bursts. If a deployment event causes a spike in active time series — a new label value, a new service emitting metrics for the first time — Mimir's ingester component will see a write amplification spike. Mimir's default ingester.ring.replication-factor is 3, meaning every write goes to 3 ingesters. A sudden 10x burst in active series can push ingesters into OOM territory if you have not sized them with headroom. We typically run Mimir ingesters on r6g.2xlarge instances (64GB RAM) with a per-ingester series limit of 1.5M and a cluster-wide limit of 10M, which gives us room to absorb deployment bursts without triggering the circuit breakers.
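
To see how much headroom those limits leave, here is a back-of-the-envelope sketch. It takes the ~2M active series and 6 ingesters from the cost section of this post, and it assumes series are spread evenly across ingesters, which real hash rings only approximate.

```python
def series_per_ingester(active_series: int, replication_factor: int, ingesters: int) -> float:
    """In-memory series each ingester holds, assuming an even spread."""
    return active_series * replication_factor / ingesters

PER_INGESTER_LIMIT = 1_500_000
REPLICATION_FACTOR = 3
INGESTERS = 6

# Steady state: ~2M active series replicated three ways across six ingesters.
steady = series_per_ingester(2_000_000, REPLICATION_FACTOR, INGESTERS)

# How far active series can grow before each ingester reaches its limit.
headroom = PER_INGESTER_LIMIT / steady
max_active_before_limit = PER_INGESTER_LIMIT * INGESTERS / REPLICATION_FACTOR

print(f"{steady:,.0f} series per ingester, {headroom:.1f}x headroom")
print(f"per-ingester limit reached at {max_active_before_limit:,.0f} active series")
```

Under these assumptions the per-ingester limit is reached at 1.5x the steady-state series count, so a genuine 10x burst is mostly rejected by the limit rather than absorbed. That protects the ingesters, but the team causing the burst will see failed writes.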

The Cost Reality

The Skyscanner post does not mention cost, which is understandable — it is an engineering post. But for anyone evaluating whether to build this versus staying on a SaaS platform, here are real numbers from a deployment we run for a client with comparable scale (20 EKS clusters, ~8,000 pods, ~2M active Prometheus time series, ~15TB of logs per month, ~500M spans per day):

Component	AWS instance type	Monthly cost (on-demand)
Mimir ingesters (6x)	r6g.2xlarge	~$1,740
Mimir store-gateway (3x)	r6g.xlarge	~$435
Mimir compactor (2x)	m6g.xlarge	~$220
Loki ingesters (6x)	r6g.xlarge	~$870
Tempo distributor + ingester (4x)	m6g.2xlarge	~$580
Grafana (2x)	t4g.medium	~$60
OTel Collectors (DaemonSet + gateway)	Shared node capacity	~$200 est.
S3 storage (logs + traces + metrics blocks)	S3 Standard + Glacier IR	~$1,100
Total		~$5,200/month

A Datadog bill for equivalent data volume — 2M custom metrics, 15TB logs, 500M APM spans — runs between $45,000 and $80,000 per month depending on contract and retention settings. That is not a made-up comparison; it is based on invoices we have seen during migration engagements.

The self-hosted stack is not free. There is real engineering time to operate it — we estimate 0.5 to 1.0 SRE FTE for a deployment at this scale, or roughly $80,000–$160,000 per year in fully-loaded labor cost. Even accounting for that, the math is not close.
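
Spelled out, the comparison looks like this. The sketch uses only the figures quoted in this section and annualizes them; the variable names are ours.

```python
SELF_HOSTED_INFRA_MONTHLY = 5_200       # total from the cost table
LABOR_ANNUAL = (80_000, 160_000)        # 0.5 to 1.0 SRE FTE, fully loaded
DATADOG_MONTHLY = (45_000, 80_000)      # range quoted for equivalent volume

infra_annual = SELF_HOSTED_INFRA_MONTHLY * 12
self_hosted_annual = tuple(infra_annual + labor for labor in LABOR_ANNUAL)
datadog_annual = tuple(monthly * 12 for monthly in DATADOG_MONTHLY)

# Least favorable case for self-hosting: cheapest Datadog bill vs most expensive self-hosted year.
min_ratio = datadog_annual[0] / self_hosted_annual[1]
# Most favorable case: most expensive Datadog bill vs cheapest self-hosted year.
max_ratio = datadog_annual[1] / self_hosted_annual[0]

print(f"self-hosted: ${self_hosted_annual[0]:,} to ${self_hosted_annual[1]:,} per year")
print(f"Datadog:     ${datadog_annual[0]:,} to ${datadog_annual[1]:,} per year")
print(f"ratio: {min_ratio:.1f}x to {max_ratio:.1f}x")
```

Even in the least favorable pairing, with a full SRE FTE against the low end of the Datadog range, the SaaS bill is more than double the self-hosted total.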

What to Do With This

If you are running fewer than 5 clusters and under 500K active time series, the full Skyscanner architecture, with its runtime config control plane, is more than you need. Start with a Collector DaemonSet, a small Grafana Alloy deployment as your gateway, and a single-region Mimir/Loki/Tempo install on EKS. Get comfortable with the operational surface before adding anything on top.

If you are running 10+ clusters or are actively migrating off Datadog or Splunk, the two-tier Collector design is worth implementing from the start. The gateway tier is where you will implement tail sampling, tenant routing, and the label normalization that prevents cardinality explosions downstream.

The OTel Collector config above is a starting point, not a production config. The tail_sampling policy in particular needs tuning against your actual trace volume and latency distribution before you commit to it.

If you want to talk through what this looks like for your specific cluster count, data volumes, and current vendor contract, we do that kind of architecture review at Etalon. No sales deck — we look at your current bill, your current data volumes, and tell you honestly whether the migration math works and what the operational cost will be. Reach out at etalon.systems.