How Skyscanner Manages 24 OTel Collector Clusters — And What Self-Hosters Can Learn From It

Skyscanner's OpenTelemetry write-up landed last week and it's one of the more honest operational accounts of running OTel at scale that I've seen from a large engineering org. They're managing 24 production Kubernetes clusters with a shared collector fleet, and the problems they hit — config drift, cardinality explosions, tail-sampling coordination across cluster boundaries — are exactly the problems that bite every team that moves off a managed SaaS agent model onto self-hosted OpenTelemetry. This post pulls apart the architectural decisions Skyscanner made, maps them onto the AWS-hosted Grafana/Mimir/Loki/Tempo stack we run for clients, and is honest about where their approach breaks down at different scales.

The Core Problem: OTel Collectors Are Not Cattle

Every observability migration starts with the same assumption: we'll just deploy a DaemonSet collector per cluster, point it at the backend, done. That works until you have more than three or four clusters and more than one team contributing to the collector config.

At that point you have three compounding failure modes:

Config drift. Team A adds a filter processor to drop debug logs. Team B adds a batch processor with a 10-second timeout. Nobody removes anything. Six months later the collector pipeline has 14 processors and nobody knows what half of them do.
Cardinality explosions at the pipeline level. A service starts emitting a high-cardinality attribute (say, user.id on every span, which then turns into a label on any metrics generated from those spans) and the collector happily forwards it. Your Mimir ingest costs triple overnight.
Tail sampling requires state. If you want to sample based on trace outcome (keep all error traces, drop 95% of success traces), the sampling decision has to happen after all spans for a trace are collected. That means a single span can't be evaluated in isolation on a node-local collector.
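
The third failure mode has a corollary once more than one collector instance makes sampling decisions: every span of a given trace has to reach the same instance, or each instance decides on a fragment. In the Collector, this is the job of trace-ID routing in the contrib loadbalancing exporter. The sketch below only illustrates the idea, using rendezvous hashing as a stand-in rather than whatever that exporter does internally. The replica names and the pick_gateway helper are hypothetical, and nothing here is taken from Skyscanner's setup.

```python
import hashlib


def pick_gateway(trace_id: str, replicas: list[str]) -> str:
    """Return the replica that should receive every span of this trace.

    Rendezvous hashing: score each (replica, trace_id) pair and take the
    highest score. The result depends only on the trace ID and the set of
    replicas, so every agent computes the same answer independently.
    """
    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{replica}|{trace_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(replicas, key=score)


# Hypothetical gateway replica names, for illustration only.
replicas = ["gateway-0", "gateway-1", "gateway-2"]
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Two agents holding the replica list in a different order still agree
# on where this trace goes.
assert pick_gateway(trace_id, replicas) == pick_gateway(trace_id, list(reversed(replicas)))
```

A useful property of this scheme is that removing a replica only moves the traces that replica owned. Everything else stays where it was, which matters when gateway pods are rescheduled in the middle of a decision window.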

Skyscanner's solution to all three is a layered collector architecture with centralized config management. The specifics are worth examining.

Their Architecture, Reconstructed

From the blog post, Skyscanner runs two collector tiers:

Tier 1 — Agent collectors (DaemonSet, one per node): receive OTLP from applications, do light filtering and attribute enrichment, forward to tier 2. No sampling decisions here.
Tier 2 — Gateway collectors (Deployment, multiple replicas per cluster): receive from tier 1, apply tail sampling, batch, and export to the central backend.

This is a well-understood pattern — the OTel Collector documentation calls it the "agent + gateway" topology — but Skyscanner's contribution is in how they manage config across 24 clusters without forking it 24 times.

They use a GitOps pipeline where collector config is templated (they don't say with what, but Helm + Kustomize overlays is the obvious answer), with a base config that all clusters inherit and per-cluster overrides for things like environment-specific sampling rates or backend endpoints. Changes go through PR review before they reach any cluster.

This sounds obvious. In practice, almost nobody does it. The teams we work with who are self-hosting OTel for the first time almost universally manage collector config by hand: kubectl edit on the live ConfigMap, then a pod restart. That's how you get drift.

Mapping This to an AWS + Grafana Stack

Here's what this architecture looks like when the backend is Grafana Mimir (metrics), Loki (logs), and Tempo (traces) running on AWS, which is the stack we deploy most often.

┌─────────────────────────────────────────────────────┐
│  EKS Cluster (×N)                                   │
│                                                     │
│  ┌──────────────┐      ┌──────────────────────────┐ │
│  │  App Pods    │─────▶│  OTel Agent (DaemonSet)  │ │
│  │  (OTLP SDK)  │      │  - attr enrichment       │ │
│  └──────────────┘      │  - k8s metadata          │ │
│                        │  - no sampling           │ │
│                        └────────────┬─────────────┘ │
│                                     │               │
│                        ┌────────────▼─────────────┐ │
│                        │ OTel Gateway (Deployment)│ │
│                        │  - tail sampling         │ │
│                        │  - metric transforms     │ │
│                        │  - log routing           │ │
│                        └────────────┬─────────────┘ │
└─────────────────────────────────────┼───────────────┘
                                      │ OTLP/gRPC
                    ┌─────────────────▼──────────────┐
                    │  AWS (central, multi-AZ)        │
                    │                                 │
                    │  Grafana Alloy (ingest layer)   │
                    │       │          │         │    │
                    │  Mimir (m)   Loki (l)  Tempo(t) │
                    │  on S3       on S3     on S3    │
                    │                                 │
                    │  Grafana (query + dashboards)   │
                    └─────────────────────────────────┘

A few specific decisions that matter here:

Why Grafana Alloy at the ingest layer instead of another OTel Collector?

Alloy (the successor to Grafana Agent) speaks OTLP natively but also has first-class support for Loki's push API, Mimir's remote write, and Tempo's OTLP endpoint. When you're running all three Grafana backends, Alloy simplifies the fan-out. You can receive a single OTLP trace stream and write spans to Tempo, extract exemplars to Mimir, and derive structured log events to Loki — all in one pipeline, with one config file, without a custom OTel connector.

If your backend is not the Grafana stack, use the OTel Collector. Alloy's advantage is specifically in the Grafana ecosystem.

Tail sampling config for a trace-error policy:

# otelcol-gateway-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic-fallback
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

This keeps 100% of error traces, 100% of traces over 2 seconds, and 5% of everything else. At 1,000 traces/sec ingress, num_traces: 50000 holds about 50 seconds of traffic, five times the 10-second decision_wait window. Ten seconds is long enough for most distributed traces to complete.
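
The way the three policies combine is easy to misread, so here is a minimal sketch of the decision logic, assuming a trace is kept as soon as any one policy votes to sample it. TraceSummary, keep_trace and the hash-based fallback are hypothetical stand-ins, not the processor's implementation. Whether the real latency policy treats the threshold as inclusive is not something this sketch settles.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    trace_id: str
    has_error: bool      # any span ended with status ERROR
    duration_ms: float   # end-to-end duration once the trace is complete


def in_fallback_sample(trace_id: str, percentage: float) -> bool:
    """Deterministic stand-in for the probabilistic policy: hash the trace ID
    into 10,000 buckets and keep the lowest `percentage` percent of them."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100


def keep_trace(t: TraceSummary, threshold_ms: float = 2000, fallback_pct: float = 5.0) -> bool:
    """A trace is kept as soon as any one policy votes to sample it."""
    if t.has_error:
        return True
    if t.duration_ms > threshold_ms:
        return True
    return in_fallback_sample(t.trace_id, fallback_pct)


def expected_retention(share_error_or_slow: float, fallback_pct: float = 5.0) -> float:
    """Fraction of all traces kept: every error or slow trace, plus the
    fallback percentage of whatever is left."""
    return share_error_or_slow + (1 - share_error_or_slow) * fallback_pct / 100


print(keep_trace(TraceSummary("a1", has_error=True, duration_ms=12)))      # True
print(keep_trace(TraceSummary("b2", has_error=False, duration_ms=3500)))   # True
print(round(expected_retention(0.03), 4))                                   # 0.0785
```

The last line is the number to watch: with a hypothetical 3% of traces being errors or slow, overall retention is just under 8%, not 5%. The fallback percentage is a floor on what you store, not the total.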

The thing nobody tells you: decision_wait and num_traces interact badly under traffic spikes. If a burst pushes the number of in-flight traces past num_traces before decision_wait expires, the processor drops the oldest traces from memory before their decision is due. You'll see truncated traces in Tempo with missing root spans. Size num_traces to at least 3× the traces you expect in flight at peak (peak traces/sec × decision_wait), not at your average rate.
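
The sizing arithmetic is worth writing down, because the failure only shows up during a burst. This is a rough sketch, assuming "peak" means traces in flight at the peak arrival rate (rate × decision_wait). That is our reading of the 3× rule of thumb, not something the processor documents. The burst rates and helper names are hypothetical.

```python
import math


def buffer_seconds(num_traces: int, traces_per_sec: float) -> float:
    """How many seconds of new traces fit in memory at a given arrival rate."""
    return num_traces / traces_per_sec


def suggested_num_traces(peak_traces_per_sec: float, decision_wait_s: float,
                         headroom: float = 3.0) -> int:
    """Traces in flight at peak (rate x wait), times a headroom factor."""
    return math.ceil(peak_traces_per_sec * decision_wait_s * headroom)


DECISION_WAIT_S = 10
NUM_TRACES = 50_000

for rate in (1_000, 4_000, 6_000):  # 4,000 and 6,000 are hypothetical burst rates
    seconds = buffer_seconds(NUM_TRACES, rate)
    status = "ok" if seconds >= DECISION_WAIT_S else "overflows before decision_wait"
    print(f"{rate:>5} traces/s: {seconds:5.1f}s of buffer ({status}), "
          f"suggested num_traces={suggested_num_traces(rate, DECISION_WAIT_S)}")
```

At the configured 1,000 traces/sec the default is comfortable. At a 6,000 traces/sec burst the same 50,000 slots hold less than the 10-second wait, which is exactly the regime where traces get cut short.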

Config management with Helm:

# values-base.yaml (all clusters inherit this)
collector:
  config:
    processors:
      tail_sampling:
        decision_wait: 10s
        num_traces: 50000
        policies:
          - name: errors-policy
            type: status_code
            status_code:
              status_codes: [ERROR]
    exporters:
      otlp:
        endpoint: "${CENTRAL_OTLP_ENDPOINT}"
        tls:
          insecure: false

# values-prod-eu-west-1.yaml (cluster-specific override)
collector:
  config:
    processors:
      tail_sampling:
        policies:
          - name: errors-policy
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: probabilistic-fallback
            type: probabilistic
            probabilistic:
              sampling_percentage: 2  # higher traffic cluster, lower fallback rate

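The cluster-specific override only changes what needs to change, with one caveat: Helm merges maps key by key but replaces lists wholesale, so an override that sets policies has to restate the full list (which is why errors-policy appears in both files). Scalar settings such as decision_wait and num_traces, and the exporter block, come from the base, and a change to them propagates to all clusters on the next Helm upgrade. A new policy added to the base reaches only the clusters that don't override policies; the others need the same entry added to their override. This is the part most teams skip, and it's the part that costs them the most time six months later.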
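
It helps to see what the merged values actually look like for the two files above, because Helm treats maps and lists differently: maps are merged key by key, lists are replaced as a whole. The sketch below mimics that behavior in a few lines. merge_values is a hypothetical helper and not Helm's code, the policy bodies are abbreviated, and it ignores Helm's handling of null values.

```python
from copy import deepcopy


def merge_values(base: dict, override: dict) -> dict:
    """Merge two values trees: dicts merge recursively, anything else
    (scalars and lists alike) is replaced by the override."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


errors = {"name": "errors-policy", "type": "status_code"}
slow = {"name": "slow-traces-policy", "type": "latency"}
fallback = {"name": "probabilistic-fallback", "type": "probabilistic"}

base = {"collector": {"config": {"processors": {"tail_sampling": {
    "decision_wait": "10s",
    "num_traces": 50000,
    "policies": [errors, slow],  # suppose slow-traces-policy was just added to the base
}}}}}
override = {"collector": {"config": {"processors": {"tail_sampling": {
    "policies": [errors, fallback],
}}}}}

effective = merge_values(base, override)["collector"]["config"]["processors"]["tail_sampling"]
policy_names = [p["name"] for p in effective["policies"]]
print(effective["decision_wait"], policy_names)
# prints: 10s ['errors-policy', 'probabilistic-fallback']
```

decision_wait is inherited, but the newly added slow-traces-policy never reaches this cluster because the override's list wins outright. Rendering the chart per cluster in CI and diffing the effective policy list is a cheap way to catch this before it ships.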

The Cardinality Problem Skyscanner Doesn't Fully Solve

Here's where I want to be honest about what the Skyscanner architecture doesn't address.

Tail sampling handles trace volume. It does not handle metric cardinality. If a service emits a metric with a user_id label (and this happens more often than you'd think, especially right after a Datadog migration), the gateway collector forwards every unique label combination to Mimir. At 100K active users, that's at least 100K series for a single metric, multiplied by every other label on it. Mimir will ingest it, but your storage costs will be ugly and your query performance will degrade.
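
The series count for one metric is bounded by the product of the distinct values of each of its labels, which is why a single user-scoped label does so much damage. Here is a back-of-envelope sketch. The label names and counts are hypothetical, and the result is an upper bound, since not every combination occurs in practice.

```python
from math import prod


def series_upper_bound(distinct_values: dict[str, int]) -> int:
    """Upper bound on active series for one metric: the product of the
    number of distinct values per label."""
    return prod(distinct_values.values())


# Hypothetical label set for a single request-count metric.
labels = {"http.method": 5, "http.status_code": 6, "http.route": 40}

before = series_upper_bound(labels)
after = series_upper_bound({**labels, "user_id": 100_000})
print(before, after)  # 1200 120000000
```

The 100K figure is the floor, not the ceiling: it is what you get when user_id is the only label on the metric.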

The right place to catch this is in the gateway collector, with a filter or transform processor that drops or hashes high-cardinality labels before they hit the backend:

processors:
  transform/drop_high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "session.id")
          - delete_key(attributes, "request.id")

But you have to know which labels are high-cardinality before you can drop them. The only way to know is to actually look at your Mimir cardinality explorer or run mimirtool analyze against your tenant. This is reactive, not proactive.
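
If you can get hold of a sample of data points (from staging, or from a short capture, however you obtain it), counting distinct values per attribute is enough to build a first drop list. This is a minimal sketch and no substitute for the Mimir tooling. The threshold, the sample data and the high_cardinality_labels helper are all hypothetical.

```python
from collections import defaultdict


def high_cardinality_labels(datapoints: list[dict[str, str]], threshold: int) -> dict[str, int]:
    """Count distinct values per attribute key across a sample of data points
    and return the keys whose count exceeds the threshold."""
    distinct = defaultdict(set)
    for attributes in datapoints:
        for key, value in attributes.items():
            distinct[key].add(value)
    return {key: len(values) for key, values in distinct.items() if len(values) > threshold}


# Synthetic sample: two methods, but a unique user and request per data point.
sample = [
    {"http.method": "GET" if i % 2 else "POST", "user.id": f"u{i}", "request.id": f"r{i}"}
    for i in range(500)
]
print(high_cardinality_labels(sample, threshold=100))
# prints: {'user.id': 500, 'request.id': 500}
```

This is still reactive, since it only sees labels that already exist. It can run against staging traffic before a release, though, instead of against production after the bill arrives.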

Grafana Labs shipped Adaptive Logs drop rules recently, which do something similar for logs: they identify noisy log lines and generate drop rules automatically. There's no equivalent for metric cardinality in the collector layer yet. This is a gap in the current toolchain.

When This Architecture Is Wrong

Two-tier collector fleets are operationally expensive. You're running more pods, more config, more failure surfaces. Before you adopt this pattern, ask:

Do you actually need tail sampling? If your answer to "what do you do with traces" is "we look at them when something breaks," you probably don't. Head sampling at 10-20% with 100% error capture is operationally simpler and covers 90% of debugging use cases. Tail sampling is worth the complexity only when you need to correlate trace outcomes with sampling decisions — for example, if you're doing SLO analysis on trace data and need statistically representative samples.

Is your cluster count below 5? At that scale, per-cluster config management is a spreadsheet problem, not a GitOps problem. Add the complexity when you feel the pain, not before.

Are you on Fargate or Lambda? The DaemonSet agent tier doesn't work on Fargate. You'll need sidecar collectors or the OTel Lambda layer, which changes the architecture significantly. The gateway tier still applies, but the agent tier becomes per-task, which changes your cost model.

What the Numbers Look Like

For a client we migrated from Datadog APM to self-hosted Tempo + Mimir in Q4 2025, the before/after on trace infrastructure cost:

Metric                          Datadog APM                       Self-hosted Tempo on S3
Ingested spans/month            ~18B                              ~18B (same workload)
Retained traces (tail sampled)  100% (Datadog samples at ingest)  ~8% (error + slow + 5% fallback)
Monthly cost                    $41,200                           $3,100
P99 trace query latency         340ms                             85ms
Operational overhead            ~0 (managed)                      ~6 hrs/month

The 6 hours/month operational overhead is real and you should put it in your business case. It's not zero. It's also not the 40+ hours/month that vendors imply when they warn you about "the hidden cost of self-hosting." The operational cost is front-loaded in the migration and flattens out once the GitOps pipeline is in place.
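
To put the overhead into the business case, price the hours and subtract them. The sketch below uses the two monthly cost figures from the table. The hourly rate is a placeholder assumption, not a number from this engagement, so substitute your own loaded engineering cost.

```python
def monthly_net_saving(before_usd: float, after_usd: float,
                       ops_hours: float, hourly_rate_usd: float) -> float:
    """Monthly saving after paying for the engineering time to run the stack."""
    return before_usd - (after_usd + ops_hours * hourly_rate_usd)


HOURLY_RATE_USD = 150  # hypothetical loaded cost per engineering hour

print(monthly_net_saving(41_200, 3_100, ops_hours=6, hourly_rate_usd=HOURLY_RATE_USD))   # 37200
print(monthly_net_saving(41_200, 3_100, ops_hours=40, hourly_rate_usd=HOURLY_RATE_USD))  # 32100
```

Even at the pessimistic 40 hours/month and this placeholder rate, the saving stays well above $30K a month for this workload. The overhead belongs in the model, but it does not change the conclusion.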

Where to Go From Here

If you're evaluating a move off Datadog APM or Splunk APM to self-hosted Tempo, the Skyscanner post is worth reading in full — it's linked in the OpenTelemetry blog and is more operationally specific than most case studies. The OTel Collector tail sampling documentation is also better than it used to be; the tailsamplingprocessor README now includes worked examples for the most common policies.

The piece that's hardest to get right without prior experience is the Helm/GitOps config structure for multi-cluster deployments. It's not technically complex — it's just easy to get wrong in ways that cause pain 12 months later. If you're starting that work and want a second opinion on your architecture before you commit to a pattern, that's the kind of review we do at Etalon. No obligation — we'd rather you get it right than call us to fix it later.