OpenTelemetry Collector tail sampling at ~50K spans per second on AWS

Tail sampling is where OpenTelemetry configs go to die: policy mistakes show up as missing traces in prod, not as a failing unit test. This is the collector shape we run when someone needs an OpenTelemetry tail sampling config that survives real AWS load.

Tail sampling is where you decide whether to keep a whole trace after enough spans have arrived to judge it. That is the right place for rules like "always keep errors" and "keep slow paths," because you are looking at the trace as a unit. It is also the wrong place to get clever with YAML you copied from a tutorial, because dropped spans never come back. There is no failing unit test that says you kept 0.1% of incidents last Tuesday.

We see buyer-intent searches for opentelemetry tail sampling config land on pages that show five lines of YAML and call it done. Real traffic at tens of thousands of spans per second on AWS is a different problem: memory for incomplete traces, autoscaling signals that are not request rate, and exporters that stall when Grafana Tempo or a vendor OTLP endpoint applies backpressure. Below is the collector shape we actually run when someone needs tail sampling to survive production, not a demo.

What tail sampling fixes (and what it breaks)

Head sampling (including probabilistic sampling at the SDK) is cheap and stateless. You flip a coin per trace, usually at the root, and you are done. The downside is obvious: the coin is flipped before anything interesting has happened, so you might drop the one trace that would have told you why a request was slow, or you might keep a mountain of healthy traffic because the coin said yes.

Tail sampling defers the decision until the collector has seen enough of the trace to apply policies: status codes, latency against a threshold, attribute matches, composite rules. That is the behavior platform teams want when they migrate off hosted APM and need to cap Tempo cost without blind random deletion.

The failure mode is silent. A decision_wait that is too short means you decide before late spans arrive and misclassify the trace. Too-small trace buffers mean you evict incomplete traces under load and bias your sample. Wrong HPA metrics mean you scale on request-rate signals while memory explodes. None of that shows up as a red X in CI.

Collector layout at roughly 50K spans per second on AWS

At this throughput we treat the OpenTelemetry Collector as a dedicated gateway tier on Amazon EKS, not as a sidecar that also happens to sample. Agent or DaemonSet collectors on nodes receive OTLP, batch, optionally apply cheap filters, then forward to one or more gateway pods that own tail_sampling.

Why split the path:

Ingress wants high fan-in, aggressive batching, and tight CPU profiles. Tail sampling wants heap headroom for the trace map.
Restarting a gateway during a rollout should not take down every node's receive path, provided load balancing is designed for it (gRPC or HTTP with consistent hashing, depending on how you shard).
You can set Pod memory limits on the gateway higher than the DaemonSet without over-provisioning every node.

We run Horizontal Pod Autoscaler on signals that correlate with tail sampler pressure: CPU and export queue depth (via collector internal metrics exposed to Prometheus or Grafana Mimir), not on a naive "requests per second" proxy that ignores span size inflation after a framework upgrade.
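As a rough sketch of what that can look like in an autoscaling/v2 manifest, the fragment below scales a gateway Deployment on CPU plus a per-pod export queue metric. The names (otel-gateway, observability), replica bounds, and target values are placeholders, not recommendations. The Pods metric only works if a custom metrics adapter such as prometheus-adapter already exposes the collector's queue metric to the Kubernetes metrics API, and the exact metric name varies by collector version.

```yaml
# Hypothetical HPA for the gateway tier. All names and numbers are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 3
  maxReplicas: 12
  metrics:
    # CPU covers parsing and policy evaluation cost.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    # Export queue depth, served through a custom metrics adapter.
    # The metric name is an assumption; check what your collector version emits.
    - type: Pods
      pods:
        metric:
          name: otelcol_exporter_queue_size
        target:
          type: AverageValue
          averageValue: "2000"
  behavior:
    scaleDown:
      # Scale down slowly: every replica change disturbs in-flight traces.
      stabilizationWindowSeconds: 600
```

The slow scale-down is a deliberate choice for a stateful sampler. A gateway that disappears takes its undecided traces with it.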

Kubernetes resource blocks are boring and worth spelling out. A gateway that tail-samples at this rate with Java-style span payloads is not the same pod as one that only sees Go services with skinny spans. We set explicit limits and requests, watch for OOMKilled during soak tests, and keep Grafana dashboards on RSS and Go runtime GC time (the collector is a Go binary, so there is no JVM heap to tune).
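For illustration, a container fragment along these lines is the kind of resource block we mean. The container name, image tag placeholder, and every number are assumptions for a pod whose memory_limiter is set to about 1800 MiB; size yours from soak tests. GOMEMLIMIT is the standard Go runtime soft memory limit, not a collector setting.

```yaml
# Fragment of a gateway Deployment pod spec (containers section only).
# All values are illustrative assumptions.
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:<pinned-version>
    resources:
      requests:
        cpu: "2"
        memory: 2304Mi
      limits:
        # Memory request equals limit so the scheduler reserves what the pod may use.
        # No CPU limit here, to avoid throttling during bursts; that is a judgment call.
        memory: 2304Mi
    env:
      # Go runtime soft limit, kept below the container limit so the GC
      # works harder before the kernel OOM killer gets involved.
      - name: GOMEMLIMIT
        value: "1800MiB"
```

The gap between the Go soft limit and the container limit is headroom for memory the Go heap does not account for, and how much you need is something only a soak test will tell you.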

A policy fragment that is a starting point, not a certificate

The OpenTelemetry Collector tail_sampling processor has knobs that trade latency for correctness and memory for safety:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: keep-small-random
        type: probabilistic
        probabilistic:
          sampling_percentage: 2

decision_wait: How long to hold spans for a trace before evaluating policies. Longer waits reduce wrong decisions on late spans; they add end-to-end latency to trace export and increase memory held in the map.
num_traces: Upper bound on traces tracked concurrently. When the map is full, behavior depends on version and settings; treat this as a capacity plan, not a comment.
expected_new_traces_per_sec: Helps internal sizing. If you lie to it, you do not get a validation error. You get odd eviction patterns under burst traffic.
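A back-of-envelope check shows how the three knobs relate at roughly 50K spans per second. The figure of 10 spans per trace is an assumption made for the arithmetic; measure your own average before reusing any of these numbers.

```yaml
# Sizing sketch. Assumes an average of 10 spans per trace (hypothetical).
#
#   50,000 spans/s / 10 spans per trace  = 5,000 new traces/s
#   5,000 traces/s * 10 s decision_wait  = 50,000 traces held at steady state
#   100,000 num_traces / 50,000 held     = 2x headroom for bursts
processors:
  tail_sampling:
    decision_wait: 10s                 # hold time before policies are evaluated
    num_traces: 100000                 # about 2x the steady-state in-flight count
    expected_new_traces_per_sec: 5000  # matches the arrival rate computed above
    # policies omitted here; the processor needs at least one to be valid
```

If span counts per trace double after an instrumentation change while span rate stays flat, the trace arrival rate halves but each held trace gets heavier, so memory per trace is the number to re-measure.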

We still load-test after major OpenTelemetry SDK upgrades. Span counts per request creep up silently when someone enables another auto-instrumentation module.

Putting tail sampling in a full trace pipeline (gateway)

A minimal gateway config that matches how we wire gateways before adding vendor-specific exporters:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1800
    spike_limit_mib: 350
  batch/traces:
    timeout: 5s
    send_batch_max_size: 8192
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 5000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline-random
        type: probabilistic
        probabilistic:
          sampling_percentage: 3

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch/traces]
      exporters: [otlp/tempo]

memory_limiter before tail_sampling is intentional. Without it, one bad client that triples span volume can take the pod down before autoscaling reacts. batch/traces sits after tail_sampling so that only kept spans are batched for export. The MiB numbers are cluster-specific; the pattern is not.

We often add a separate pipeline for traces that must never be sampled (security audit hooks, payment flows) with routing by attribute. That duplicates some config. The alternative is "forgot to exclude PCI traffic from probabilistic drop," which is worse.
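One way to sketch that split is the routing connector from the collector contrib distribution. This is an assumption about tooling, not the only option, and the connector's config syntax has changed across releases, so check it against the version you pin. The attribute name sampling.exempt and the pipeline names are hypothetical; the processors and exporter are the ones defined in the gateway config above.

```yaml
# Hypothetical split: exempt traffic bypasses tail_sampling entirely.
connectors:
  routing:
    default_pipelines: [traces/sampled]
    error_mode: ignore
    table:
      # Matches on a resource attribute; "sampling.exempt" is an invented name.
      - statement: route() where attributes["sampling.exempt"] == "true"
        pipelines: [traces/keep-all]

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [routing]
    traces/sampled:
      receivers: [routing]
      processors: [tail_sampling, batch/traces]
      exporters: [otlp/tempo]
    traces/keep-all:
      receivers: [routing]
      processors: [batch/traces]
      exporters: [otlp/tempo]
```

Routing on a resource attribute means the exemption has to be set by the service itself (or by an agent-side processor), so the whole trace from that service goes down the same path.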

Head plus tail: the combination we actually recommend

Tail sampling alone is not a magic cost switch. If you feed the gateway every span from high-volume health checks and static assets, you still pay CPU parsing OTLP and RAM holding traces you will almost certainly drop.

We routinely pair:

Probabilistic or consistent head sampling in the SDK or at the DaemonSet (modest percentage, enough to keep baseline coverage).
Tail policies on the gateway that guarantee errors, SLO violations, and tenant-specific keeps.

That way the tail processor sees a reduced arrival rate, but policies still see complete enough traces for decisions you care about. The honest downside: two layers of sampling confuse debugging ("why is this trace missing?" can be head or tail). You need documentation in the repo, not only a Grafana panel.
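A minimal sketch of the DaemonSet side of that pairing, using the filter and probabilistic_sampler processors from the contrib distribution. The route value, the 25 percent figure, the seed, and the otlp/gateway exporter name are all assumptions; receivers and exporters are omitted for brevity.

```yaml
# Agent (DaemonSet) fragment. Values are illustrative, not recommendations.
processors:
  filter/drop-noise:
    error_mode: ignore
    traces:
      span:
        # Drop health-check spans before they reach the gateway.
        # The attribute and path are hypothetical; match your own instrumentation.
        - 'attributes["http.route"] == "/healthz"'
  probabilistic_sampler:
    sampling_percentage: 25
    # Use the same seed on every agent so all nodes make the same
    # keep/drop decision for a given trace ID.
    hash_seed: 22

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop-noise, probabilistic_sampler, batch]
      exporters: [otlp/gateway]
```

Note what this costs: an error in a trace that lost the head-sampling coin flip never reaches the gateway, so the tail policy's "always keep errors" only applies to traces that survived the first layer.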

Exporter backpressure and Tempo ingestion

When Tempo or an OTLP vendor applies backpressure, the gateway's export queues grow. Tail sampling keeps consuming memory for open traces and you hold batches waiting to leave. This is where "scale on RPS" HPA configs die: request rate and CPU look fine while the process is blocked on full queues and the heap climbs toward the limit.

Mitigations we use in practice:

Dedicated distributor / ingest path sizing on Tempo (this is a Tempo ops topic, not a one-line fix).
Batched export tuned to your span size (large send_batch_max_size is not free RAM).
Secondary exporter or drop policy only after explicit product agreement (some teams prefer a dead-letter queue pattern over silent drop).
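To make the queue behavior explicit rather than leaving it on defaults, the exporter's standard queue and retry settings can be spelled out. The numbers below are placeholders to show the shape, not tuned values, and the unit of queue_size depends on the collector version (historically batches, not spans).

```yaml
# Exporter fragment with explicit queue and retry bounds. Values are assumptions.
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability.svc.cluster.local:4317
    tls:
      insecure: true
    timeout: 10s
    sending_queue:
      enabled: true
      num_consumers: 10
      # A bounded queue turns backpressure into visible drops
      # instead of unbounded memory growth.
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      # After this long, the batch is dropped and counted as a send failure.
      max_elapsed_time: 120s
```

Every queued batch is RAM on top of the tail sampler's trace map, so queue_size and send_batch_max_size belong in the same memory budget as num_traces.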

If your postmortem needs every trace from a bad deploy, no sampling story will save you. Retention and search limits still matter after the collector.

What we graph before we trust a new gateway rollout

Tail sampling bugs are easier to prevent than to explain to leadership after a Sev1. Before we raise traffic on a new gateway config, we wire collector self-telemetry into the same Mimir (or Prometheus) the rest of the platform uses. Concretely, we watch:

Exporter queue size and send failures for the Tempo OTLP exporter (stalls show up here before users open tickets).
Processor dropped spans and batch send duration (if batching backs up, tail decisions get delayed and memory rises).
Process RSS against Kubernetes limits (HPA on CPU alone misses the case where exports are blocked, CPU stays low, and the heap is near its limit).
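A sketch of Prometheus alerting rules for those three signals follows. The collector metric names are assumptions that vary by collector version (some releases add suffixes such as _total), the container name is hypothetical, and the memory rule uses the cAdvisor working set joined with kube-state-metrics limits as a stand-in for RSS.

```yaml
# Hypothetical rule group; verify every metric name against your own /metrics output.
groups:
  - name: otel-gateway-rollout
    rules:
      - alert: GatewayExportQueueFilling
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
      - alert: GatewayExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
        labels:
          severity: warning
      - alert: GatewayMemoryNearLimit
        expr: |
          container_memory_working_set_bytes{container="otel-collector"}
            / on (namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory", container="otel-collector"}
            > 0.85
        for: 10m
        labels:
          severity: critical
```

The thresholds are starting points. The useful part is that all three fire on memory and queue pressure, which CPU-based autoscaling does not see.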

We still run a synthetic trace generator in staging at a fraction of prod rate but with production span size distributions captured from anonymized samples. Guessing "about 50K spans per second" from dashboard averages is how you discover that Black Friday doubles child spans per checkout.

Trace affinity and load balancing (the footgun)

If multiple gateway pods sit behind a plain Kubernetes Service, nothing routes by trace ID. Each node agent's gRPC connection sticks to whichever pod it first reached, so spans for one trace that were emitted on different nodes land on different gateways. Each pod holds a fragment; tail_sampling never sees a complete trace, and your "keep errors" policy becomes random noise.

Fixes depend on your ingress path: consistent hashing on a trace-identifying attribute at the proxy, dedicated gateway per AZ with topology hints, or a single gateway pool sized large enough that you accept operational simplicity over horizontal fan-out. We pick per client based on failure tolerance and AWS network layout, not from a default Helm chart comment.

The point is not to memorize one topology. The point is that tail_sampling correctness is a distributed systems problem once you have more than one gateway replica.
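One commonly used mechanism, different from the proxy-level hashing mentioned above, is the loadbalancing exporter from the collector contrib distribution, configured on the agent tier to route by trace ID. The sketch below assumes a headless Service in front of the gateway pods; the hostname is hypothetical.

```yaml
# Agent-tier exporter fragment. Hostname and port are assumptions.
exporters:
  loadbalancing:
    # Hash on trace ID so every span of a trace reaches the same gateway pod.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Must resolve to individual pod IPs, hence a headless Service.
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```

This does not remove the distributed systems problem. When the set of gateway pods changes, some trace IDs map to a different pod, and traces in flight at that moment are split across the old and new owner.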

Numbers in one place

Knob	What goes wrong if you ignore it
`decision_wait` too short	Late spans arrive after you decided "drop"; you lose error evidence or keep junk.
`decision_wait` too long	Export latency grows; incident dashboards feel "sticky."
`num_traces` too small	Eviction under burst; sample bias toward short traces.
`num_traces` too large	OOM on gateways; noisy neighbor traces dominate RAM.
HPA on RPS only	Under-scales on fat payloads; over-scales on keep-alive noise.
No `memory_limiter`	Single bad service can starve the pod before HPA sees CPU.

When this layout is the wrong approach

Tail sampling at the gateway is the wrong default if:

You are still on OpenTelemetry Collector versions that predate fixes you rely on for your exact policy set. Pin versions in prod and read release notes.
Your traces are mostly long-lived (streaming jobs, batch pipelines with hour-long roots). decision_wait stops being a small constant; you need different policy design or head sampling first.
You have no one owning collector config after migration. YAML in git with no on-call rotation ends in March surprises.

If you only need "10% of traces" with no business rules, probabilistic head sampling alone is simpler. Tail sampling earns its complexity when errors and slow requests must never be probabilistically discarded.

Where to go next

We will cover probabilistic head sampling in more detail, OTLP exporter tuning under load, and Tempo distributor limits in follow-up notes. For opentelemetry tail sampling config that has to live next to real EKS invoices, start from memory, queues, and span cardinality, not from a single policy block.

If you are migrating off Datadog, Splunk, or similar and need collectors that survive production traffic rather than demo YAML, Etalon is the blunt entry point. We are a Bucharest-based consultancy shipping Grafana, Loki, Tempo, Mimir, and OpenTelemetry in customer AWS accounts, with Terraform and GitOps handoff, not slide-only recommendations.