May 25, 2026 · Mihai · 10 min read

OTel Graduates, But Your Collector Config Is Still a Liability: A Production Hardening Guide

OpenTelemetry reached CNCF graduation in May 2026. That's a meaningful signal — the project is stable, the APIs are locked, and the ecosystem is mature enough that betting your observability pipeline on it is no longer a gamble.

But graduation doesn't mean your OTel Collector deployment is production-ready.

In practice, most Collector configs we inherit from clients are one cardinality spike away from an OOM kill, one misconfigured exporter away from silent data loss, and completely missing the backpressure controls that separate a reliable pipeline from a fragile one. We've migrated eight enterprise teams off Datadog and Splunk in the last 18 months, and the pattern is consistent: the open-source stack is cheaper and more capable, but the Collector is where operational debt accumulates fastest.

This post covers the specific hardening steps we apply before we'd call a Collector deployment production-grade on AWS. The target stack is OTel Collector Contrib v0.102+, running on ECS Fargate or EC2, exporting to Grafana Mimir (metrics), Loki (logs), and Tempo (traces).

The Five Failure Modes We See Repeatedly

Before getting into config, it's worth naming the actual failure modes. These aren't theoretical.

  1. OOM kills under cardinality spikes. A deployment event introduces a new high-cardinality label (a UUID in a metric name, a full URL path as an attribute). Memory climbs. The container is killed. Metrics go dark. The team doesn't notice for 20 minutes because the alerting pipeline also runs through the Collector.

  2. Silent exporter failures. The OTLP exporter to Mimir returns a 429 or 503. The Collector logs the error and drops the batch. No alert fires. The gap shows up in dashboards hours later during an incident.

  3. Pipeline head-of-line blocking. A single slow receiver holds up the entire pipeline because the default pipeline is synchronous. Trace export backs up behind a log receiver that's waiting on a slow S3 write.

  4. No resource attribution. Spans and logs arrive in Tempo and Loki without service.name, deployment.environment, or k8s.pod.name. Every query requires guessing. RBAC in Grafana becomes meaningless because you can't scope by service.

  5. Collector as a single point of failure. One Collector per host or one central Collector per cluster, no redundancy, no queue persistence. A restart drops whatever was in the in-memory queue.

None of these are bugs in OTel. They're configuration and architecture choices that the project leaves to you.

Memory: Set Hard Limits and Use the Memory Limiter

The memory_limiter processor is not optional. Put it first in every pipeline.

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1536
    spike_limit_mib: 384

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resourcedetection, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, resourcedetection]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, batch, resourcedetection]
      exporters: [loki]

The limit_mib should be set to roughly 75% of the container's memory allocation. The spike_limit_mib covers transient bursts. When the Collector's memory usage exceeds the soft limit, limit_mib - spike_limit_mib (1152 MiB in this config), it starts refusing new data and signals backpressure upstream. This is the intended behavior: it's better to drop at the edge than to OOM and drop everything.

On ECS Fargate, set the task memory limit to limit_mib / 0.75. A 2048 MiB task gives you a 1536 MiB Collector limit with 512 MiB headroom for the OS and sidecar.
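
If you'd rather not redo that arithmetic every time the task size changes, the memory_limiter also accepts percentage-based settings. The fragment below is a sketch, not a drop-in replacement: it assumes the Collector can detect the container's memory limit from cgroups (Linux only), and the 20% spike value is our rounding of 384/2048, so the resulting soft limit differs slightly from the fixed-MiB config above. Confirm the computed limit in your environment before relying on it.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Percentages are relative to total memory detected for the container.
    # On a 2048 MiB task, 75% is about the same 1536 MiB hard limit.
    limit_percentage: 75
    # Assumed value: roughly the 384 MiB spike allowance used above.
    spike_limit_percentage: 20
```

If both limit_mib and limit_percentage are set, the fixed MiB value takes precedence, so use one style or the other.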

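Older configs pair the memory limiter with the ballast extension, which pre-allocates a fixed block of heap memory to reduce GC pressure:

extensions:
  memory_ballast:
    size_mib: 512

This is a Go runtime trick, not an OTel-specific one: a larger baseline heap means the GC triggers less often, which reduces CPU spikes under load. The 512 MiB here is 25% of the 2048 MiB task, or about a third of limit_mib. Be aware that memory_ballast is deprecated in recent Collector releases in favor of the Go runtime's GOMEMLIMIT environment variable. On v0.102+, we'd set GOMEMLIMIT in the task definition instead of adding ballast; roughly 80% of the container's memory is a common starting point.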

Backpressure and Queuing: Don't Rely on In-Memory Buffers Alone

The default OTLP exporter queues data in memory. A Collector restart drops that queue. For traces and logs, this is usually acceptable — you lose a few seconds of data. For metrics, it can create gaps that break SLO calculations.

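For the Mimir exporter, enable persistence. One caveat: the prometheusremotewrite exporter does not accept the generic sending_queue block that the otlp and otlphttp exporters use. Its in-memory queue is configured under remote_write_queue, and its on-disk buffer is the exporter's own wal block:

exporters:
  prometheusremotewrite/mimir:
    endpoint: https://mimir.internal/api/v1/push
    headers:
      X-Scope-OrgID: "production"
    auth:
      authenticator: basicauth/mimir
    remote_write_queue:
      enabled: true
      num_consumers: 10
      queue_size: 10000
    wal:
      directory: /var/otelcol/queue/mimir
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

For exporters that do accept sending_queue (otlp/tempo and loki in this stack), disk persistence comes from the file_storage extension, referenced as storage: file_storage/queue inside sending_queue:

extensions:
  file_storage/queue:
    directory: /var/otelcol/queue
    timeout: 10s

Both mechanisms write to disk under /var/otelcol/queue. On ECS, mount an EFS volume at that path. On EC2, use a local NVMe volume. Queued data survives Collector restarts, and retry_on_failure keeps retrying each batch for up to max_elapsed_time (5 minutes in this config) before dropping it.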
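
One detail that is easy to miss: extensions do nothing until they are listed under service.extensions. The fragment below sketches that wiring for the two extensions referenced in this section. The basicauth client_auth block and the MIMIR_USERNAME / MIMIR_PASSWORD variable names are our assumptions for illustration; substitute whatever credentials mechanism you actually use.

```yaml
extensions:
  file_storage/queue:
    directory: /var/otelcol/queue
    timeout: 10s
  basicauth/mimir:
    client_auth:
      # Hypothetical variable names, injected via the task definition.
      username: "${env:MIMIR_USERNAME}"
      password: "${env:MIMIR_PASSWORD}"

service:
  # An extension that is defined but not listed here is never started.
  extensions: [file_storage/queue, basicauth/mimir]
```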

For Loki and Tempo exporters, we typically leave queuing in-memory but increase num_consumers to 20 and queue_size to 5000. Traces and logs have lower retry value — a 5-minute-old trace is usually not worth retrying.
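
As a sketch, those settings look roughly like this. The endpoints are hypothetical placeholders, and the 60s max_elapsed_time is our assumption about what "not worth retrying" means in practice, not a measured value.

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo.internal:4317          # hypothetical endpoint
    sending_queue:
      enabled: true
      num_consumers: 20
      queue_size: 5000                     # in-memory; lost on restart
    retry_on_failure:
      enabled: true
      max_elapsed_time: 60s                # assumed: give up on stale traces early
  loki:
    endpoint: https://loki.internal/loki/api/v1/push   # hypothetical endpoint
    sending_queue:
      enabled: true
      num_consumers: 20
      queue_size: 5000
    retry_on_failure:
      enabled: true
      max_elapsed_time: 60s                # assumed, as above
```

Omitting storage under sending_queue is what keeps these queues in memory.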

Resource Attribution: Enforce It at the Collector, Not at the SDK

You cannot trust every application team to instrument correctly. Some will forget service.name. Some will set it to my-service-v2-test-DO-NOT-USE. The Collector is the right place to enforce baseline resource attributes.

The resourcedetection processor pulls attributes from the environment automatically:

processors:
  resourcedetection:
    detectors: [env, ecs, ec2]
    timeout: 5s
    override: false

With override: false, SDK-set attributes win. The detector fills in gaps. On ECS, this automatically adds aws.ecs.task.family, aws.ecs.cluster.arn, and cloud.region. On EC2, you get host.id, host.name, and cloud.availability_zone.

For attributes the environment can't detect — like deployment.environment — inject them via the resource processor using environment variables set in the task definition:

processors:
  resource:
    attributes:
      - key: deployment.environment
        value: "${DEPLOYMENT_ENV}"
        action: upsert
      - key: service.namespace
        value: "${SERVICE_NAMESPACE}"
        action: upsert

Set DEPLOYMENT_ENV=production and SERVICE_NAMESPACE=payments in the ECS task definition. The Collector stamps every span, metric, and log with these values before export. Grafana RBAC, Loki label-based access control, and Tempo's service graph all depend on these attributes being present and consistent.
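
For teams that manage ECS through CloudFormation, the task definition side might look like the fragment below. This is a sketch under assumptions: the resource name, task family, and image tag are illustrative, and roles, logging, and port mappings are left out.

```yaml
Resources:
  CollectorTaskDefinition:                 # hypothetical logical name
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: otel-collector               # hypothetical family name
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "1024"
      Memory: "2048"                       # matches limit_mib / 0.75 from earlier
      ContainerDefinitions:
        - Name: otel-collector
          Image: otel/opentelemetry-collector-contrib:0.102.0   # assumed tag
          Environment:
            # Read by the resource processor via ${DEPLOYMENT_ENV} etc.
            - Name: DEPLOYMENT_ENV
              Value: production
            - Name: SERVICE_NAMESPACE
              Value: payments
```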

Cardinality Control: Filter Before It Reaches Mimir

High-cardinality metrics are the most common cause of Mimir ingestion cost spikes. The filter and transform processors let you drop or relabel before data leaves the Collector.

A real example: a client's Java services were emitting http.server.request.duration with url.path as an attribute. Every unique URL path (including those with UUIDs) became a separate metric series. We went from ~200K active series to ~2.1M in 48 hours after a new service deployed.

The fix:

processors:
  transform/drop_high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "url.path")
          - delete_key(attributes, "http.url")
          - delete_key(attributes, "user.id")

  filter/drop_debug_metrics:
    metrics:
      metric:
        - 'name == "jvm.gc.collections.count" and resource.attributes["deployment.environment"] == "staging"'

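The transform processor removes the offending attributes from every datapoint before export. The filter processor drops entire metrics you don't need to store: in this case, JVM GC counts from staging environments. One caveat: deleting an attribute does not merge the datapoints that now share an identical attribute set, so watch for duplicate-sample rejections in Mimir after rollout.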

Combined, these two processors reduced the client's Mimir active series from 2.1M back to ~280K, which dropped their AWS EBS cost for Mimir storage by roughly 60% month-over-month.
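
Deleting url.path outright is the blunt option. If you still want a per-route breakdown, an alternative is to normalize the path instead of removing it. The sketch below uses OTTL's replace_pattern to collapse UUIDs into a placeholder; the processor name and the {uuid} placeholder are our own choices, and this only bounds cardinality if UUIDs are the only unbounded part of your paths.

```yaml
processors:
  transform/normalize_url_path:            # hypothetical processor name
    metric_statements:
      - context: datapoint
        statements:
          # Collapse any UUID in the path to a fixed placeholder so that
          # /orders/<uuid>/items becomes /orders/{uuid}/items.
          - 'replace_pattern(attributes["url.path"], "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}", "{uuid}") where attributes["url.path"] != nil'
```

This would replace the delete_key line for url.path in the metrics pipeline, not run alongside it.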

Pipeline Isolation: Separate Receivers Don't Mean Separate Pipelines

A common misconception: if you have separate receivers for traces, metrics, and logs, the pipelines are isolated. They're not, by default. All pipelines run in one process and share its memory, CPU, and any receiver instance listed in more than one pipeline, so a slow processor or stalled exporter in the metrics pipeline can degrade the traces pipeline if you're not careful with the Collector's concurrency and queue settings.

Use named pipelines and be explicit about which processors each one uses:

service:
  pipelines:
    traces/production:
      receivers: [otlp]
      processors: [memory_limiter, batch/traces, resourcedetection, resource, attributes/traces]
      exporters: [otlp/tempo]
    metrics/production:
      receivers: [otlp, prometheus/internal]
      processors: [memory_limiter, batch/metrics, resourcedetection, resource, transform/drop_high_cardinality, filter/drop_debug_metrics]
      exporters: [prometheusremotewrite/mimir]
    logs/production:
      receivers: [otlp, filelog]
      processors: [memory_limiter, batch/logs, resourcedetection, resource]
      exporters: [loki]

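Naming pipelines (traces/production vs traces) is cosmetic, but naming the components inside them is useful for the Collector's own telemetry. Internal metrics such as otelcol_processor_dropped_metric_points are labeled by component ID (processor, receiver, exporter), so a per-signal name like batch/metrics lets you alert on drops per pipeline.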

Also configure named batch processors with different settings per signal type:

processors:
  batch/traces:
    send_batch_size: 512
    timeout: 1s
  batch/metrics:
    send_batch_size: 2000
    timeout: 10s
  batch/logs:
    send_batch_size: 1000
    timeout: 5s

Traces benefit from lower latency (1s timeout). Metrics tolerate batching for 10 seconds to reduce write amplification on Mimir. Logs sit in between.

Collector Self-Observability: Monitor the Monitor

The Collector exposes its own metrics on port 8888 by default. Scrape them with a Prometheus receiver pointed at localhost, or configure the Collector to push its own telemetry to Mimir:

service:
  telemetry:
    logs:
      level: warn
    metrics:
      level: detailed
      address: 0.0.0.0:8888
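
For the scrape option, a minimal sketch of the prometheus/internal receiver referenced in the metrics pipeline earlier might look like this. The job name and 15s interval are assumptions.

```yaml
receivers:
  prometheus/internal:
    config:
      scrape_configs:
        - job_name: otel-collector         # assumed job name
          scrape_interval: 15s             # assumed interval
          static_configs:
            # The Collector's own telemetry endpoint configured above.
            - targets: ["localhost:8888"]
```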

The metrics you actually need to alert on:

| Metric | Alert threshold | What it means |
| --- | --- | --- |
| otelcol_processor_dropped_metric_points | > 0 for 5m | Data is being dropped, usually memory pressure |
| otelcol_exporter_send_failed_metric_points | > 0 for 2m | Exporter is failing, check upstream |
| otelcol_receiver_refused_metric_points | > 0 for 1m | Memory limiter is rejecting ingest |
| otelcol_process_memory_rss | > 85% of limit | Approaching OOM |
| otelcol_exporter_queue_size | > 80% of queue_size | Queue filling up, upstream slow |

We configure these as Grafana alerting rules against Mimir. The Collector's own metrics pipeline feeds into the same Mimir it's exporting application metrics to — which means if Mimir goes down, you also lose Collector telemetry. Accept this tradeoff or run a separate lightweight Prometheus instance for Collector self-monitoring.
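
Expressed as Prometheus-style rules (the format the Mimir ruler accepts), a few of those thresholds could be sketched as follows. Treat this as a starting point under assumptions: the alert names and severity label are ours, the counters are wrapped in rate() because a raw counter stays above zero forever after the first failure, and depending on Collector version the counter names may carry a _total suffix.

```yaml
groups:
  - name: otel-collector                   # assumed group name
    rules:
      - alert: CollectorExporterSendFailures
        expr: sum by (exporter) (rate(otelcol_exporter_send_failed_metric_points[5m])) > 0
        for: 2m
        labels:
          severity: page                   # assumed label scheme
      - alert: CollectorReceiverRefusing
        expr: sum by (receiver) (rate(otelcol_receiver_refused_metric_points[5m])) > 0
        for: 1m
      - alert: CollectorMemoryHigh
        # 85% of the 1536 MiB limit_mib, expressed in bytes.
        expr: otelcol_process_memory_rss > 0.85 * 1536 * 1024 * 1024
        for: 5m
      - alert: CollectorQueueNearlyFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
```

The same pattern applies to the span and log record variants of the send-failed and refused counters, which the table above omits.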

Where This Approach Falls Short

This hardening guide assumes a relatively stable workload. There are cases where it doesn't hold:

Very high-throughput, low-latency environments. If you're processing >500K spans/second per Collector instance, the file-based queue adds measurable latency on the write path. In that case, use Amazon SQS or Kinesis as a durable buffer between the Collector tier that receives data and the tier that exports it to the backends, and accept the added complexity.

Multi-tenant setups with strict data isolation. A single Collector handling data from multiple tenants with different Mimir org IDs requires careful routing logic in the routing connector. It's doable but the config becomes significantly more complex. We've seen teams underestimate this and end up with cross-tenant data leakage in Loki.
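
To give a sense of the shape, here is a rough sketch of tenant routing for logs with the routing connector. Everything tenant-specific is hypothetical: the tenant.id resource attribute, the pipeline names, and the per-tenant loki/tenant_a and loki/tenant_b exporters, which are assumed to be defined elsewhere with their own org IDs. Check the connector's documentation for your Collector version before adopting the syntax.

```yaml
connectors:
  routing:
    # Unmatched data goes to a dead-end pipeline rather than a tenant,
    # so a missing attribute cannot leak into someone else's store.
    default_pipelines: [logs/unrouted]
    error_mode: ignore
    table:
      - statement: route() where attributes["tenant.id"] == "tenant-a"
        pipelines: [logs/tenant_a]
      - statement: route() where attributes["tenant.id"] == "tenant-b"
        pipelines: [logs/tenant_b]

service:
  pipelines:
    logs/ingest:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [routing]
    logs/tenant_a:
      receivers: [routing]
      exporters: [loki/tenant_a]           # hypothetical, defined elsewhere
    logs/tenant_b:
      receivers: [routing]
      exporters: [loki/tenant_b]           # hypothetical, defined elsewhere
    logs/unrouted:
      receivers: [routing]
      exporters: [debug]
```

The important design choice is the default: routing unmatched data to a shared tenant is how the cross-tenant leakage described above tends to happen.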

Windows-based workloads. The file_storage extension has known issues on Windows paths. If you're running OTel Collector on Windows EC2 (it happens), test the queue persistence behavior explicitly before relying on it.

Teams with no existing Collector operational experience. The hardening here adds real operational surface area. If your team has never debugged a Collector pipeline before, start with a simpler config and add complexity incrementally. A misconfigured filter processor that silently drops production metrics is worse than a naive config that's easy to reason about.

What to Do Next

If you're running OTel Collector in production today, start with the memory limiter and the self-observability metrics. Those two changes catch the most common failure modes with the least config risk.

If you're mid-migration from Datadog or Splunk, the resource attribution and cardinality control sections are where we'd focus first — they have the most direct impact on cost and query quality in Mimir and Loki.

The full reference config for the stack described here (OTel Collector Contrib v0.102, ECS Fargate, Mimir + Loki + Tempo on AWS) is something we've built out and refined across multiple client migrations. If you're working through a similar migration and want a second opinion on your Collector architecture, we're happy to take a look — etalon.systems.

Category: Observability
