Logs feel definitive because they contain words. Traces feel modern because they draw pretty waterfalls. Metrics can look boring: a counter, a histogram, a gauge. In production systems at the scale we work at (Kubernetes on AWS, Prometheus remote write into Mimir, terabytes per day into Loki), metrics are still the cheapest signal per decision when they are designed with cardinality discipline. Ignore them and you either overpay for log storage or fly blind on saturation until something fails hard.
This post is for the engineer who inherited a "logs-first" culture and wonders why Finance keeps asking about observability spend, and for the VP who wants a straight answer without vendor poetry. If you already run disciplined OpenTelemetry and Mimir, you can skim for the governance bits; if you are mid-migration, read the cardinality section twice.
Why metrics win on cost and speed at query time
A histogram series aggregated across pods answers "what is p99 doing?" across thousands of instances with a single PromQL expression. Answering the same question from raw logs usually means a full-text scan unless you have invested heavily in indexing, and even then you accept tail latency on the query.
Cardinality is the price knob. A metric labeled only by service, region, and http_route might be thousands of series. Add user_id and you are in the billions. Metrics force you to face that math early because Prometheus and Mimir will tell you (sometimes loudly).
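The arithmetic is worth doing by hand once. A minimal sketch, assuming hypothetical per-label value counts (none of these numbers come from a real system); the product is a worst-case upper bound, since not every label combination occurs in practice:

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active series for one metric name:
    the product of the number of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical value counts, chosen only for illustration.
base = {"service": 40, "region": 4, "http_route": 25}
with_user = {**base, "user_id": 1_000_000}

print(max_series(base))       # 4000
print(max_series(with_user))  # 4000000000
```

One extra label with a million values turns thousands of series into billions, which is why the label set deserves review before the metric ships.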
Logs and traces are essential for narrative ("what exactly did we enqueue?"), but metrics are how you run SLOs, autoscaling, and capacity planning without bankrupting the query path.
The three roles metrics still own in 2026
1. SLOs and error budgets. sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) is boring until it is the only number leadership and engineering agree on during an incident. Logs rarely substitute for a clean ratio over time.
2. Saturation signals. CPU alone lies for JVM, Go, and Node workloads. Memory pressure, disk I/O, and file descriptor exhaustion show up in metrics before they show up as human-readable log lines. Node exporter and cAdvisor-style metrics remain the backbone of "is the cluster healthy?"
3. Correlation hooks for traces and logs. trace_id in logs is useless if you cannot find the surrounding error rate spike in metrics to know which service to open first. Metrics are the map; traces are the street view.
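The error-budget arithmetic behind the first role is small enough to sketch. This assumes a simple request-based SLO over a fixed window; the function name and the numbers are illustrative, not taken from any particular tool:

```python
def error_budget_remaining(slo_target: float, total: int, errors: int) -> float:
    """Fraction of the error budget left for the window.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, no budget spent
    allowed_errors = (1.0 - slo_target) * total
    return 1.0 - errors / allowed_errors


# Hypothetical window: 1M requests, 250 of them 5xx, against a 99.9% target.
print(round(error_budget_remaining(0.999, 1_000_000, 250), 3))  # 0.75
```

Both inputs are exactly what the ratio query above produces, which is why the clean ratio matters more than any individual log line.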
A minimal PromQL toolkit we expect teams to know
Not exhaustive, but if your on-call rotation cannot read these patterns, training will save more money than another dashboard:
```promql
# Request error ratio (adjust job labels to your convention)
  sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum by (job) (rate(http_requests_total[5m]))

# p99 latency from a histogram (route label optional)
histogram_quantile(
  0.99,
  sum by (le, job) (rate(http_server_duration_seconds_bucket[5m]))
)
```
If those queries time out, the fix is rarely "buy a bigger Grafana." It is label cardinality, scrape cardinality, or range window abuse.
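It helps on-call engineers to know that a histogram quantile is an estimate interpolated inside a bucket, not an exact percentile. The sketch below is a simplified version of that interpolation for classic cumulative buckets; it is not Prometheus's implementation, and the bucket bounds and counts are made up:

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) buckets.

    Simplified linear interpolation; the input must include the +Inf bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # Past the last finite bucket: report that bucket's upper bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    if idx == 0:
        if upper <= 0:
            return upper
        lower, prev_count = 0.0, 0.0
    else:
        lower, prev_count = buckets[idx - 1]
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative bucket counts (seconds).
example = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
print(round(histogram_quantile(0.99, example), 3))  # 0.833
```

The p99 here lands between 0.5s and 1.0s only because those are the bucket edges. If the SLO threshold sits inside a wide bucket, the reported quantile is only as precise as that bucket.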
When metrics-first steering is wrong
Highly variable, low-volume event streams (security audits, monthly batch reconciliations) sometimes belong in logs or a warehouse, not in a TSDB where they cause a cardinality explosion.
Exploratory analytics ("show me all properties of this cohort for Q3") is a BI problem. Shoving those dimensions into custom metrics is how you recreate a bad data warehouse on the hot path.
Early debugging on a laptop: fmt.Printf still exists. Production is different.
How metrics, logs, and traces should split the work
| Question type | Best first signal | Why |
|---|---|---|
| "Are we out of CPU or memory budget?" | Node and container metrics | Cheap, high frequency |
| "Did error rate rise for service X?" | RED metrics from OTel | Fast slice across pods |
| "Which dependency blew the budget?" | Trace span metrics + exemplars | Localizes fan-out |
| "What payload confirmed the bug?" | Structured logs | Text is expensive at volume |
Failure modes we see when teams deprioritize metrics
Log volume becomes the observability strategy. Finance notices. Query latency notices sooner.
Tracing without baseline metrics. You can see a slow span and still miss that the cluster was throttling because nobody charted CPU throttling metrics.
Custom business metrics in the same TSDB as infra metrics without naming conventions. revenue_usd_total next to container_cpu_usage_seconds_total without namespaces ends in tears and accidental joins.
Cardinality budget: the one governance rule that matters
We tell clients to treat label sets like a public API: adding a label is a semver bump. Example policy fragment (human process, not code):
- Tier A (infra): cluster, namespace, deployment, pod where the TSDB can handle churn
- Tier B (app): service.name, http.route with an allowlist
- Tier C (never in labels): user_id, request_id, free-form URLs
Put request identifiers in logs and traces, not in metric labels, unless you enjoy paging the Mimir team at 03:00.
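The policy above is a human process, but a small check in CI can catch the obvious violations before review. A minimal sketch, assuming the label names from the tiers above; `url` stands in for "free-form URLs" and the function name is hypothetical:

```python
# Tier A and Tier B labels from the policy above.
ALLOWED = {"cluster", "namespace", "deployment", "pod", "service.name", "http.route"}
# Tier C: never allowed as metric labels. "url" is an assumed label name.
FORBIDDEN = {"user_id", "request_id", "url"}


def check_labels(labels: set[str]) -> list[str]:
    """Return one human-readable problem per label that violates the policy."""
    problems = []
    for name in sorted(labels):
        if name in FORBIDDEN:
            problems.append(f"{name}: forbidden in metric labels (Tier C)")
        elif name not in ALLOWED:
            problems.append(f"{name}: not on the allowlist, needs review")
    return problems


print(check_labels({"service.name", "http.route"}))  # []
print(check_labels({"service.name", "user_id"}))
```

An empty list means the label set fits the budget as written; anything else is the "semver bump" conversation, held before the series exist.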
Scrape interval and sample rate: the arithmetic every SRE should do once
If you scrape a target every 15 seconds, each time series produces roughly 5,760 samples per day before replication. At 30 seconds, halve that. When someone asks for 1-second scrapes "for smoother graphs," they are asking for thirty times the write path load versus a 30s interval for the same label set. That is not a moral judgment; it is capacity planning. Mimir and Prometheus both punish cute intervals without headroom.
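The same arithmetic as a few lines of code, for anyone who wants to plug in their own numbers. The fleet size at the end is a hypothetical value for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(scrape_interval_s: int) -> int:
    """Samples one series produces per day at a fixed scrape interval, before replication."""
    return SECONDS_PER_DAY // scrape_interval_s


for interval in (30, 15, 1):
    per_series = samples_per_day(interval)
    ratio = per_series // samples_per_day(30)
    print(f"{interval:>2}s: {per_series:>6} samples/series/day ({ratio}x the 30s load)")

# Hypothetical fleet size, for illustration only.
active_series = 100_000
print(samples_per_day(15) * active_series)  # 576000000
```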
Remote write batches matter too. Micro-bursts from poorly tuned remote_write queues show up as ingester CPU cliffs long before dashboards look "wrong." Metrics discipline includes how often you ship points, not only how many labels you attach.
Exemplars: tying histograms to traces without labeling every request
Exemplars (histogram metadata that points at example trace IDs) are the compromise we push when product wants "per-request latency" but TSDB cardinality says no. Grafana can jump from a latency spike to a trace using exemplars when the backend supports them. That keeps aggregate metrics cheap while still giving a path into Tempo for deep dives.
RED versus USE: still the vocabulary we teach first
For services, RED (rate, errors, duration), Tom Wilkie's distillation of Google's golden signals, stays the default dashboard skeleton. For nodes and datastores, USE (utilization, saturation, errors) stays the Brendan Gregg checklist. Logs and traces sit on top of those layers; they do not replace them. Teams that skip straight to trace waterfalls without RED charts usually rediscover why error rate is the fastest global filter during an outage.
Anonymized war story: the label that cost six figures in spirit
We once saw a tenant_slug label added to HTTP server metrics "temporarily" for a multi-tenant rollout. Cardinality went from ~120k to over 4M active series in a weekend because every integration test tenant minted a slug. Mimir did not crash immediately; the damage showed up first as a larger bill and growing compaction lag. The fix was code and config, not hardware. The lesson belongs in a metrics post because log volume did not cause it; unbounded label cardinality did. That class of mistake is why we still argue metrics matter even when logs feel safer to product teams.
Recording rules: cheaper dashboards, clearer alerts
Recording rules pre-aggregate expensive expressions so dashboards and alerts hit smaller series counts. They are not free (compaction and rule evaluation CPU exist), but they beat letting every Grafana user paste a five-way histogram_quantile join into a refresh-heavy panel. We treat recording rules like APIs: owned, reviewed in CI, deleted when the product metric dies.
Example pattern for a team-scoped error ratio you might record once and reuse everywhere:
```yaml
groups:
  - name: platform_recording
    interval: 30s
    rules:
      # 5xx request rate per job and team
      - record: job:request_errors:rate5m
        expr: sum by (job, team) (rate(http_requests_total{status=~"5.."}[5m]))
      # Total request rate per job and team
      - record: job:request_total:rate5m
        expr: sum by (job, team) (rate(http_requests_total[5m]))
```
Downstream alerts then divide recorded series instead of recomputing from raw high-cardinality inputs on every evaluation tick.
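What "smaller series counts" means can be shown with a toy version of `sum by`. The sketch below aggregates hypothetical per-pod rates down to (job, team) and then divides the two results, mirroring what an alert on the recorded series would do; the label values and rates are invented:

```python
from collections import defaultdict

Sample = tuple[dict[str, str], float]


def sum_by(samples: list[Sample], keep: tuple[str, ...]) -> dict[tuple[str, ...], float]:
    """Sum sample values, keeping only the labels named in `keep`."""
    out: dict[tuple[str, ...], float] = defaultdict(float)
    for labels, value in samples:
        out[tuple(labels.get(k, "") for k in keep)] += value
    return dict(out)


# Hypothetical per-pod request rates (requests/second).
total = [
    ({"job": "api", "team": "core", "pod": "api-1"}, 40.0),
    ({"job": "api", "team": "core", "pod": "api-2"}, 60.0),
    ({"job": "web", "team": "growth", "pod": "web-1"}, 20.0),
    ({"job": "web", "team": "growth", "pod": "web-2"}, 30.0),
]
errors = [
    ({"job": "api", "team": "core", "pod": "api-1"}, 1.5),
    ({"job": "api", "team": "core", "pod": "api-2"}, 2.5),
]

recorded_total = sum_by(total, ("job", "team"))
recorded_errors = sum_by(errors, ("job", "team"))
ratios = {k: recorded_errors.get(k, 0.0) / v for k, v in recorded_total.items()}

print(len(total), "->", len(recorded_total))  # 4 -> 2
print(ratios[("api", "core")])                # 0.04
```

Four pod-level series collapse to two recorded ones here; in a real cluster with pod churn the reduction is far larger, and the division runs on the small side.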
kube-state-metrics and cAdvisor: boring metrics that prevent expensive incidents
Before anyone talks about distributed tracing, we check whether kube-state-metrics and cAdvisor-style container metrics are actually scraped with sane intervals. Pod restart storms, ImagePullBackOff loops, and CPU throttling show up here first. They are not glamorous signals; they are the ones that keep node pools from becoming mystery money. If your "observability strategy" skips them because leadership only wants "customer-facing" dashboards, you will pay in AWS autoscaling noise later.
Alerting on metrics first, logs second
Alert rules should prefer rates, histogram quantiles, and SLO burn over "log contains string ERROR" whenever human triage allows it. Log-based alerts are valid for security signatures and audit anomalies; they are expensive and noisy as generic health checks. Metrics-first paging keeps wakeups correlated with user-visible failure modes instead of log pipeline lag masquerading as an application incident. Your on-call rotation will not thank you for the alternative, and your error budget graphs will look honest faster.
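For readers new to "SLO burn," the paging condition is a ratio check over two windows. A minimal sketch, assuming the commonly cited pairing of a one-hour and a five-minute window with a 14.4x threshold for a 30-day SLO; treat the threshold and window choice as assumptions to tune, not a prescription:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the problem
    is significant, the short window shows it is still happening."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# Hypothetical error ratios against a 99.9% SLO.
print(should_page(0.02, 0.03, 0.999))   # True: sustained and ongoing
print(should_page(0.02, 0.001, 0.999))  # False: already recovering
```

Both inputs are plain error ratios, so the alert can be built entirely from recorded metric series without touching the log pipeline.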
Where to go next
If you are standardizing on OpenTelemetry and Grafana Mimir but your metric namespace already looks like a garage sale, we fix that before we talk about "AI observability." Etalon is where we describe how we run those migrations on AWS for teams exiting proprietary stacks.