Observability versus monitoring: runbooks, unknowns, and OpenTelemetry on AWS

Monitoring is the coverage you wired in advance. Observability is whether you can answer the next question without shipping a new metric for every hypothesis. Here is how we use both when we move teams to Grafana LGTM and OTel on AWS.

For a decade, "monitoring" meant Nagios-style checks, a Grafana folder of golden signals, and paging when a threshold was crossed. That still matters. What changed is the shape of failure: a checkout path might touch forty internal calls, three feature flags, and two cache layers that nobody listed when the original runbook was written. Monitoring answers "did the probe turn red?" Observability answers the questions nobody wrote a probe for, using telemetry you already ship.

We use both words on client calls. The confusion is not academic. It decides budget (another SaaS seat versus a platform hire), instrumentation (OpenTelemetry everywhere or log lines only?), and on-call (wake someone for a synthetic miss or for a user-visible trace tail?). Below is how we separate the two when we migrate teams off Datadog-class stacks onto Grafana Mimir, Loki, Tempo, and OpenTelemetry on AWS.

Monitoring: deliberate coverage of known failure modes

Monitoring is signals you chose in advance, plus routes when they breach. Examples we still ship in production after a migration:

Black-box synthetics against /health and a canary purchase flow
Prometheus (or Mimir) rules on kube_pod_status_ready, queue depth, replication lag
Disk and certificate expiry checks on stateful sets

That is the classic "known unknowns" posture: you listed risks, you instrumented them, you alert when the model breaks. It is cognitively cheap because the question is fixed ("is Postgres replication more than sixty seconds behind?").

Monitoring stays the right default when the system is small, compliance wants fixed evidence, or you genuinely have one runtime and a short list of dependencies.

Observability: high-cardinality exploration when the model breaks

Observability is not marketing for "more logs." In our projects it means three correlated pillars you can move across in one UI session:

Metrics for rates, errors, duration, saturation (Prometheus remote write into Mimir at scale)
Structured logs for context when you know the service neighborhood (Loki with sane labels)
Traces for end-to-end latency and dependency truth (Tempo or vendor APM, with sampling you can defend)

The practical test we give platform teams: pick a real incident ticket. Can an engineer, in under fifteen minutes, go from a user-visible symptom to the responsible service boundary and the deploy that introduced it, without asking another team to kubectl cp raw files? If yes, you are observable enough for your current architecture. If every sev-2 starts with "who owns pod X and which filebeat path," you are still monitoring-shaped with expensive storage.

What actually changed around 2018–2020

Three engineering shifts made "dashboards plus logs" insufficient at the same time Kubernetes went mainstream:

Service fan-out turned every user request into a small distributed graph.
Deploy cadence made static thresholds lie (Friday's normal is Monday's regression).
Vendor bills tied log indexing and custom metrics to cardinality, which exploded when product teams added user_id to everything "temporarily."

Monitoring tools could page; they could not always explain without a human stitching five UIs. Observability tooling (especially with OpenTelemetry context propagation) exists to shorten that stitch.
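To make context propagation concrete, here is a minimal Python sketch, assuming the W3C Trace Context traceparent header layout (version, 32-hex trace id, 16-hex parent span id, 2-hex flags). The names parse_traceparent and log_line are hypothetical, and a real service would normally let the OpenTelemetry SDK extract the context rather than parse headers by hand.

```python
import json
import re

# W3C Trace Context: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return trace_id, span_id and sampled flag, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff is reserved and all-zero ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_line(message, header, **fields):
    """Build a JSON log line that carries trace_id so logs can be joined to traces."""
    ctx = parse_traceparent(header)
    record = {"msg": message, **fields}
    if ctx is not None:
        record["trace_id"] = ctx["trace_id"]
        record["span_id"] = ctx["span_id"]
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Example header value; the route field is illustrative.
    hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(log_line("inventory lookup slow", hdr, route="/checkout"))
```

The point of the sketch is the join key: once every hop forwards the same trace id and every log line records it, moving from a trace to its logs is a filter rather than a hunt across five UIs.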

A concrete workflow: histogram alert to trace tail

Suppose p99 latency for checkout jumps from 180ms to 620ms on http.route="/checkout". A monitoring-only stack might page on an SLO burn alert. The observable follow-through is:

Confirm the regression in Mimir with histogram_quantile over five-minute windows.
Open Tempo (or Grafana Explore) filtered on service.name="checkout" and high-duration traces.
Identify the slow child span (often a gRPC call to inventory, or an HTTP GET to a feature-flag service).
Correlate to Loki with trace_id in log lines if you still need payload context.

Illustrative recording rule fragment (Prometheus or Mimir compatible):

groups:
  - name: checkout_latency_anchor
    interval: 30s
    rules:
      # p99 over a 5m window; job is kept so the output matches the job: prefix in the rule name
      - record: job:checkout_http_server_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (job, le) (
              rate(http_server_duration_seconds_bucket{job="checkout"}[5m])
            )
          )

The YAML is not the product. The product is that histogram buckets let you ask "which quantile moved?" without minting a new counter per route every sprint.
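For readers who want to see why buckets are enough, the following is a rough Python sketch of the linear interpolation that histogram_quantile performs on classic cumulative buckets. It is a simplified illustration, not the Prometheus source, and the bucket bounds and counts in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    buckets must include the +Inf bucket, written as math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if upper == math.inf:
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    # Hypothetical checkout latency buckets in seconds.
    sample = [(0.1, 700), (0.25, 900), (0.5, 980), (1.0, 1000), (math.inf, 1000)]
    for q in (0.5, 0.9, 0.99):
        print(q, round(histogram_quantile(q, sample), 3))
```

Because the same buckets answer p50, p90, and p99, the quantile is a query-time choice. The trade-off is resolution: the estimate is only as precise as the bucket boundaries around the value you care about.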

Monitoring versus observability: a blunt comparison table

Dimension	Typical monitoring posture	Observable posture we push for on AWS
Primary question	"Are the probes green?"	"Why did latency shift for this cohort?"
Instrumentation	Checks, golden metrics, curated logs	OTel SDKs, bounded labels, trace sampling policy
Cost driver	Per-host or per-seat packaging	Ingest, retention, query, and cardinality you own
Failure it misses	Novel cross-service coupling	Ungoverned log volume, sampling gaps in traces
Best owner	NOC or centralized SRE	Platform + service teams with label budgets

Failure modes we have seen after saying "we are observable now"

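Trace sampling set to "sample everything" in staging, then either copied to prod and forgotten or cut to an aggressive head-sampling ratio to compensate. The first gives you a surprise bill; the second gives you missing tail traces exactly when you need them.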

Logs with ten dimensions of high cardinality written "for searchability." You rebuilt Elasticsearch economics inside Loki.
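A quick way to see the problem is to multiply it out. The sketch below is a back-of-the-envelope Python estimate, assuming the worst case where every label combination occurs; the label names, counts, and budget are hypothetical, and real series or stream counts are usually lower because many combinations never appear.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on series (or Loki streams): product of distinct values per label."""
    return prod(label_cardinalities.values())


def labels_to_drop(label_cardinalities, budget):
    """Return the labels whose removal alone brings the estimate within budget."""
    total = worst_case_series(label_cardinalities)
    if total <= budget:
        return []
    return sorted(
        name
        for name, count in label_cardinalities.items()
        if count > 0 and total // count <= budget
    )


if __name__ == "__main__":
    # Hypothetical label set before and after someone adds user_id "temporarily".
    base = {"service": 40, "route": 25, "status_class": 5}
    with_user = {**base, "user_id": 200_000}
    print(worst_case_series(base))
    print(worst_case_series(with_user))
    print(labels_to_drop(with_user, budget=100_000))
```

A check like this can run in CI against proposed label sets, which is one way to give service teams a label budget they can reason about before the bill arrives.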

Dashboards that only leadership uses. Engineers still SSH. Observability is behavioral, not a license SKU.

Monitoring abandoned because someone read a blog post saying it is obsolete. Synthetics and fixed checks still catch DNS, TLS, and route mistakes that traces never see.

When observability-first is the wrong bet

If you have no platform team, no appetite for upgrades, and no governance on instrumentation, you will spend the licensing savings on incidents. The comparison is unfair if you pit "fully managed SaaS with contractual SLAs" against "three Helm charts and optimism."

If your architecture is honestly two monoliths and one database, aggressive trace-first spend is usually premature. Monitoring plus good logs might be enough until merge frequency or team count forces decomposition.

OpenTelemetry collector: where monitoring and observability meet in config

Teams often ask whether the OpenTelemetry Collector is "the observability layer." In practice it is a policy and routing choke point: receive OTLP from apps, apply tail sampling, batch spans, redact attributes, then export traces to Tempo and metrics to Mimir via remote write. That is where you encode decisions that are neither classic monitoring nor pure exploration.

A trimmed processors + exporters shape we iterate on for AWS (illustrative, not a drop-in production file):

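receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # OTLP/gRPC from application SDKs

processors:
  batch:
    timeout: 5s                  # flush at least every 5s
    send_batch_size: 8192        # or as soon as this many items are queued
  memory_limiter:
    check_interval: 1s
    limit_mib: 800               # hard ceiling; keep it below the pod memory limit
    spike_limit_mib: 200         # soft limit is limit_mib minus this value

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability.svc:4317
    tls:
      insecure: true             # plaintext inside the cluster; enable TLS if policy requires it
  prometheusremotewrite:
    endpoint: https://mimir-gateway.observability.svc/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter first so it can refuse data early
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]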

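Sizing memory_limiter and batch is a trade-off. Undersize memory_limiter and the collector refuses data under load; oversize it and the pod is OOM-killed instead. Larger batches and longer timeouts cut export overhead but delay when telemetry becomes queryable. If you never configure tail sampling, you either store every trace or rely on head sampling and hope the slow one was kept. None of these mistakes shows up on a monitoring dashboard until the monthly bill or a missing trace during an audit forces the conversation.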
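The decision that tail sampling makes can be sketched in a few lines of Python. This illustrates the idea only and is not the Collector's tail sampling processor; the Span type, the 500 ms threshold, and the 5% baseline ratio are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str        # 32 hex characters
    duration_ms: float
    error: bool = False


def keep_trace(spans, slow_ms=500.0, baseline_ratio=0.05):
    """Decide once the whole trace has arrived, which is what makes it tail sampling."""
    if not spans:
        return False
    if any(s.error for s in spans):
        return True      # always keep failures
    # Longest span is used as a stand-in for end-to-end trace duration.
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True      # always keep the slow tail
    # Deterministic baseline: the same trace id gets the same answer everywhere.
    bucket = int(spans[0].trace_id[-8:], 16) / 0xFFFFFFFF
    return bucket < baseline_ratio


if __name__ == "__main__":
    slow = [Span("a" * 32, 620.0), Span("a" * 32, 480.0)]
    fast = [Span("f" * 32, 35.0)]
    print(keep_trace(slow), keep_trace(fast))
```

Head sampling has to decide before it knows whether the request was slow or failed, which is why it can drop the one trace an incident needs. The cost of deciding at the tail is that the collector must buffer spans until the trace is complete, so it interacts directly with the memory limits discussed above.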

Architecture note for AWS and data residency

When buyers in the EU evaluate Grafana stacks, data residency shows up in the same RFP as technical requirements. Running EKS in eu-central-1 or eu-west-1, with S3 buckets and KMS keys in the same jurisdiction, is table stakes for the migrations we lead from US-hosted SaaS. Monitoring can stay green while compliance is red if traces and logs land in the wrong region by accident. Observability work includes Terraform modules and bucket policies, not only Helm values.

Language discipline: words we avoid in client docs

We hold client docs to the same house style as this blog: no filler verbs like "unlock" or "empower," no "journey" metaphors for outages, no rhetorical questions as headings. The goal is that a VP can forward the PDF to their head of platform without apologizing for tone.

SLOs sit in the overlap (and that is fine)

Service level objectives are technically monitoring artifacts: you predefine a budget and alert on burn. In practice, SLO implementation is deeply observability-shaped because good SLOs require histograms, dependency-aware error classification, and trace-backed exemplars when someone asks "which dependency burned the budget?" A gauge that only says up == 1 is monitoring. A burn alert computed from histogram buckets (a quantile, or the fraction of requests slower than the target), with a drill-down into traces, is both.

We still see teams ship twenty-page SLO documents with three Grafana panels. Paper SLOs do not shorten incidents. Executable SLOs backed by Mimir queries and trace filters do.
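As a rough illustration of what "executable" means here, the Python sketch below computes an error-budget burn rate and applies a two-window paging check. The 99.9% target and the request counts are hypothetical, and the 14.4 threshold is the commonly cited fast-burn value (2% of a 30-day budget consumed in one hour), not something specific to our stack. For a latency SLO, the "bad" count would come from histogram buckets: total requests minus those at or under the target bound.

```python
def burn_rate(bad, total, slo_target):
    """Burn rate: observed bad fraction divided by the allowed bad fraction."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if total == 0:
        return 0.0
    return (bad / total) / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both the long and the short window burn above the threshold.

    Each window is a (bad, total) tuple of request counts.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical: 99.9% of checkout requests should finish within the latency target.
    last_hour = (20_000, 1_000_000)   # (requests over target, all requests)
    last_5m = (2_000, 80_000)
    print(round(burn_rate(*last_hour, 0.999), 1))
    print(should_page(last_hour, last_5m, 0.999))
```

Requiring both windows keeps the alert from firing on a burst that has already ended, while the short window lets it clear quickly once the regression is rolled back.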

Runbooks: the human layer neither word replaces

Neither monitoring nor observability removes the need for runbooks that say who owns rollback, how to drain traffic, and which compliance steps require legal sign-off. Telemetry tells you what broke; process tells you whether you are allowed to fix it live. The best migrations we run link runbook steps to specific Explore queries so on-call does not improvise PromQL during adrenaline.

Where to go next

If you are standardizing on OpenTelemetry, LGTM on EKS, and a retention model Finance can reproduce in a spreadsheet, the hard work is sampling, naming, and ownership, not another dashboard pack. At Etalon we start from invoices and cardinality, then map signals to incidents we have actually lived through. If that matches where your platform team is stuck, our services page is the blunt entry point.