Migrating CloudWatch Logs to Grafana Loki on EKS: architecture, retention, and cost

This walkthrough starts from a real EKS + Loki layout we ship for AWS-native shops.

Teams start this migration with two fears: losing CloudWatch's "it just exists" integration with IAM and the control plane, and discovering that Loki plus S3 is cheap on paper but expensive in S3 API requests when retention, compaction, and query patterns are wrong. A third fear shows up in audits: a gap in log availability while subscription delivery, normalizers, or Loki ingest lag and someone still expects the old console search to be authoritative.

This post is for platform leads who need a defensible architecture, retention they can explain to Finance, and a cost model that includes the query path, not only ingest GiB. Nothing here replaces capacity planning on your volumes; it replaces hand-waving.

When this migration is worth the conversation

CloudWatch Logs bills on ingestion, storage, and increasingly on analysis and cross-service movement. At low volume it is a rational operational tax. At high volume or long mandatory retention, the line item grows large enough to need its own budget owner.

Grafana Loki on EKS with S3 (or S3-compatible storage) shifts economics toward object storage plus Kubernetes compute, which you already operate. The trade is operational ownership: you own compaction, cardinality discipline, and the blast radius when someone runs a brutal LogQL range query across every namespace.

We recommend this path when the organization is already committed to Grafana for dashboards and alerting, wants one query language adjacent to Prometheus metrics, and can enforce label discipline on log streams. We recommend against treating Loki as a drop-in replacement for ad hoc full-text forensics across unstructured blobs without governance; in that world you either accept Elasticsearch-class costs somewhere else or you spend quarters fighting how people query.

Reference architecture: CloudWatch to Loki on EKS

There are two common shapes. Most enterprises we see start with Shape A because applications already log to CloudWatch today.

Shape A — Keep CloudWatch as the primary sink for a transition period

Sources: application logs, awslogs driver, Lambda, ECS task logs, and anything else already landing in log groups.
Delivery: subscription filters on log groups, targeting Lambda directly or Amazon Data Firehose (formerly Kinesis Data Firehose) when you need buffering, transformation, or higher throughput with backpressure you can reason about. Direct delivery is simpler; Firehose is heavier but scales and integrates with transforms.
Normalizer: small Lambda or a minimal service on EKS that validates structure, drops noisy fields if policy allows, and writes Loki-native labels consistently. At very high volume, Lambda per-invocation limits and cost may push you toward a stateless forwarder on EKS with autoscaling.
Loki on EKS: distributors receive pushes; ingesters hold recent data and flush chunks; queriers (and query-frontends if you use them) serve Grafana and API clients; the compactor merges index files and drives retention enforcement on the object store.
Storage: S3 for chunks; index shipper pattern (TSDB on current Loki releases, boltdb-shipper on older ones, or whatever the chart defaults to for your Loki major version) unless scale or latency requirements push you to a dedicated index store. IAM is scoped per bucket prefix; cross-account ingestion or query is decided in the architecture doc, not in the security exception queue.
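
As an illustration of the normalizer step, here is a minimal Python sketch of what a Lambda subscription target might do: decode the base64 and gzip payload that CloudWatch Logs delivers, then shape it into the JSON body Loki's push endpoint (/loki/api/v1/push) accepts. The function names and the choice of labels (account, region, log_group) are assumptions for this example, not a fixed contract; the log stream name is deliberately kept out of the labels because it is usually high-cardinality. Firehose delivery wraps records differently and would need its own decoder.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode the base64 + gzip payload CloudWatch Logs sends to a Lambda subscription target."""
    raw = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(raw))


def to_loki_push(payload: dict, account: str, region: str) -> dict:
    """Build a Loki push body with one stream per log group (hypothetical label set)."""
    if payload.get("messageType") != "DATA_MESSAGE":
        return {"streams": []}  # control messages carry no log events
    values = [
        # Loki expects [nanosecond timestamp as a string, log line];
        # CloudWatch timestamps are in milliseconds.
        [str(e["timestamp"] * 1_000_000), e["message"]]
        for e in payload.get("logEvents", [])
    ]
    if not values:
        return {"streams": []}
    labels = {
        "account": account,
        "region": region,
        "log_group": payload["logGroup"],
    }
    return {"streams": [{"stream": labels, "values": values}]}
```

The HTTP POST to Loki, retries, and tenant headers are left out on purpose; they depend on how your gateway and authentication are set up.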

Shape B — New workloads bypass CloudWatch for application logs

OpenTelemetry Collector or Fluent Bit on nodes ships directly to Loki (or to a gateway), while control plane and AWS-managed surfaces stay in CloudWatch. Shape B reduces double-billing and subscription fan-out, but it only works if every team agrees on the agent chain and you still solve org-wide search for the AWS-native streams Finance expects in the old console.

Most migrations we run use Shape A first, then narrow CloudWatch to AWS surfaces and compliance streams while application logs move to Shape B team by team.

Retention: two knobs, one political problem

Retention is not a single number in Loki. You are balancing:

Legal and regulatory minimums (what must exist, in what form, for how long).
Engineering habit (what people actually query in the last hours versus what they think they need for twelve months).

Operational knobs we set explicitly in design reviews:

Concern	What to configure and monitor
Hot versus cold perception	Recent data on ingesters feels "fast"; older data is S3-backed and query latency rises with time range and parallelism.
Global retention	Loki limits and compactor behavior must align with S3 lifecycle rules so that a lifecycle rule never deletes objects out of band while the index still references them.
Per-tenant or per-stream caps	Separate tenants or limits for noisy services so one team's JSON spam does not evict another team's audit trail.
Query-driven cost	Long retentions are cheap until everyone runs weekly "select everything" habits. Rate limits and query timeouts are retention-adjacent controls.

What Finance wants is a sentence: "We retain X days online for investigation, Y days in object storage for compliance, and we do not pay for full indexing on every field." Loki helps with that sentence only if compactor lag and lifecycle rules are tested like any other data plane.
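
One way to make that test concrete is a small check that compares the retention Loki enforces with the S3 lifecycle expiration on the same prefix. The sketch below is an assumption-laden example: the RetentionPlan fields, the lifecycle_conflicts function, and the seven-day safety margin are all hypothetical, and the values would come from your own Loki config and bucket lifecycle policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetentionPlan:
    loki_retention_days: int            # retention the compactor is configured to enforce
    compactor_lag_days: int             # assumed worst-case delay before deletes are applied
    s3_expiration_days: Optional[int]   # lifecycle expiration on the chunk prefix, None if no rule


def lifecycle_conflicts(plan: RetentionPlan, safety_margin_days: int = 7) -> List[str]:
    """Flag lifecycle rules that could delete objects Loki still expects to read."""
    if plan.s3_expiration_days is None:
        return []  # no lifecycle rule, so nothing can be deleted out of band
    needed = plan.loki_retention_days + plan.compactor_lag_days + safety_margin_days
    if plan.s3_expiration_days < needed:
        return [
            f"S3 expires objects after {plan.s3_expiration_days} days, "
            f"but Loki may reference them for up to {needed} days"
        ]
    return []
```

Running a check like this in CI against the rendered Helm values and the bucket policy is one option; the point is that the two numbers are compared somewhere other than in an incident review.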

Cost model: paste these rows into your spreadsheet

Do not compare CloudWatch ingest GiB to Loki ingester CPU in isolation. Model at least the following monthly drivers (use your AWS pricing and observed request rates):

Cost driver	What to measure	Typical surprises
EKS compute	CPU and memory requests and limits for ingester, distributor, querier, compactor, and optional query-frontend pods	An under-sized compactor shows up as rising S3 object count and slower queries before CPU graphs scream.
S3 storage	Chunk and index GiB after compression	Replication and non-current versions if versioning is on for the wrong reasons.
S3 requests	PUT during ingest and compaction; GET/LIST during queries and backfills	Broad LogQL over long windows turns into LIST amplification; backfills replay the same tax.
Data transfer	Cross-AZ traffic between agents, Loki pods, and S3 endpoints	"Logs are cheap" dies here if queriers and index gateways chat across AZs without topology hints.
Egress	Grafana Cloud or VPN readers pulling large result sets	Often forgotten in "self-hosted is free" spreadsheets.
Human time	Runbooks, paging, and dual visibility during migration	Real money; omit it and the CFO sends everyone back to the renewal.
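
The rows above can be turned into a small model before they reach a spreadsheet. This is a sketch under stated assumptions: the Prices and Usage field names are hypothetical, no price defaults are provided (use your own AWS pricing and observed rates), and compute is simplified to vCPU-hours. One detail worth encoding: S3 bills LIST requests in the same price class as PUT, not GET, so query-driven LIST amplification lands on the more expensive request rate.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Prices:
    vcpu_hour: float
    s3_gib_month: float
    s3_put_list_per_1k: float   # S3 prices PUT, COPY, POST, and LIST together
    s3_get_per_1k: float
    cross_az_gib: float
    egress_gib: float
    engineer_hour: float


@dataclass
class Usage:
    vcpu_hours: float
    stored_gib: float
    put_requests: float
    list_requests: float
    get_requests: float
    cross_az_gib: float
    egress_gib: float
    engineer_hours: float


def monthly_cost(u: Usage, p: Prices) -> Dict[str, float]:
    """Return one monthly figure per cost driver, plus a total."""
    rows = {
        "eks_compute": u.vcpu_hours * p.vcpu_hour,
        "s3_storage": u.stored_gib * p.s3_gib_month,
        "s3_requests": (u.put_requests + u.list_requests) / 1000 * p.s3_put_list_per_1k
        + u.get_requests / 1000 * p.s3_get_per_1k,
        "data_transfer": u.cross_az_gib * p.cross_az_gib,
        "egress": u.egress_gib * p.egress_gib,
        "human_time": u.engineer_hours * p.engineer_hour,
    }
    rows["total"] = sum(rows.values())
    return rows
```

Keeping each driver as its own row, not only the total, makes it easier to see which engineering behavior moved the bill from one month to the next.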

CloudWatch's advantage is line-item simplicity. Loki's advantage is separation of ingest, storage, and query so engineering behavior shows up on the bill you own. If you do not assign an owner to query SLOs and S3 request dashboards, you will recreate vendor economics inside your own account.

Failure modes we rehearse before cutover

IAM and trust policies: cross-account log delivery and IRSA for Loki pods are easy to get almost right. Almost right is throttling and silent drops.
Subscription filter limits: per log group and per region quotas; burst traffic during deploys can lag delivery. Firehose buffering metrics deserve alarms.
Log structure drift: one service ships JSON, another ships unstructured lines; labels differ per deploy. Without a schema or label contract, cardinality explodes and ingesters refuse streams or Grafana becomes unusable.
Dual truth: during migration, CloudWatch and Loki disagree for minutes after incidents. Runbooks must state which source is authoritative for which workload class and for audit evidence.
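
The label contract mentioned under log structure drift can be rehearsed offline against a sample of stream label sets. The sketch below is a hypothetical example: the contract_violations function, the allowed keys, and the per-key value cap are assumptions you would replace with your own contract.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def contract_violations(
    streams: Iterable[Dict[str, str]],
    allowed_keys: Set[str],
    max_values_per_key: int,
) -> List[str]:
    """Report label keys outside the contract and allowed keys with too many distinct values."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for labels in streams:
        for key, value in labels.items():
            seen[key].add(value)

    problems: List[str] = []
    for key in sorted(seen):
        if key not in allowed_keys:
            problems.append(f"label '{key}' is not in the contract")
        elif len(seen[key]) > max_values_per_key:
            problems.append(
                f"label '{key}' has {len(seen[key])} distinct values "
                f"(cap is {max_values_per_key})"
            )
    return problems
```

For example, feeding it the label sets from a day of normalizer output with allowed_keys={"cluster", "namespace", "app", "level"} would surface a stray request_id label before it reaches the ingesters.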

Where to go next

If you are evaluating a move from CloudWatch Logs to Grafana Loki on EKS, the next step is not another blog post; it is a two-week discovery that pins volumes, retention sentences, and query personas against a real architecture sketch. Etalon builds this path for AWS-native enterprises that want to leave vendor lock-in without leaving operational reality at 03:00.