Sizing Grafana Mimir on AWS EKS for 10M Active Series

How we size Grafana Mimir clusters for 10M active series on AWS EKS — replica counts, cost tradeoffs, and the failure modes to plan for.

Ten million active series is the awkward middle of observability scale. It is large enough that a toy three-node Prometheus setup dies quietly in production, and small enough that every vendor sales deck still claims "we handle that in our sleep." Mimir usually does handle it, if you treat the write path as a throughput job and the read path as a latency-sensitive adversary. This is how we think about sizing when a platform team asks us to land Mimir on AWS EKS at roughly 10M active series — the same envelope we describe for LGTM and migration work on our services page — with the constraint that AWS cost has to stay predictable and on-call load has to stay boring.

This is not exotic computer science. Mimir at mid-enterprise scale is well understood, and a serious self-hosted stack is often an order of magnitude cheaper than Datadog-style SaaS at the same ingestion envelope — assuming you count infra, storage, and engineer time honestly. The part that still separates teams is execution: the cluster shape that survives a bad deploy, a cardinality spike, and a CFO review in the same month.

What "10M active series" actually implies for Mimir

Active series is not a single knob. Two clusters can both report ~10M series and behave completely differently if scrape intervals differ, histogram cardinality explodes, or remote write batches arrive as micro-bursts.

For sizing discussions we normalize on a few inputs we force into the brief before we touch values.yaml:

Samples per second (rough): active series divided by scrape interval. At 15s scrapes, 10M series is about 667k samples/s before replication (10,000,000 / 15). At 30s scrapes, halve that.
Replication factor: we default to RF3 for anything that has to survive a real AZ failure. That triples the writes the ingester tier absorbs.
Query load: "metrics for dashboards only" versus "PromQL is a product feature for hundreds of engineers" changes querier and store-gateway counts more than most people expect.
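
As a sanity check on those inputs, the arithmetic fits in a few lines. This is a minimal sketch that assumes scrapes are spread evenly over the interval; the function names are ours for illustration and are not part of any Mimir tooling.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Mean ingest rate before replication, assuming evenly spread scrapes."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    return active_series / scrape_interval_s


def replicated_samples_per_second(active_series: int, scrape_interval_s: float,
                                  replication_factor: int = 3) -> float:
    """Rate the ingester tier absorbs once each sample is written RF times."""
    return samples_per_second(active_series, scrape_interval_s) * replication_factor


if __name__ == "__main__":
    for interval in (15, 30):
        raw = samples_per_second(10_000_000, interval)
        replicated = replicated_samples_per_second(10_000_000, interval)
        print(f"{interval}s scrape: {raw:,.0f} samples/s raw, "
              f"{replicated:,.0f} samples/s after RF3")
```

Real remote write is burstier than this mean, so treat the result as a floor for throughput planning, not a ceiling.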

If you skip that normalization, you will build a cluster that looks correct in a spreadsheet on Friday and wrong in kubectl top pods the first Tuesday after a holiday freeze.

The component map we actually size first

We treat Mimir as four different problems that happen to share a config repo.

1. Ingesters (write path, TSDB head, the money)

Ingesters are the component you'll think about most and page about most. They hold the hot TSDB head in memory, absorb WAL pressure on every write, and define the blast radius when a deploy goes wrong — which is why we treat them as the first and most important sizing decision, not the last. For ~10M series with RF3, we plan ingesters as the primary memory budget.

Ingesters: start in the 16–32 replica range for RF3 at this scale, depending on scrape interval and how spiky remote write is, each on m6i.2xlarge–m6i.4xlarge class nodes (or r6i if head memory is tighter than CPU). We prefer more smaller ingesters over fewer monsters because a 15-minute ingester restart spread across three replicas is survivable; a 40-minute restart concentrated on one oversized replica is an incident.
Distributors: horizontally scaled, usually CPU-first; think 6–12 replicas on m6i.xlarge–m6i.2xlarge as a first pass, then adjust after watching distributor CPU during peak remote write.
Compactors: sized from compaction lag and object storage churn, not from "one per AZ feels right." Under-sized compactors show up as rising object count, slower queries, and creeping S3 costs.
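
To see why replica count matters as much as instance size, it helps to compute how many in-memory series each ingester carries. A rough sketch, assuming the ring spreads series evenly; `target_series_per_ingester` is a hypothetical budget you would calibrate from your own head memory per series, not a Mimir default.

```python
import math


def series_per_ingester(active_series: int, replication_factor: int,
                        replicas: int) -> float:
    """In-memory series per ingester if the ring spreads load evenly."""
    return active_series * replication_factor / replicas


def replicas_for_target(active_series: int, replication_factor: int,
                        target_series_per_ingester: int, zones: int = 3) -> int:
    """Smallest replica count at or under the target, rounded up to a multiple of zones."""
    needed = math.ceil(active_series * replication_factor / target_series_per_ingester)
    return math.ceil(needed / zones) * zones


if __name__ == "__main__":
    for replicas in (16, 24, 32):
        per = series_per_ingester(10_000_000, 3, replicas)
        print(f"{replicas} ingesters -> {per:,.0f} series each")
    # Hypothetical budget of 1.5M series per ingester, for illustration only.
    print("replicas for 1.5M/ingester:", replicas_for_target(10_000_000, 3, 1_500_000))
```

At 24 replicas and RF3, each ingester holds about 1.25M series; dropping to 16 pushes that to roughly 1.9M, which means more head to rebuild on every restart.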

We once inherited a cluster from a client running twelve ingesters on c6i.4xlarge — CPU-optimized, memory-starved — which looked fine in steady state. During a Black Friday traffic spike, head compaction stalled, WAL pressure climbed past 85%, and by the time on-call paged us the ingesters were OOM-looping faster than the rollout operator could replace them. We doubled the replica count, moved the workload to r6i.2xlarge, and watched the head recover over roughly four hours. The lesson was not the instance SKU on the invoice. It was that "CPU looks fine" is a trap when the workload is memory-bound on the 99th percentile, not the mean — the same pattern we have seen when high-cardinality IoT fleets or aggressive recording rules quietly turn mean utilization into a lie.
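
The gap between mean and tail is easy to demonstrate. The sketch below uses made-up memory samples (synthetic numbers chosen for illustration, not measurements from that cluster) and a hypothetical container limit to show how a comfortable mean can hide a p99 above the limit.

```python
import statistics


def mean_and_p99(samples_gib: list[float]) -> tuple[float, float]:
    """Return (mean, 99th percentile) of per-pod memory samples in GiB."""
    mean = statistics.fmean(samples_gib)
    # quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(samples_gib, n=100, method="inclusive")[98]
    return mean, p99


if __name__ == "__main__":
    # Synthetic data: 97 quiet samples and 3 spikes. Illustration only.
    samples = [18.0] * 97 + [31.0] * 3
    limit_gib = 30.0  # hypothetical container memory limit
    mean, p99 = mean_and_p99(samples)
    print(f"mean={mean:.2f} GiB, p99={p99:.2f} GiB, limit={limit_gib} GiB")
    print("OOM risk" if p99 > limit_gib else "within limit")
```

Alerting on the tail rather than the mean is the cheap version of the lesson above.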

2. Read path (queriers, store-gateways, query-frontends)

Queriers and store-gateways are where dashboard snappiness lives or dies — and where S3 GETs turn into money if you let every engineer run unbounded range queries in parallel. If you let interactive queries share fate with ingestion, you will get a very educational incident. We default to isolating queriers and store-gateways onto separate node groups from ingesters, and we rate-limit at the edge (query-frontend plus limits in Mimir) before we throw money at replicas.

3. Object storage (S3 is the database, sort of)

S3 is where retention and compaction policy become a bill you can explain — or one you cannot. Mimir's economics are compute plus S3 plus a line item people underestimate: S3 API requests. At 10M series, compaction and querying can turn you into a GET-heavy workload if you misconfigure block sizes, caching, and parallelism. When we size this layer, we are thinking about lifecycle rules, request rates during backfills, and whether Finance will recognize the pattern before Engineering does.
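
Request cost is simple multiplication, which is exactly why it gets skipped. A minimal sketch follows; the per-1,000-request prices and the request rates are placeholder assumptions, so substitute current pricing for your region and rates from your own bucket metrics.

```python
SECONDS_PER_DAY = 86_400


def monthly_request_cost(requests_per_second: float, price_per_1000_usd: float,
                         days: int = 30) -> float:
    """Cost of a sustained request rate over a month at a flat per-1,000 price."""
    total_requests = requests_per_second * SECONDS_PER_DAY * days
    return total_requests / 1000 * price_per_1000_usd


if __name__ == "__main__":
    # Placeholder prices (assumptions): verify against current S3 pricing.
    get_price = 0.0004  # USD per 1,000 GET requests
    put_price = 0.005   # USD per 1,000 PUT/LIST requests
    # Hypothetical sustained rates for illustration, not measurements.
    gets = monthly_request_cost(500, get_price)
    puts = monthly_request_cost(20, put_price)
    print(f"GET: ${gets:,.0f}/month, PUT/LIST: ${puts:,.0f}/month")
```

Doubling the sustained GET rate doubles the line, which is why cache hit rate on the read path shows up directly on the bill.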

4. Metadata and coordination

The ring, the KV store, and whatever your chart wires for internal coordination are not glamorous — until they flap once. We do not prescribe a single internal dependency graph because it depends on Helm chart choices and Mimir version, but we budget time for it the same way we budget money: if the metadata path is flaky, your ingesters flap and your bill becomes irrelevant. Incidents here rarely look like "slow query"; they look like thundering herds and half-ready replicas.

A concrete starting Helm shape (not gospel)

This is a starting point we use in internal runbooks when a platform team needs a base for EKS plus kube-prometheus remote_write tests. You will change replicas after a week of production-shaped load.

# mimir-distributed values fragment (illustrative)
mimir:
  structuredConfig:
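    common:
      storage:
        backend: s3
        s3:
          endpoint: s3.eu-central-1.amazonaws.com
          region: eu-central-1
    # Bucket names are set per storage type: Mimir rejects a config in which
    # blocks, ruler, and alertmanager storage share one bucket and prefix.
    # ruler_storage and alertmanager_storage need their own buckets (omitted here).
    blocks_storage:
      s3:
        bucket_name: mimir-blocks-prod
    ingester_client:
      grpc_client_config:
        # 100 MiB gRPC message ceilings for clients talking to ingesters
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600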

ingester:
  replicas: 24
  resources:
    requests:
      cpu: "3500m"
      memory: "28Gi"
    limits:
      memory: "30Gi"
  zoneAwareReplication:
    enabled: true

distributor:
  replicas: 8
  resources:
    requests:
      cpu: "2000m"
      memory: "6Gi"

querier:
  replicas: 12
  resources:
    requests:
      cpu: "2000m"
      memory: "8Gi"

store_gateway:
  replicas: 9
  resources:
    requests:
      cpu: "1500m"
      memory: "12Gi"

compactor:
  replicas: 3
  resources:
    requests:
      cpu: "2000m"
      memory: "8Gi"

rollout_operator:
  enabled: true

The important part is not the exact replica integers. It is that zone-aware replication, explicit resource requests, and gRPC message-size limits are present on day one, because "we will tune later" in Mimir usually means "we will learn about head compaction latency at 03:14."

AWS costs: what moves the needle at this scale

Below is a simplified monthly model we use in stakeholder conversations. The rough monthly cost column is an honest envelope for ~10M active series in a typical RF3, multi-AZ layout; it is not a quote for a specific bill. Before you reuse these numbers, sanity-check them against your own AWS Cost Explorer or the pricing calculator: wrong numbers cited confidently kill credibility faster than no number at all.

Line item	What drives it at ~10M series	Typical failure mode	Rough monthly cost at ~10M series
EKS control plane + node compute	Ingester and querier replica counts, instance families	Too few nodes, CPU throttle, remote write 503s	~$8K–$14K
EBS (gp3 / io)	WAL volume, ingester local SSD vs network storage choices	Wrong volume latency class, ingester stalls	~$400–$900
S3 storage	Retention, compaction health, symbol/table churn	Compaction behind, object count explodes	~$600–$1,500 (12-month retention, compacted)
S3 requests	Compaction, querier block fetches, listing patterns	"Cheap storage, expensive API tax"	~$300–$2,000 (highest variance — parallel queries and bad cache behavior rebuild this line fast)
Data transfer	Cross-AZ replication, query fanout	RF3 + bad placement, inter-AZ bill surprise	~$1,500–$4,000
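
A quick way to keep a table like this honest is to sum the low and high ends of each row and compare the result with whatever headline total you quote. The figures below are copied from the table; the helper name is ours.

```python
# (low, high) monthly USD ranges, copied from the table above.
LINE_ITEMS = {
    "EKS control plane + node compute": (8_000, 14_000),
    "EBS": (400, 900),
    "S3 storage": (600, 1_500),
    "S3 requests": (300, 2_000),
    "Data transfer": (1_500, 4_000),
}


def total_range(items: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(LINE_ITEMS)
    print(f"row sum: ${low:,} to ${high:,} per month")
    for name, (lo, hi) in LINE_ITEMS.items():
        # Share of the high-end total shows which line dominates a bad month.
        print(f"  {name}: {hi / high:.0%} of the high-end total")
```

Compute dominates either end of the range, but data transfer and S3 requests together are the lines with the widest spread.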

Total order of magnitude: roughly $12K–$25K/month of AWS spend at this envelope (the rows above sum to about $11K–$22K before anything the table leaves out), versus $80K–$200K+/month for equivalent SaaS-style coverage when you include ingestion, querying, and retention the way commercial vendors price it. The range is wide on purpose: S3 request patterns and cross-AZ replication are where teams accidentally rebuild the SaaS bill in AWS currency.

What we validate in week one (before we argue about final replica counts)

We do not trust static sizing. We trust four graphs, captured while a load test replays production-shaped remote write.

Ingester memory vs series growth — head growth and restart time. If ingesters climb toward limits without a matching series increase, we look for cardinality shifts, histogram explosions, or bad recording rules.
Distributor CPU vs remote write spikes — Prometheus remote write can arrive in bursts even when mean rate looks polite. Distributors are the pressure relief valve.
Compactor backlog — if compaction lag grows linearly after a feature release, you do not have a mystery. You have a throughput mismatch.
Query latency percentiles split by route — slow queries that correlate with store-gateway CPU or S3 GET rate tell you whether you need more store-gateways, better caching, or fewer parallel touches of the same blocks.

Where Datadog comparisons help, and where they hurt

At this scale Datadog (and peers) are often an order of magnitude more expensive than a disciplined AWS+Mimir stack when you include ingestion, indexing, and retention behaviors that SaaS pricing optimizes for.

The honest caveat: self-hosted is cheaper only if you staff it. Mimir is not "free," it is transferring spend from vendor invoices to EKS nodes and platform engineering hours. If your organization cannot keep upgrades on a schedule, you will pay in incidents instead.

This is not the right approach when

You cannot control cardinality at the source, and politically nobody can say "no" to unbounded label sets. Mimir survives a lot, but it cannot repeal information theory.
You need multi-region active-active querying with naive expectations. You can build it, but it is not the default Helm install.
Your primary win condition is log analytics, not metrics — in which case Mimir is the wrong tool, even if Grafana is on every screen.

Where to go next

If you are evaluating Mimir at mid-enterprise scale, the next practical steps are boring on purpose: capture real scrape intervals and sample rates, load test remote write with burstiness, and model S3 request growth under compaction, not just GiB stored.

We build and migrate Grafana Mimir stacks on AWS EKS for teams that are done paying SaaS prices but refuse to run a science project in production. If you want a second pair of eyes on your sizing spreadsheet before you commit to node groups and retention, Etalon (etalon.systems) is the kind of shop that argues with your Helm values until they match reality.