Loki, Elasticsearch, and ClickHouse for log storage at 2TB per day: what we actually recommend

Between one and five terabytes of logs per day, every storage engine becomes a finance and operations problem, not a benchmark chart. Here is how we compare Grafana Loki, Elasticsearch, and ClickHouse after running all three in production migrations off commercial SaaS, including when we steer clients away from Loki.

If you assume we always sell Grafana Loki because we migrate enterprises onto Grafana stacks, you will misread this post. We have run Loki, Elasticsearch (and OpenSearch), and ClickHouse for log storage at ingest envelopes between one and five terabytes per day, and the honest answer is often "two of the above," not a single winner. The reader objection we hear in every boardroom is fair: we are a Grafana-stack consultancy. That is exactly why we publish the cases where Loki is the wrong tool. Credibility here is not neutrality. It is predictable recommendations tied to constraints we can defend with invoices and on-call rotations, not with a feature matrix from a vendor PDF.

At one to five terabytes per day, every engine stops being a benchmark argument and becomes a finance plus operations problem. Compression ratios, hot retention, query parallelism, and S3 request economics matter more than whether a grep-shaped query feels fast in a demo.

Axis 1: Ingestion cost per terabyte per day (what actually moves the bill)

We stop teams from comparing "list price of storage" and calling that a cost model. At this scale the bill is a bundle: compute for ingest and compaction, hot block storage, object storage for warm and cold, egress you forgot to model, and API request rates on the query path that show up two quarters later.

Grafana Loki (single binary or microservices mode, object storage backend) tends to win on stored gibibytes per day when logs compress well and you enforce label discipline. Loki is not magic. It is cheap because it refuses to index everything the way Elasticsearch does. The trade is query behavior, which we cover below. In migrations where we replaced a commercial log SaaS with Loki on EKS plus S3 in eu-central-1-class pricing, we repeatedly landed in a band of roughly $0.03–$0.12 per ingested GB-month all-in at steady state for a mature deployment, with the wide range driven by retention, query parallelism, and how aggressively teams used LogQL range scans versus dashboards. Translate that mental model to 2 TB/day ingest: if you normalize crudely to 60 TB/month raw-ish before replication overhead, you are already in "talk to Finance with a spreadsheet" territory, not a tweet.
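To make the spreadsheet conversation concrete, here is a minimal sketch of the arithmetic. It reads the band above as dollars per GB retained per month, which is one possible reading; that interpretation, the decimal units, the 30-day month, and the retention values are assumptions for illustration, not a quote.

```python
# Rough monthly envelope for a log store at a given ingest rate.
# Assumptions (not from any invoice): decimal units, steady state,
# and the per-GB band applied to the volume retained at any moment.

GB_PER_TB = 1_000


def steady_state_retained_gb(tb_per_day: float, retention_days: int) -> float:
    """Volume held at steady state: daily ingest times retention window."""
    return tb_per_day * GB_PER_TB * retention_days


def monthly_cost_band(
    tb_per_day: float,
    retention_days: int,
    low_per_gb_month: float,
    high_per_gb_month: float,
) -> tuple[float, float]:
    """Return (low, high) monthly cost in dollars for the given band."""
    retained = steady_state_retained_gb(tb_per_day, retention_days)
    return (retained * low_per_gb_month, retained * high_per_gb_month)


if __name__ == "__main__":
    for days in (30, 90):
        low, high = monthly_cost_band(2, days, 0.03, 0.12)
        print(f"2 TB/day, {days}d retention: ${low:,.0f} to ${high:,.0f} per month")
```

The point of the sketch is the shape, not the totals: retention multiplies the whole band, which is why retention politics belongs in the cost model from day one.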

Elasticsearch or OpenSearch usually costs more per retained terabyte on the hot tier because the inverted index is greedy. You buy IOPS and large JVM heaps, then you buy more nodes when merges fall behind. Where ES can still win economically is odd: if the business refuses label discipline and insists on treating every field as searchable forever, Loki becomes either expensive in engineer time or expensive in S3 GETs once people brute-force exploration. ES makes that mess predictably expensive in hardware instead of surprisingly expensive in behavior.

ClickHouse often lands in the middle on storage thanks to columnar compression, and it can be shockingly good on cold S3-backed MergeTree if you design sorting keys and TTL rules with intent. The hidden line item is merge and parts pressure: under-provisioned merges show up as slow inserts, then as read amplification, then as a pager event. Teams that "saved money" on fewer vCPUs frequently spend it back in incident hours unless someone owns ClickHouse the way a DBA used to own Oracle.

We are not going to pretend one table of dollars applies to every tenant mix. What we do show buyers is a ratio model: pick a reference month, split ingest compute, query compute, hot storage, object storage, API requests, and support hours (internal or external). The winner per row changes. The system that wins the sum changes with workflow.
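A minimal sketch of that ratio model follows. The engines are deliberately anonymous (`engine_a`, `engine_b`, `engine_c`) and every number is a made-up relative unit, so nothing here should be read as a measurement of Loki, Elasticsearch, or ClickHouse. The helper names are hypothetical as well.

```python
# Ratio model sketch. Engines are anonymous and every number is a
# made-up relative unit, chosen only to show the mechanics.
BUCKETS = [
    "ingest_compute", "query_compute", "hot_storage",
    "object_storage", "api_requests", "support_hours",
]

Costs = dict[str, dict[str, float]]


def winner_per_bucket(engines: Costs) -> dict[str, str]:
    """Cheapest engine for each cost bucket (the winner per row)."""
    return {b: min(engines, key=lambda e: engines[e][b]) for b in BUCKETS}


def totals(engines: Costs) -> dict[str, float]:
    """Sum of all buckets per engine."""
    return {name: sum(costs[b] for b in BUCKETS) for name, costs in engines.items()}


def winner_overall(engines: Costs) -> str:
    """Engine with the lowest total for the reference month."""
    t = totals(engines)
    return min(t, key=t.get)


def with_query_multiplier(engines: Costs, engine: str, factor: float) -> Costs:
    """Copy of the model where one engine's query compute is scaled."""
    scaled = {name: dict(costs) for name, costs in engines.items()}
    scaled[engine]["query_compute"] *= factor
    return scaled


# Placeholder reference month, relative units only.
REFERENCE_MONTH: Costs = {
    "engine_a": {"ingest_compute": 20, "query_compute": 30, "hot_storage": 10,
                 "object_storage": 8, "api_requests": 12, "support_hours": 25},
    "engine_b": {"ingest_compute": 35, "query_compute": 15, "hot_storage": 30,
                 "object_storage": 10, "api_requests": 3, "support_hours": 15},
    "engine_c": {"ingest_compute": 28, "query_compute": 12, "hot_storage": 14,
                 "object_storage": 9, "api_requests": 8, "support_hours": 40},
}

if __name__ == "__main__":
    print(winner_per_bucket(REFERENCE_MONTH))
    print(winner_overall(REFERENCE_MONTH))
    # Same month, but a workflow change triples query compute on engine_a.
    print(winner_overall(with_query_multiplier(REFERENCE_MONTH, "engine_a", 3)))
```

With these placeholders, no engine wins every row, and the overall winner flips once a single workflow change scales one bucket. That is the behavior the model is meant to expose before anyone signs off on a total.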

Below is an illustrative monthly envelope at ~2 TB/day sustained ingest (order of magnitude, not a quote). It exists so you can argue in the right room: Finance cares about totals, Engineering cares about which line moved when someone changed retention.

Cost bucket (example split)	Loki on EKS + S3	OpenSearch hot + S3 cold	ClickHouse on EKS + S3 cold
Ingest and compaction compute	Medium: scales with ingester and compactor policy	High: JVM heap + indexing burns CPU	Medium–high: merges dominate at the edge
Hot block storage (30 days)	Lower if chunks compress well	Higher: replicas × shards × fast disks	Lower–medium: excellent compression, watch replication
Object storage steady state	Often the winner on $/GiB	Depends on frozen / snapshot layout	Often strong if parts stay wide and merges healthy
S3 API / request charges	Risk if queries are wide and parallel	Lower on interactive path if hot is local	Risk if parts are tiny or TTL is chaotic
People cost (honest)	Need Loki + Grafana depth	Easier hire, familiar incidents	Need a real ClickHouse owner

If one row makes you angry, good. That row is the conversation we want in week one, not in month nine when the CFO asks why S3 grew 40% after engineers got Grafana access company-wide.

Axis 2: Query flexibility and ergonomics (who gets happy, who gets blocked)

Elasticsearch is still the comfort food of text search. Lucene semantics, KQL or Query DSL, tokenizers, analyzers, fuzzy search, and field-centric relevance are solved problems people already know. If your security team thinks in Elastic Security detection rules and your analysts live in Kibana Discover, you are not going to "train them into LogQL" in a two-week migration window without political cost.

ClickHouse wins when logs are really semi-structured events and the questions are SQL-shaped: group by tenant, percentile latency over windows, funnel counts, joins between a logs table and a dimension table that should never have been joined in Elasticsearch in the first place. The ergonomics are ClickHouse SQL plus whatever thin BI layer you allow. If your organization already runs Grafana with the ClickHouse data source, Metabase, or Lightdash over the same warehouse, the curve is gentler than outsiders expect.

Loki is best when Kubernetes labels (namespace, pod, container, trace id) are the primary access path and LogQL pipelines feel natural to engineers who already live in Grafana. Loki is weaker when the primary workflow is ad hoc full-text forensics across unstructured blobs, especially when nobody knows which labels exist this week because fourteen teams ship logs differently.

Concrete pattern, three tools:

# Loki: cheap when you stay inside label selectors + line filters
{cluster="prod", namespace="checkout"} |= "payment_timeout"
  | json | status_code >= 500

// Elasticsearch: powerful when text and analyzers matter
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message": "payment timeout" }},
        { "range": { "@timestamp": { "gte": "now-1h" }}}
      ]
    }
  }
}

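-- ClickHouse: when you need SQL analytics over log-shaped rows
-- Assumes logs_per_minute is a pre-aggregated rollup whose p99_ms column
-- stores quantileTDigestState(0.99) values, hence the -Merge combinator.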
SELECT tenant_id, quantileTDigestMerge(0.99)(p99_ms) AS p99
FROM logs_per_minute
WHERE ts >= now() - INTERVAL 6 HOUR
GROUP BY tenant_id
ORDER BY p99 DESC
LIMIT 50;

If your engineers reach for the middle option every day and the first option feels "too limiting," that is not a small preference. It is a routing decision about which engine owns the hot path.

Axis 3: Operational complexity (what wakes you up at 03:00)

Elasticsearch complexity is familiar misery: JVM tuning, heap pressure, shard count regret, ILM policies that looked fine until reindex jobs stacked, zone awareness mistakes that only show up during an AZ failure. The upside is decades of runbooks and hiring market depth. If you operate ES today, you already know your failure modes. Migrating away is often a political problem disguised as a technical one.

Loki complexity is different: compaction, chunk and index shipping, ingester ring issues, object storage consistency assumptions, and cardinality explosions via bad labels. Loki can be boring on-call, but only if you enforce standards: max labels per stream, cardinality limits, rate limits on queries, and separate read paths for humans versus batch jobs. We have seen teams "save money" by sharing queriers between Grafana dashboards and CI log scanners. They did not save money.
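Label discipline is easier to enforce when you can show a team what a bad label does to stream count. In Loki, every unique combination of label values is its own stream. The sketch below counts streams in a sample of label sets and flags labels with too many distinct values. The sample data, the function names, and the threshold of 10 are all hypothetical; real enforcement belongs in Loki's own limits configuration, not in a script.

```python
from collections import defaultdict

LabelSet = dict[str, str]


def stream_count(label_sets: list[LabelSet]) -> int:
    """Each unique combination of label values is one stream."""
    return len({frozenset(labels.items()) for labels in label_sets})


def distinct_values_per_label(label_sets: list[LabelSet]) -> dict[str, int]:
    """How many different values each label name takes in the sample."""
    values: dict[str, set[str]] = defaultdict(set)
    for labels in label_sets:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def offenders(label_sets: list[LabelSet], max_values: int) -> list[str]:
    """Label names whose distinct-value count exceeds the threshold."""
    counts = distinct_values_per_label(label_sets)
    return sorted(name for name, n in counts.items() if n > max_values)


def without_label(label_sets: list[LabelSet], name: str) -> list[LabelSet]:
    """Same sample with one label dropped, to see the stream count recover."""
    return [{k: v for k, v in labels.items() if k != name} for labels in label_sets]


# Hypothetical sample: one team put a request id into a label.
SAMPLE = [
    {"namespace": "checkout", "container": "api", "request_id": f"r{i}"}
    for i in range(100)
] + [{"namespace": "search", "container": "api"}]

if __name__ == "__main__":
    print(stream_count(SAMPLE), offenders(SAMPLE, max_values=10))
    print(stream_count(without_label(SAMPLE, "request_id")))
```

Running a check like this against a sample of what each team actually ships tends to end the "it is only one extra label" argument quickly.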

ClickHouse complexity is the one teams underestimate most often. MergeTree physics, parts explosion, mutation semantics, replicated merges, ZooKeeper or ClickHouse Keeper flakiness, backups that are not "S3 snapshot and hope," and version upgrades that require reading the release notes like civil defense instructions. ClickHouse rewards a real platform owner. If nobody is named on the RACI chart, ClickHouse is not cheaper. It is deferred.

Axis 4: S3 and object storage behavior (the silent budget killer)

Loki treats S3 (or compatible object storage) as the source of truth for chunks. That is the feature. It is also where naive deployments die: LIST and GET amplification during broad time range queries, small chunk sizes that looked fine at 50 GB/day and become a tax at 2 TB/day, lifecycle transitions that interact badly with compaction. We model S3 costs with the same seriousness as EC2. If your FinOps tool buckets everything as "S3: production logs," you are flying blind.
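A back-of-the-envelope way to see GET amplification is to count the chunks a query has to touch. The sketch below is deliberately crude: it assumes one GET per chunk, uniform chunk time spans, and no cache hits, and it ignores index lookups entirely. The stream counts, query volumes, and the per-request price are placeholders; take the real price from your region's price list.

```python
import math


def chunk_gets_per_query(
    streams_matched: int, range_hours: float, chunk_span_hours: float
) -> int:
    """Chunks fetched if every matched stream is read for the whole range."""
    chunks_per_stream = math.ceil(range_hours / chunk_span_hours)
    return streams_matched * chunks_per_stream


def monthly_get_cost(
    gets_per_query: int,
    queries_per_day: int,
    usd_per_1000_gets: float,
    days: int = 30,
) -> float:
    """Request charges per month for one query shape, in dollars."""
    monthly_gets = gets_per_query * queries_per_day * days
    return monthly_gets / 1000 * usd_per_1000_gets


if __name__ == "__main__":
    price = 0.0004  # placeholder USD per 1,000 GETs, not a quoted rate
    narrow = chunk_gets_per_query(streams_matched=50, range_hours=1, chunk_span_hours=1)
    wide = chunk_gets_per_query(streams_matched=5_000, range_hours=24, chunk_span_hours=1)
    print(narrow, wide)
    print(monthly_get_cost(wide, queries_per_day=200, usd_per_1000_gets=price))
```

The absolute dollars are less interesting than the ratio between the narrow and wide shapes. Halving the chunk span doubles the request count for the same query, which is how a setting that looked harmless at low volume turns into a line item.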

Elasticsearch hot tiers want fast local or EBS volumes; object storage shows up more often in frozen or searchable snapshots patterns depending on the distribution. The object storage curve is different: less chatty for interactive search if the hot set is sized correctly, more painful when force-merge dreams meet reality.

ClickHouse with MergeTree on S3 can be extremely cost-effective for cold retention, but insert and merge patterns determine whether you are doing large sequential writes or accidentally building a million tiny parts that turn metadata operations into a bottleneck. ClickHouse plus bad defaults is how you meet S3 request billing personally.
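The tiny-parts problem is mostly arithmetic on insert batch size. In MergeTree, each insert creates at least one new part per partition it touches, and background merges then have to fold those parts together. The sketch below estimates new parts per day before merges. The average row size, the batch sizes, and the one-partition-per-insert default are assumptions for illustration, and the function names are hypothetical.

```python
import math

BYTES_PER_TB = 10**12


def rows_per_day(tb_per_day: int, avg_row_bytes: int) -> int:
    """Approximate row count for a daily ingest volume."""
    return tb_per_day * BYTES_PER_TB // avg_row_bytes


def new_parts_per_day(
    rows: int, rows_per_insert: int, partitions_per_insert: int = 1
) -> int:
    """Parts created per day before merges, one per insert per partition."""
    inserts = math.ceil(rows / rows_per_insert)
    return inserts * partitions_per_insert


if __name__ == "__main__":
    rows = rows_per_day(tb_per_day=2, avg_row_bytes=500)  # assumed row size
    for batch in (1_000, 500_000):
        print(batch, new_parts_per_day(rows, batch))
```

Under these assumptions, moving from thousand-row inserts to half-million-row inserts cuts the number of new parts by a factor of 500. Batching at the shipper or in an insert buffer is usually the first lever to pull, before adding vCPUs for merges.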

Axis 5: Tooling ecosystem (what plugs in on Monday)

Elasticsearch wins raw ecosystem breadth: Beats, Logstash, Elastic Agent, security content, vendor integrations, books, courses, StackOverflow answers from 2014 that still work. If your compliance team has a checklist that literally names Elastic, that matters more than our opinion.

Loki wins Grafana-native shops: Explore, Alerting, Recording rules, Tempo correlation, Mimir next door, OpenTelemetry pipelines that already land in Grafana Agent or the OpenTelemetry Collector. If your north star is one UI and one operations language for metrics, logs, and traces, Loki is the pragmatic default even when it is not the theoretical optimum.

ClickHouse wins analytics and data platform adjacency: dbt, BI tools, batch ETL, federated queries, and teams that already have a data engineering bench. Logs become another table in a world that already thinks in tables.

A decision matrix we actually use in workshops

This is not a scorecard for children. Green does not mean "good." It means "lower pain for that axis given normal constraints." Your constraints will move cells.

Axis	Loki	Elasticsearch / OpenSearch	ClickHouse
Ingestion $/TB/day at multi-TB scale	Strong if labels are sane; S3 storage efficient	Higher hot-tier cost; predictable if you already operate ES	Strong compression; watch merges and parts
Query ergonomics	Great for label-first K8s workflows	Best for full-text + mature DSL	Best for SQL analytics and joins
Operational complexity	Different from ES; needs Loki-specific discipline	Known devil; deep hiring pool	Easy demo, hard production without an owner
S3 behavior	Chunk + index patterns; GET/LIST risk on bad queries	More EBS-shaped hot path; cold depends on architecture	Powerful cold tiers; merge mistakes bill silently
Ecosystem	Grafana-native wins	Broadest integrations	Data platform wins

Where we recommend against Loki even though we like the Grafana stack

We recommend against Loki as the primary log store when security analytics needs tokenized full-text, fuzzy matching, and rich detection content your team already built in Elastic Security, and when replatforming detections is not funded. We also push away from Loki when the dominant workflow is unstructured forensic search across unknown fields and the organization refuses schema or label governance. In those worlds, you either build a second store for analytics, accept Elasticsearch costs, or spend nine months fighting human behavior with limits that people route around.

Where Elasticsearch remains the right call

Elasticsearch is the right call when the team, the detections, and the compliance narrative are already Elastic-shaped, and the business value is search and security workflows, not cheap long-term retention. It is the right call when you need cross-field relevance and analyzer pipelines that are boring in ES and non-idiomatic in Loki. It is also the right call when hiring matters: you can recruit ES operators in most European markets without explaining why logs are not indexed "the normal way."

Where ClickHouse outperforms both, and where teams underestimate the bill

ClickHouse wins when log data is really business analytics with SQL, high cardinality grouping, long retention, and predictable query shapes, especially when paired with a thoughtful sorting key and TTL strategy. We have seen ClickHouse crush Elasticsearch on scan-heavy questions once tables are designed by someone who knows MergeTree.

The underestimate is always operations: merges, replication lag during incidents, upgrade planning, and on-call depth. ClickHouse is not "Postgres with columnar tricks." If you do not have a named owner, you do not have ClickHouse. You have a future postmortem.

Closing recommendation, bluntly

If you are a Kubernetes-first platform team standardizing on Grafana for metrics and traces, and your logs are label-disciplined, Loki plus S3 is usually the best default in the 1–5 TB/day envelope. If you are a security analytics shop living in Elastic content, stay on ES until you have a funded migration for detections, not just for indexes. If you are a data platform org with SQL power users and batch plus interactive analytics on the same log rows, ClickHouse is often the analytical winner, and you should budget real platform engineering, not a three-node hobby cluster.

Where to go next

If you are evaluating a multi-terabyte log migration off a commercial SaaS and you want a second opinion that includes when not to pick Loki, we publish this way on purpose. Etalon ships Grafana LGTM paths for AWS-native enterprises, and we also ship honest routing decisions between engines when the workload is not Grafana-shaped. Send us the constraints that never make it into the RFP: query patterns, retention politics, and who owns on-call. Those three facts decide the architecture. Everything else is commentary.