Adaptive Logs Are Interesting. Here Is What They Do Not Tell You About Running Loki at Scale on AWS.

Grafana's Adaptive Logs feature, which shipped to Grafana Cloud recently, uses a sampling model to automatically drop high-volume, low-signal log lines. The pitch is compelling: cut your log ingestion bill without writing a single drop rule by hand. We have spent the last several months building self-hosted Loki deployments on AWS for clients migrating off Datadog and Splunk, and the question we keep getting is: can we get the same thing in a self-hosted stack? The honest answer is yes, partially, and the gap between 'partially' and 'fully' will determine whether your migration succeeds or quietly fails six months in.

The feature landed in a few observability newsletters this week, and the pitch is real: it watches your log stream, identifies lines that are high-volume and low-signal, and generates drop rules automatically. For Grafana Cloud customers, this is a meaningful quality-of-life improvement.

For the rest of us running self-hosted Loki on AWS — which is most of the companies we work with — the feature does not exist in the same form. What does exist is a set of primitives that, assembled correctly, get you 80–90% of the same outcome. The remaining 10–20% requires operational discipline that no feature flag will replace.

This post is about that assembly. Specifically: how we structure log volume control in a self-hosted Loki deployment on AWS, what the actual cost levers are, and where the approach breaks down.

The Real Cost Structure of Self-Hosted Loki

Before touching any configuration, you need to understand where your money actually goes. In a Grafana Cloud deployment, you pay per GB ingested and per GB queried. In a self-hosted Loki deployment on AWS, the cost structure is different and less obvious.

A typical production Loki deployment on AWS — distributor, ingester, querier, query-frontend, compactor, ruler, all running on EKS — has roughly this cost profile:

Cost center	Typical % of total	Notes
S3 storage (chunks + index)	30–40%	Grows linearly with retention
EC2 / EKS node compute	25–35%	Dominated by ingesters and queriers
S3 API calls (PUT/GET)	10–20%	Often underestimated; PUTs are expensive
Data transfer (intra-AZ, cross-AZ)	5–15%	Cross-AZ between components hurts
CloudWatch, NAT Gateway, misc	5–10%	Often ignored until it isn't

The numbers above come from three client deployments we audited in the last quarter: one at ~800 GB/day ingestion, one at ~2.1 TB/day, and one at ~4.8 TB/day. The percentages were consistent across all three. Three deployments is a small sample, but it suggests the cost profile holds across that range of volumes.

The implication: dropping log volume helps S3 storage and S3 API costs immediately. It helps compute costs only if you right-size your ingesters after the drop. In our experience, volume reduction without right-sizing cuts the total bill by 30–40%, not the 60–70% that is available once compute shrinks too.
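A rough way to see where those two ranges come from is to apply a volume reduction only to the cost centers that actually shrink. The sketch below uses the midpoints of the table above as cost shares and a hypothetical 70% volume reduction; both the linear-scaling model and the 70% figure are assumptions for illustration, not measurements.

```python
from __future__ import annotations

# Midpoint cost shares taken from the table above (they sum to 1.0).
COST_SHARES = {
    "s3_storage": 0.35,
    "compute": 0.30,
    "s3_api": 0.15,
    "data_transfer": 0.10,
    "misc": 0.10,
}


def total_savings(volume_reduction: float, scales_with_volume: set[str]) -> float:
    """Fraction of the total bill saved, assuming each listed cost center
    shrinks linearly with ingested volume and every other one stays flat."""
    return volume_reduction * sum(COST_SHARES[name] for name in scales_with_volume)


# Hypothetical 70% volume reduction.
without_rightsizing = total_savings(0.70, {"s3_storage", "s3_api"})
with_rightsizing = total_savings(
    0.70, {"s3_storage", "s3_api", "compute", "data_transfer"}
)

print(f"storage and API only:      {without_rightsizing:.0%}")
print(f"plus compute and transfer: {with_rightsizing:.0%}")
```

Under these assumptions the first case lands at about 35% and the second at about 63%. The gap between the two is the compute and transfer share that only moves when you resize the cluster.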

Three Layers of Log Volume Control in Self-Hosted Loki

Adaptive Logs in Grafana Cloud is essentially a managed version of what you can build yourself using three layers. We always implement all three, in order, because each catches a different class of waste.

Layer 1: Drop at the Collector (OpenTelemetry Collector or Promtail)

The cheapest log is one that never leaves the application host. Dropping at the collector means the data never hits your network, never hits Loki's distributor, and never generates an S3 PUT.

With the OpenTelemetry Collector's filterprocessor, you can drop by log body content, severity, or any attribute:

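processors:
  # Conditions in a list are ORed: a record matching any one of them is dropped.
  filter/drop_debug:
    logs:
      log_record:
        # Records with no parsed severity have severity_number 0, which is
        # below INFO, so exclude them explicitly or they are dropped too.
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'

  filter/drop_health_checks:
    logs:
      log_record:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/readyz"'
        - 'attributes["http.target"] == "/livez"'

  filter/drop_noisy_service:
    logs:
      log_record:
        - 'resource.attributes["service.name"] == "payment-gateway" and severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_WARN'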

This is blunt. You are making a binary decision: keep or drop. The risk is that you drop something you needed during an incident. We always recommend keeping a 1–5% sample of dropped lines routed to a separate, short-retention S3 bucket via an awss3exporter so you have a recovery path.
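The keep/drop/sample decision can be sketched outside the collector to make the logic concrete. This is a minimal illustration, not collector code: the `route` function, the 2% `SAMPLE_RATE`, and the severity names are all hypothetical. It hashes each line so the same line always gets the same decision, which makes the sample reproducible across collector restarts.

```python
import hashlib

SAMPLE_RATE = 0.02  # hypothetical: keep 2% of the lines that would be dropped


def route(line: str, severity: str) -> str:
    """Return 'keep', 'sample', or 'drop' for one log line.

    'keep'   -> forward to Loki as usual
    'sample' -> send to the short-retention recovery bucket
    'drop'   -> discard
    """
    if severity.upper() not in {"DEBUG", "TRACE"}:
        return "keep"
    # Hash the line so the decision is deterministic for identical content.
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "sample" if bucket < SAMPLE_RATE else "drop"


print(route("payment authorized id=123", "INFO"))
print(route("cache miss key=42", "DEBUG"))
```

One consequence of hashing on content is that identical repeated lines are either all sampled or all dropped. If you want a sample of every repeated pattern, hash on something that varies per record, such as the timestamp.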

In practice, dropping DEBUG and TRACE at the collector — which is safe for almost every production workload — reduces ingestion volume by 20–40% at companies that have not already done this. That number is consistently surprising to clients who assumed their developers were not logging at DEBUG in production. They are.

Layer 2: Per-Stream Rate Limits and Drop Rules in Loki

Loki has native per-stream and per-tenant rate limiting that most teams configure once and never revisit. The relevant configuration lives in the limits_config section of your Loki config and can be overridden per tenant in the runtime configuration (overrides) file, which Loki reloads periodically without a restart.

limits_config:
  # Global ingestion rate limit per tenant (MB/second)
  ingestion_rate_mb: 64
  ingestion_burst_size_mb: 128

  # Per-stream rate limit — this is the important one
  per_stream_rate_limit: 8MB
  per_stream_rate_limit_burst: 16MB

  # Reject logs with more than this many labels
  max_label_names_per_series: 15

  # Retention per tenant
  retention_period: 30d

The per_stream_rate_limit is your most effective tool for containing runaway services. When a single service starts logging at 500 MB/minute during an incident, this limit caps what each of its streams can push and contains the blast radius. Without it, one misbehaving service can exhaust your ingester memory and take down the entire cluster.
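To build intuition for how a rate plus a burst behaves, here is a simplified token-bucket model using the 8 MB/s rate and 16 MB burst from the config above. It is a sketch of the general technique, not Loki's limiter: the class, the one-push-per-second loop, and the all-or-nothing acceptance of each push are assumptions made for illustration.

```python
class TokenBucket:
    """Minimal token bucket; sizes in MB, time in seconds."""

    def __init__(self, rate_mb_per_s: float, burst_mb: float) -> None:
        self.rate = rate_mb_per_s
        self.burst = burst_mb
        self.tokens = burst_mb  # start with a full bucket

    def allow(self, size_mb: float, elapsed_s: float) -> bool:
        """Refill for elapsed_s seconds, then accept the push only if it fits."""
        self.tokens = min(self.burst, self.tokens + self.rate * elapsed_s)
        if size_mb <= self.tokens:
            self.tokens -= size_mb
            return True
        return False


def accepted_mb(push_mb_per_s: float, seconds: int) -> float:
    """Push one batch per second and return how many MB were accepted."""
    bucket = TokenBucket(rate_mb_per_s=8.0, burst_mb=16.0)
    accepted = 0.0
    for _ in range(seconds):
        if bucket.allow(push_mb_per_s, elapsed_s=1.0):
            accepted += push_mb_per_s
    return accepted


print(accepted_mb(4.0, 60))   # under the limit: everything is accepted
print(accepted_mb(12.0, 60))  # over the limit: throttled to the configured rate
```

In this model a stream pushing 12 MB/s for a minute gets 480 MB through, which is exactly the 8 MB/s rate sustained over 60 seconds. The burst only affects how much of the initial spike is absorbed.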

For the Adaptive Logs equivalent — automatic drop rules — you need to build a feedback loop yourself. The approach we use:

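1. Measure per-stream volume. Loki's loki_distributor_bytes_received_total metric in Mimir (or Prometheus) is broken down by tenant rather than by stream, so it tells you when a tenant is growing faster than a threshold; to find the streams responsible, run a LogQL bytes_over_time query grouped by service_name and log_level.
2. Alert when any stream exceeds a volume budget (e.g., >5% of total ingestion from a single {service_name, log_level} combination).
3. Generate a drop rule for the offending stream and roll it out through version control, either as a collector filter (Layer 1) or as a tighter per-tenant limit in Loki's runtime overrides file.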
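The budget check in the steps above can be sketched as a small function. It assumes you already have per-stream byte counts for a time window from whatever query you use in the first step; the `over_budget` name and the sample numbers are hypothetical.

```python
from __future__ import annotations

BUDGET_SHARE = 0.05  # the 5% per-stream budget from the steps above

Stream = tuple[str, str]  # (service_name, log_level)


def over_budget(
    bytes_by_stream: dict[Stream, int],
    budget_share: float = BUDGET_SHARE,
) -> list[tuple[Stream, float]]:
    """Return streams whose share of total bytes exceeds the budget,
    largest first, together with that share."""
    total = sum(bytes_by_stream.values())
    if total == 0:
        return []
    offenders = [
        (stream, count / total)
        for stream, count in bytes_by_stream.items()
        if count / total > budget_share
    ]
    return sorted(offenders, key=lambda item: item[1], reverse=True)


# Hypothetical one-hour byte counts keyed by (service_name, log_level).
sample = {
    ("checkout", "debug"): 600,
    ("checkout", "info"): 250,
    ("search", "info"): 110,
    ("auth", "warn"): 40,
}

for stream, share in over_budget(sample):
    print(stream, f"{share:.0%}")
```

The output of a check like this is a candidate list, not a drop rule. A human still decides which offenders are noise and which are a service legitimately doing more work.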

This is not as elegant as Adaptive Logs, but it is auditable, version-controllable, and does not require a Grafana Cloud subscription.

Layer 3: Compactor Retention and Tiered Storage

Most self-hosted Loki deployments we audit are storing everything at the same retention period. This is expensive and usually unnecessary.

Loki's compactor supports per-stream retention rules, which means you can keep high-signal streams (errors, security events, audit logs) for 90 days while keeping DEBUG logs from your internal tooling for 7 days.

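compactor:
  retention_enabled: true
  # Assumption: Loki 3.x also requires a delete request store when retention is enabled
  delete_request_store: s3

limits_config:
  retention_period: 30d  # default

# Per-tenant overrides live in the runtime configuration file, not the main config
overrides:
  "tenant-production":
    retention_period: 30d
    retention_stream:
      - selector: '{log_level="debug"}'
        period: 7d
      - selector: '{service_name="internal-tooling"}'
        period: 3d
      # Higher priority wins when a stream matches more than one selector
      - selector: '{log_level="error"}'
        priority: 1
        period: 90d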

Combining differential retention with S3 Intelligent-Tiering for the chunks bucket typically reduces S3 costs by 25–35% without touching ingestion volume at all. S3 Intelligent-Tiering moves objects that have not been accessed in 30 days to a lower-cost tier automatically. For log data, access patterns are heavily front-loaded — most queries hit the last 24–72 hours — so Intelligent-Tiering works well.
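The retention half of that saving is simple arithmetic: at steady state, stored volume is daily volume times retention days, summed over retention classes. The sketch below compares a uniform 30-day policy with a tiered one. The 1,000 GB/day figure and the 40/5/55 split between debug, error, and everything else are hypothetical, and the model ignores Intelligent-Tiering and the 3-day internal-tooling rule.

```python
from __future__ import annotations


def stored_gb(daily_gb: float, classes: list[tuple[float, int]]) -> float:
    """Steady-state stored volume in GB.

    classes is a list of (share_of_daily_volume, retention_days) pairs.
    """
    return sum(daily_gb * share * days for share, days in classes)


DAILY_GB = 1000.0  # hypothetical compressed chunk volume per day

uniform = stored_gb(DAILY_GB, [(1.0, 30)])
tiered = stored_gb(
    DAILY_GB,
    [
        (0.40, 7),   # debug streams kept for 7 days
        (0.05, 90),  # error streams kept for 90 days
        (0.55, 30),  # everything else at the 30-day default
    ],
)
reduction = 1 - tiered / uniform

print(f"uniform: {uniform:.0f} GB, tiered: {tiered:.0f} GB, reduction: {reduction:.1%}")
```

With this split, stored volume falls by roughly a fifth even though error logs are kept three times longer. The result is sensitive to the debug share, so measure yours before promising a number.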

The Architecture Decision That Determines Everything

The single biggest architectural decision in a self-hosted Loki deployment on AWS is whether you run in simple scalable mode (the default for new deployments) or microservices mode.

Simple scalable mode runs read, write, and backend targets. It is easier to operate and sufficient up to roughly 1–1.5 TB/day ingestion on reasonable instance sizes. Above that, you need microservices mode to independently scale ingesters, queriers, and query-frontends.

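The reason this matters for cost control: in simple scalable mode, you can scale the write and read targets independently of each other, but not the components inside each target. The write target bundles distributors with ingesters, and the read target bundles query-frontends with queriers. If ingesters are your bottleneck, scaling the write target also adds distributor capacity you may not need. In microservices mode, you scale ingesters on their own, which is more efficient.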

For clients at 800 GB/day, we run simple scalable mode on three r6g.2xlarge write nodes and three r6g.xlarge read nodes. For clients at 4.8 TB/day, we run microservices mode with 12 ingesters on r6g.4xlarge, 6 queriers on r6g.2xlarge, and 3 query-frontends on r6g.xlarge. The cost difference between running simple scalable mode at 4.8 TB/day versus microservices mode is roughly $3,200/month in compute alone, because simple scalable mode forces you to over-provision.

What the Self-Hosted Approach Does Not Give You

We want to be direct about the gaps, because glossing over them is how migrations fail.

Adaptive Logs' ML-based signal scoring does not exist in self-hosted Loki. Grafana Cloud's Adaptive Logs uses a model that scores log lines by their correlation with alert firings and dashboard queries. It learns which log patterns actually get looked at. The self-hosted equivalent is a human writing rules based on volume metrics. This is slower, requires ongoing attention, and will miss patterns that a model would catch. If your team does not have someone who will own this process, the gap is real.

Loki's ruler-based alerting is not as mature as Grafana Cloud's managed alerting. Running the ruler component yourself means owning its availability. We have seen ruler OOMs during high-cardinality query evaluation. The mitigation is to run the ruler on dedicated nodes with conservative memory limits and to shard alert evaluation across multiple ruler instances, but this adds operational overhead.

Cross-AZ data transfer costs are not obvious until they are. In a multi-AZ EKS deployment, Loki components will communicate across AZs unless you spread them evenly across zones with topology spread constraints and enable Kubernetes topology-aware routing. The spread constraints do not pin traffic by themselves; they make sure every zone has local endpoints for the routing hints to prefer. We have seen cross-AZ transfer add $800–1,400/month to deployments that were not configured for AZ locality. The fix is straightforward but requires deliberate configuration:

# In your Loki Helm values
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: ingester

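You also need to enable topology-aware routing on the relevant Kubernetes Services (the service.kubernetes.io/topology-mode: Auto annotation) so that Service-routed gRPC calls between Loki components prefer same-AZ endpoints. Traffic that Loki routes through its hash ring, such as distributor-to-ingester writes and querier-to-ingester reads, connects to pod addresses directly and is not affected by this setting, so with a replication factor above 1 some cross-AZ traffic remains.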
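A back-of-the-envelope estimate helps decide how much effort AZ locality deserves. The sketch below assumes cross-AZ transfer is billed at $0.01/GB in each direction ($0.02/GB total), which you should check against current AWS pricing for your region. The function name and inputs are hypothetical, and bytes on the wire will differ from ingested bytes because of compression and replication.

```python
CROSS_AZ_USD_PER_GB = 0.02  # assumption: $0.01/GB charged in each direction


def monthly_cross_az_cost(
    daily_gb: float,
    cross_az_fraction: float,
    days: int = 30,
    usd_per_gb: float = CROSS_AZ_USD_PER_GB,
) -> float:
    """Monthly cost of the share of inter-component traffic that crosses zones."""
    return daily_gb * days * cross_az_fraction * usd_per_gb


# With pods spread evenly over three zones and no locality preference,
# a call lands in a different zone about two times out of three.
print(round(monthly_cross_az_cost(1000, 2 / 3), 2))

# With routing that keeps half of that traffic in-zone.
print(round(monthly_cross_az_cost(1000, 1 / 3), 2))
```

Feed it the measured inter-component traffic for one hop at a time. The hops differ a lot in volume, and only some of them can be kept in-zone.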

Putting Numbers on It

Here is what a well-tuned self-hosted Loki deployment looks like versus a Grafana Cloud deployment for the same workload. These are real numbers from a client migration we completed in Q1 2026, anonymized.

Workload: 2.1 TB/day ingestion, 45-day retention, 12 services, mixed log verbosity.

Cost item	Grafana Cloud	Self-hosted (AWS)	Notes
Log ingestion/storage	$18,400/mo	—	Grafana Cloud pricing at 2.1 TB/day
S3 storage + API calls	—	$2,100/mo	With Intelligent-Tiering
EC2/EKS compute	—	$3,800/mo	Microservices mode, r6g instances
EKS control plane + misc	—	$420/mo
Engineering overhead (amortized)	~$200/mo	~$1,200/mo	2h/week vs 8h/week ongoing ops
Total	~$18,600/mo	~$7,520/mo

The 60% cost reduction is real. The engineering overhead line is also real. We estimate 8 hours per week of ongoing operational attention for a deployment at this scale — patching, capacity planning, incident response for the observability stack itself. For teams that do not have that capacity, the savings evaporate.
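The totals and the headline percentage can be reproduced directly from the table, which also makes it easy to see how much engineering overhead the savings can absorb. The dictionary keys are shorthand labels for the table rows; the break-even calculation is an extra illustration, not a figure from the client engagement.

```python
# Monthly costs in USD, copied from the table above.
GRAFANA_CLOUD = {"ingestion_storage": 18_400, "engineering": 200}
SELF_HOSTED = {"s3": 2_100, "compute": 3_800, "eks_misc": 420, "engineering": 1_200}

cloud_total = sum(GRAFANA_CLOUD.values())
self_hosted_total = sum(SELF_HOSTED.values())
reduction = 1 - self_hosted_total / cloud_total

# Monthly engineering cost at which self-hosting stops being cheaper.
infra_only = self_hosted_total - SELF_HOSTED["engineering"]
break_even_engineering = cloud_total - infra_only

print(f"Grafana Cloud: ${cloud_total:,}/mo")
print(f"Self-hosted:   ${self_hosted_total:,}/mo")
print(f"Reduction:     {reduction:.1%}")
print(f"Break-even engineering overhead: ${break_even_engineering:,}/mo")
```

On these numbers the infrastructure alone costs $6,320/month, so the self-hosted option stays cheaper until engineering overhead reaches about $12,280/month. That is a useful ceiling to compare against what a platform engineer's time actually costs you.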

The honest framing: self-hosted Loki is a good decision if you have a platform team that can own it, or if you are willing to hire one. It is a bad decision if you are expecting it to run itself.

Where to Go From Here

If you are evaluating a migration from Grafana Cloud, Datadog, or Splunk to self-hosted Loki on AWS, there are three questions worth answering before you start:

1. What is your current ingestion volume, and do you know which services are responsible for the top 20% of it? If not, instrument that first: you cannot tune what you cannot see.
2. Do you have a platform team with Kubernetes and AWS operational experience, or will the migration itself create the team? The latter is valid but slower.
3. What is your tolerance for the operational gap in features like Adaptive Logs? Some clients decide the gap is acceptable; others decide to run a hybrid, with self-hosted Loki for high-volume, low-sensitivity logs and Grafana Cloud for high-sensitivity streams.

We have done this migration enough times to have strong opinions about the sequencing, the common failure modes, and the AWS-specific configuration that most documentation skips. If you are at the evaluation stage and want a second opinion on your current Loki architecture or a cost model for your specific workload, we are available for a technical call. No sales deck — just an engineer who has done this before looking at your numbers with you.