June 1, 2026 · Mihai · 9 min read

Instrumenting LLM Applications with OpenTelemetry GenAI Semantic Conventions: A Production Walkthrough

The OpenTelemetry GenAI semantic conventions reached stable status in early 2026, and if you are running LLM-backed features in production, you now have a vendor-neutral way to capture token counts, model latency, prompt content, and cost attribution without bolting on a proprietary SDK. This post walks through exactly how we wire that up against a real application stack — AWS Bedrock, a Python FastAPI service, the OTel Collector, and Grafana — and what the traces actually look like once data is flowing.

OpenTelemetry graduated as a CNCF project this spring. That matters less as a badge and more as a signal: the GenAI semantic conventions that have been in flux for two years are now stable enough to build production tooling against. If your team ships LLM-backed features, this is the moment to stop treating model observability as a special case and start treating it as just another span.

This post is a concrete walkthrough. We will instrument a Python FastAPI service that calls AWS Bedrock, pipe telemetry through the OTel Collector, store traces in Tempo and metrics in Mimir, and visualize everything in Grafana. We will also be honest about where the conventions are still rough and where you will hit friction on AWS specifically.

Why the GenAI Conventions Matter Now

Before these conventions stabilized, every observability vendor invented their own attribute names. LangSmith used one schema. Datadog's LLM Observability used another. Arize, Weights & Biases, Helicone — all different. If you instrumented for one, you were locked to one.

The OTel GenAI semantic conventions define a shared vocabulary. The important ones for a Bedrock deployment:

Attribute                        Type      What it captures
gen_ai.system                    string    aws.bedrock, openai, anthropic, etc.
gen_ai.request.model             string    anthropic.claude-3-5-sonnet-20241022-v2:0
gen_ai.request.max_tokens        int       Token limit set by the caller
gen_ai.response.finish_reasons   string[]  end_turn, max_tokens, stop_sequence
gen_ai.usage.input_tokens        int       Tokens consumed in the prompt
gen_ai.usage.output_tokens       int       Tokens in the completion
gen_ai.operation.name            string    chat, text_completion, embeddings

The two usage attributes, gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, are what turn a trace into a cost attribution tool. At $3 per million input tokens and $15 per million output tokens for Claude 3.5 Sonnet, a single span can carry enough information to reconstruct your Bedrock bill by feature, by user, by tenant.

The Stack

For this walkthrough:

  • Application: Python 3.12, FastAPI, boto3 for Bedrock calls
  • Instrumentation: opentelemetry-sdk 1.28, opentelemetry-instrumentation-aws-lambda where relevant, manual spans for Bedrock (no auto-instrumentation exists yet for Bedrock — more on this below)
  • Collector: OTel Collector Contrib 0.102 on ECS Fargate as a sidecar
  • Trace backend: Grafana Tempo 2.5 on EKS, S3 object storage
  • Metrics backend: Grafana Mimir 2.13 on EKS
  • Dashboards: Grafana 11.x

Instrumenting the Bedrock Call

There is no stable auto-instrumentation library for AWS Bedrock as of mid-2026. The opentelemetry-instrumentation-boto3 package exists but does not yet emit GenAI semantic convention attributes — it emits generic AWS SDK spans. You need to wrap the call yourself.

Here is a minimal but production-usable wrapper:

import json

import boto3
from opentelemetry import trace
from opentelemetry.trace import SpanKind
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes as GenAI

tracer = trace.get_tracer("etalon.bedrock", "1.0.0")

# Build the client once at import time; creating one per call repeats
# credential resolution and connection setup on every request.
client = boto3.client("bedrock-runtime", region_name="eu-west-1")

def invoke_claude(
    prompt: str,
    model_id: str = "anthropic.claude-3-5-sonnet-20241022-v2:0",
    max_tokens: int = 1024,
) -> dict:

    with tracer.start_as_current_span(
        "chat anthropic.claude",
        kind=SpanKind.CLIENT,
        attributes={
            GenAI.GEN_AI_SYSTEM: "aws.bedrock",
            GenAI.GEN_AI_OPERATION_NAME: "chat",
            GenAI.GEN_AI_REQUEST_MODEL: model_id,
            GenAI.GEN_AI_REQUEST_MAX_TOKENS: max_tokens,
        },
    ) as span:
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        })

        response = client.invoke_model(
            modelId=model_id,
            body=body,
            contentType="application/json",
            accept="application/json",
        )

        result = json.loads(response["body"].read())

        # Emit token usage — this is what drives cost attribution
        usage = result.get("usage", {})
        span.set_attribute(GenAI.GEN_AI_USAGE_INPUT_TOKENS, usage.get("input_tokens", 0))
        span.set_attribute(GenAI.GEN_AI_USAGE_OUTPUT_TOKENS, usage.get("output_tokens", 0))
        span.set_attribute(
            GenAI.GEN_AI_RESPONSE_FINISH_REASONS,
            [result.get("stop_reason", "unknown")],
        )

        return result

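A few things worth noting here. The span name follows the pattern {operation} {model}. The semantic conventions specify {gen_ai.operation.name} {gen_ai.request.model}; we shorten the model ID to its family for readability. Consistency matters because the span name becomes the span_name label on span-derived metrics and is what you filter on in Tempo. If you use a random name, those metrics and queries become noise.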

The GenAI.GEN_AI_USAGE_INPUT_TOKENS constant comes from the incubating attributes module. As of OTel Python SDK 1.28, these are stable but still in the _incubating namespace for historical reasons. Do not let that stop you from using them in production.

Collector Configuration

The Collector sidecar runs on the same ECS task as the application. It receives OTLP over gRPC on port 4317, batches spans, and exports to Tempo and Mimir.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

connectors:
  # Derive call-count and latency metrics from spans so we can alert
  # on traffic and latency spikes without querying Tempo.
  # spanmetrics is a connector: it is an exporter of the traces
  # pipeline and a receiver of the metrics pipeline.
  spanmetrics:
    dimensions:
      - name: gen_ai.system
      - name: gen_ai.request.model
      - name: gen_ai.operation.name
    histogram:
      explicit:
        buckets: [50ms, 100ms, 250ms, 500ms, 1s, 2500ms, 5s]

exporters:
  otlp/tempo:
    endpoint: https://tempo.internal.etalon.systems:4317
    tls:
      insecure: false
  prometheusremotewrite:
    endpoint: https://mimir.internal.etalon.systems/api/v1/push
    headers:
      X-Scope-OrgID: production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheusremotewrite]

The spanmetrics connector is doing real work here. It turns spans into Prometheus-compatible call-count and latency metrics, labeled with the GenAI attributes listed under dimensions, which means you can write PromQL against per-model traffic without querying Tempo. That matters at scale: Tempo is optimized for trace retrieval, not for aggregate queries over millions of spans. Note that spanmetrics counts and times spans; it does not sum attribute values, so token totals need to reach Mimir as a metric of their own.
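
One way to get token totals into Mimir is to record them as a metric next to the span. The sketch below assumes a histogram created with the OpenTelemetry metrics API, for example meter.create_histogram("gen_ai.client.token.usage"), which is the token usage metric named in the GenAI conventions, with gen_ai.token.type distinguishing input from output. The names record_token_usage and HistogramLike are hypothetical, and the histogram is typed structurally so the snippet runs without the SDK installed.

```python
from typing import Any, Mapping, Protocol


class HistogramLike(Protocol):
    """Anything shaped like an OpenTelemetry Histogram instrument."""

    def record(self, amount: float, attributes: Mapping[str, Any]) -> None: ...


def record_token_usage(
    histogram: HistogramLike,
    usage: Mapping[str, int],
    model_id: str,
    system: str = "aws.bedrock",
    operation: str = "chat",
) -> int:
    """Record input and output token counts as separate histogram points.

    Returns how many points were recorded, so callers can notice
    responses that carried no usage block at all.
    """
    recorded = 0
    for token_type in ("input", "output"):
        count = usage.get(f"{token_type}_tokens")
        if count is None:
            # Skip rather than record 0: a missing count is not a free call.
            continue
        histogram.record(
            count,
            {
                "gen_ai.system": system,
                "gen_ai.operation.name": operation,
                "gen_ai.request.model": model_id,
                "gen_ai.token.type": token_type,
            },
        )
        recorded += 1
    return recorded
```

In the wrapper above, this would be called with the usage dict parsed from the Bedrock response, right after the span attributes are set.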

What You Get in Grafana

Once data flows, you can write PromQL against the span-derived metrics. The query below is a rough per-model load signal, not yet a dollar figure:

# Illustrative: Bedrock call rate weighted by accumulated call
# latency over the last hour, by model. A load proxy, not USD.
sum by (gen_ai_request_model) (
  rate(
    calls_total{
      span_kind="SPAN_KIND_CLIENT",
      gen_ai_system="aws.bedrock"
    }[1h]
  )
  * on(gen_ai_request_model)
  group_left()
  (
    sum by (gen_ai_request_model) (
      increase(
        duration_milliseconds_sum{
          gen_ai_system="aws.bedrock"
        }[1h]
      )
    )
  )
)

That query is illustrative. Real cost attribution requires joining token counts to a price table, which you maintain as a recording rule or a static config map. We keep a gen_ai_model_price_per_1k_input_tokens metric that we push from a small Lambda that scrapes the Bedrock pricing page weekly. Ugly, but it works until AWS exposes pricing via API.

For trace-level investigation, a TraceQL query in Tempo to find all requests that hit the token limit:

{ span.gen_ai.response.finish_reasons =~ ".*max_tokens.*" && span.gen_ai.system = "aws.bedrock" }
| select(span.gen_ai.usage.input_tokens, span.gen_ai.usage.output_tokens, span.gen_ai.request.model)

This surfaces the requests where generation stopped at the max_tokens limit, so you paid for the full output budget but got a truncated response. It is a common source of silent quality degradation that never shows up in error rates.

Honest Tradeoffs and Where This Falls Apart

Prompt capture is a compliance decision, not a technical one

The GenAI conventions include gen_ai.prompt and gen_ai.completion attributes for storing the actual prompt and response text. Do not enable these without a legal review. Prompts in a B2B SaaS context routinely contain PII, customer data, and occasionally credentials that users paste in. Storing them in a trace backend that your entire engineering team can query is a data governance problem.

We recommend capturing prompt hashes (SHA-256 of the normalized prompt) for deduplication and caching analysis, and storing full prompt content only in a separate, access-controlled store with a short retention window if you need it at all.
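
A minimal sketch of such a hash, using only the standard library. The normalization steps here (Unicode NFKC plus whitespace collapsing) are an assumption about what "normalized" should mean; prompt_fingerprint is a hypothetical name, and case is left untouched because it can change model behavior.

```python
import hashlib
import unicodedata


def prompt_fingerprint(prompt: str) -> str:
    """Return the SHA-256 hex digest of a normalized prompt."""
    # NFKC folds visually equivalent code points so they hash the same.
    text = unicodedata.normalize("NFKC", prompt)
    # Collapse whitespace runs so reformatting does not change the hash.
    text = " ".join(text.split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The digest can then be set on the span under an application-specific attribute name of your choosing. A plain hash of a short or guessable prompt can be reversed by brute force, so a keyed HMAC may be the safer choice if the prompts themselves are sensitive.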

No auto-instrumentation for Bedrock means instrumentation drift

Every new Bedrock model invocation pattern — streaming responses, Converse API, Agents for Bedrock — requires manual instrumentation. The Python wrapper above handles invoke_model. The invoke_model_with_response_stream path needs separate handling because token counts only arrive at the end of the stream, which means you need to buffer them and set span attributes after the generator is exhausted.

This is genuinely annoying. The opentelemetry-python-contrib repository has an open PR for Bedrock auto-instrumentation but it has not merged as of this writing. Until it does, you are maintaining wrapper code.
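
For the streaming path, the buffering can be sketched as a generator that passes text through and sets the usage attributes once the stream ends. This assumes the Anthropic Messages event shapes on Bedrock (message_start carrying input tokens, message_delta carrying output tokens and the stop reason, each wrapped as event["chunk"]["bytes"]); verify those against your model's actual stream. The function name stream_with_usage is hypothetical, and span is any object with a set_attribute method, so the snippet runs without boto3 or the SDK.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def stream_with_usage(events: Iterable[Dict[str, Any]], span: Any) -> Iterator[str]:
    """Yield text deltas from a response stream, then set usage on the span."""
    input_tokens = None
    output_tokens = None
    stop_reason = None
    try:
        for event in events:
            chunk = event.get("chunk")
            if not chunk:
                continue
            payload = json.loads(chunk["bytes"])
            kind = payload.get("type")
            if kind == "message_start":
                usage = payload.get("message", {}).get("usage", {})
                input_tokens = usage.get("input_tokens")
            elif kind == "content_block_delta":
                text = payload.get("delta", {}).get("text")
                if text:
                    yield text
            elif kind == "message_delta":
                output_tokens = payload.get("usage", {}).get("output_tokens")
                stop_reason = payload.get("delta", {}).get("stop_reason")
    finally:
        # Runs when the stream is drained, and also if the caller stops
        # early, so partial usage is still attached to the span.
        if input_tokens is not None:
            span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        if output_tokens is not None:
            span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        if stop_reason is not None:
            span.set_attribute("gen_ai.response.finish_reasons", [stop_reason])
```

The span has to stay open until the generator finishes, which means starting it manually and ending it after the last chunk instead of relying on a with block around the initial call.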

spanmetrics cardinality can explode

If you add gen_ai.request.model as a dimension in spanmetrics and your application lets users specify arbitrary model IDs (or you start testing fine-tuned model variants), you will generate high-cardinality metrics. Mimir handles this better than vanilla Prometheus, but you should still set explicit allowlists on the dimension values in your Collector config.
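
Cardinality can also be bounded at the source, before the Collector sees the value. The sketch below is one possible approach, not something the conventions prescribe: the span keeps the real model ID in gen_ai.request.model, while a second, low-cardinality label is computed for use as the metric dimension. ALLOWED_MODELS and metric_model_label are hypothetical names.

```python
from typing import FrozenSet

# Model IDs that are allowed to appear as their own metric series.
ALLOWED_MODELS: FrozenSet[str] = frozenset(
    {
        "anthropic.claude-3-5-sonnet-20241022-v2:0",
    }
)


def metric_model_label(model_id: str, allowed: FrozenSet[str] = ALLOWED_MODELS) -> str:
    """Map a model ID to itself if allowlisted, otherwise to a shared bucket."""
    return model_id if model_id in allowed else "other"
```

Traces keep full fidelity this way, and an unexpected model shows up in dashboards as growth in the "other" series rather than as a new series per variant.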

Tempo's storage costs scale with prompt verbosity

If you do capture prompt content in spans, be aware that a single trace for a RAG pipeline with a 32K-token context window will serialize to several hundred kilobytes. At 10,000 requests per day, that is gigabytes of trace data per day from a single service. S3 is cheap, but Tempo's compaction and query performance degrade if object sizes are wildly inconsistent. The bluntest fix is to overwrite the attribute in the Collector so the content never reaches Tempo:

processors:
  attributes:
    actions:
      - key: gen_ai.prompt
        action: update
        value: "[REDACTED]"

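Or, if you want partial capture for debugging, truncate instead. The attributes processor has no truncate action, but the transform processor's truncate_all function caps string attribute values at a fixed length.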

What This Looks Like End to End

A single user request to a RAG-backed feature produces a trace with roughly this shape:

HTTP POST /api/v1/answer   [FastAPI]  450ms
  ├── vector_search         [pgvector] 12ms
  ├── context_assembly      [Python]   3ms
  └── chat anthropic.claude [Bedrock]  430ms
        gen_ai.usage.input_tokens:  4821
        gen_ai.usage.output_tokens: 312
        gen_ai.response.finish_reasons: ["end_turn"]
        gen_ai.request.model: anthropic.claude-3-5-sonnet-20241022-v2:0

From that single trace, you can see that 95% of the request latency is Bedrock, the model consumed 4,821 input tokens (at $3/million, about $0.0145 per request) and 312 output tokens (at $15/million, about $0.0047), and the model finished cleanly. Multiply that by your request volume and you have a cost forecast that matches your AWS bill within a few percent.
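
The per-request arithmetic can be checked with a few lines of Python. This is a sketch using the prices quoted earlier in this post; PRICES_PER_MILLION and span_cost_usd are hypothetical names, and a real price table would come from wherever you maintain pricing.

```python
from decimal import Decimal
from typing import Dict, Tuple

# (input, output) price in USD per million tokens, as quoted in this post.
PRICES_PER_MILLION: Dict[str, Tuple[Decimal, Decimal]] = {
    "anthropic.claude-3-5-sonnet-20241022-v2:0": (Decimal("3"), Decimal("15")),
}


def span_cost_usd(
    model_id: str,
    input_tokens: int,
    output_tokens: int,
    prices: Dict[str, Tuple[Decimal, Decimal]] = PRICES_PER_MILLION,
) -> Decimal:
    """Estimate the cost of one model call from its token counts."""
    input_price, output_price = prices[model_id]
    total = input_tokens * input_price + output_tokens * output_price
    return total / Decimal(1_000_000)


cost = span_cost_usd("anthropic.claude-3-5-sonnet-20241022-v2:0", 4821, 312)
print(cost)  # 0.019143
```

For the trace above, that is roughly $0.0145 for input plus $0.0047 for output, or just under two cents per request.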

We have run this setup for a client handling roughly 80,000 Bedrock invocations per day. The Mimir cost dashboard tracks within 3% of the actual AWS Cost Explorer number, which is close enough to catch cost spikes in real time rather than at invoice time.

Where to Go Next

The OTel GenAI semantic conventions specification is at opentelemetry.io/docs/specs/semconv/gen-ai/ — read the span naming section carefully before you start, because renaming spans later means rewriting all your Tempo queries and Grafana panels.

The opentelemetry-python-contrib repository has the most active development on GenAI instrumentation. Watch the instrumentation/opentelemetry-instrumentation-bedrock directory; when that auto-instrumentation lands, you can drop the manual wrapper.

If you are running this on a stack that also handles non-LLM workloads — which is almost everyone — the architecture above composes cleanly with existing OTel pipelines. The GenAI spans are just spans. They flow through the same Collector, land in the same Tempo instance, and show up in the same service map alongside your database queries and HTTP calls. That is the point of stable conventions: LLM observability stops being a special case.

If you are early in this process — figuring out whether to self-host Tempo and Mimir or still evaluating the build vs. buy question — we have done this migration for several teams and the setup time is shorter than most people expect. The Etalon team is happy to walk through the architecture for your specific workload if that would be useful.

Category: Observability
