Bundled CLI silently drops inbound TRACEPARENT on 2nd+ query() call when ~/.claude/ has state from a prior run #952

@NBTDx

Summary

When `claude-agent-sdk-python`'s `query()` invokes the bundled Claude Code CLI with a parent W3C trace context in the environment (`TRACEPARENT`), the CLI nests its `claude_code.*` spans under the caller's trace only on the first invocation in the process's lifetime. On the second and subsequent invocations in the same long-running Python process, the same valid `TRACEPARENT` is silently ignored: `claude_code.llm_request` and `claude_code.tool` spans each emit with their own fresh trace ID and no parent, and no `claude_code.interaction` span is emitted at all.

The trigger is the CLI's persistent state at `~/.claude/` (the suspect is `~/.claude.json`, which the CLI creates on first run with a `firstStartTime` marker). Wiping that state between calls restores correct nesting; leaving it in place reproduces the bug every time.

Environment

  • claude-agent-sdk-python 0.1.x (Python 3.13)
  • Bundled CLI as shipped with the above version
  • Linux x86_64 container
  • Telemetry env vars at process level: `CLAUDE_CODE_ENABLE_TELEMETRY=1`, `CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1`, `OTEL_TRACES_EXPORTER=otlp`, `OTEL_EXPORTER_OTLP_ENDPOINT=...`, etc.
  • Backend: Langfuse Cloud (the issue is the shape of the emitted spans, not the destination, so it should manifest the same way on any OTLP collector)

Reproducer

```python
import asyncio
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

from claude_agent_sdk import query, ClaudeAgentOptions

# Register a global tracer provider that exports the caller's spans over OTLP/HTTP.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())
)
tracer = trace.get_tracer("repro")


async def one_call(label: str):
    # The active span is the parent the CLI should pick up via TRACEPARENT.
    with tracer.start_as_current_span(label):
        async for _ in query(prompt="echo hello", options=ClaudeAgentOptions()):
            pass


asyncio.run(one_call("call-1"))  # CLI spans nest under call-1 ✓
asyncio.run(one_call("call-2"))  # CLI spans become orphan roots ✗
```

I tested both with the SDK's auto-injection and with an explicit `TRACEPARENT` set in `ClaudeAgentOptions.env`. The behavior is the same in both cases, which rules out auto-injection as the cause.
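For anyone repeating the explicit-`TRACEPARENT` variant, here is a minimal sketch of how such a value can be built by hand in the W3C version `00` layout. The helper names `format_traceparent` and `explicit_trace_env` are hypothetical and not part of the SDK; the only SDK detail assumed is that the resulting dict is passed as `ClaudeAgentOptions.env`, as described above.

```python
def format_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a W3C traceparent value (version 00) from integer IDs."""
    if not 0 < trace_id < (1 << 128):
        raise ValueError("trace_id must be a non-zero 128-bit integer")
    if not 0 < span_id < (1 << 64):
        raise ValueError("span_id must be a non-zero 64-bit integer")
    flags = 0x01 if sampled else 0x00
    # Layout: version - 32 hex trace id - 16 hex parent span id - 2 hex flags
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def explicit_trace_env(trace_id: int, span_id: int, sampled: bool = True) -> dict[str, str]:
    """Hypothetical helper: env mapping to hand to the spawned CLI."""
    return {"TRACEPARENT": format_traceparent(trace_id, span_id, sampled)}
```

With OpenTelemetry, the integer IDs would come from `trace.get_current_span().get_span_context()` (`trace_id`, `span_id`, and `trace_flags.sampled`), read inside the `with tracer.start_as_current_span(...)` block.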

What I observed

For call-1: the `claude_code.interaction` span correctly carries the parent's trace_id from `TRACEPARENT`, and its `llm_request` / `tool` children inherit it. The whole interaction is one trace rooted at the caller's span.

For call-2 onward: each `claude_code.llm_request` and `claude_code.tool` span emits with its own freshly generated trace_id. There is no `claude_code.interaction` span at all in the second call's output, only the children, each of which becomes its own trace root. So even the CLI's internal context propagation (interaction → its children) appears to be affected, not just the handling of the inbound `TRACEPARENT`.
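The orphan pattern above can be checked mechanically once the spans are pulled back out of the collector. The following is a rough sketch only: `SpanRecord` and its field names are hypothetical stand-ins for whatever shape your backend returns, not a Langfuse or OTLP type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical flattened view of one exported span."""
    name: str
    trace_id: str                  # 32 lowercase hex chars
    parent_span_id: Optional[str]  # None for a trace root


def find_orphan_roots(spans: list[SpanRecord], expected_trace_id: str) -> list[SpanRecord]:
    """Return claude_code.* spans that escaped the caller's trace.

    A CLI span counts as an orphan if it carries a different trace_id
    than the caller's span, or if it has no parent at all.
    """
    return [
        s for s in spans
        if s.name.startswith("claude_code.")
        and (s.trace_id != expected_trace_id or s.parent_span_id is None)
    ]
```

For a healthy call the result is empty; for the broken calls described here, every `claude_code.llm_request` and `claude_code.tool` span would be returned.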

I verified the Python side is byte-identical between calls 1 and 2: same `TracerProvider`, same `CompositePropagator`, valid sampled `TRACEPARENT` string with correct trace_id matching the active Python span, correctly landing in `options.env` and `process_env` for the spawned subprocess. The bug is on the CLI side.
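The check on the `TRACEPARENT` string can be reproduced with a small stdlib-only validator. This is a sketch of the kind of comparison described above, not the exact code I ran; `parse_traceparent` and `matches_active_span` are hypothetical names, and the parser only accepts the four-field layout used by version `00`.

```python
import re
from typing import NamedTuple

# version - trace id - parent span id - flags, all lowercase hex
_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> TraceParent:
    """Parse a traceparent string, rejecting malformed or all-zero IDs."""
    m = _TRACEPARENT_RE.match(value.strip())
    if m is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    version, trace_id, parent_id, flags = m.groups()
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        raise ValueError(f"invalid traceparent: {value!r}")
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def matches_active_span(env: dict[str, str], trace_id: int, span_id: int) -> bool:
    """True if env carries a sampled TRACEPARENT pointing at the given span."""
    tp = parse_traceparent(env["TRACEPARENT"])
    return (
        tp.sampled
        and tp.trace_id == f"{trace_id:032x}"
        and tp.parent_id == f"{span_id:016x}"
    )
```

Running this against the env handed to the subprocess on both calls is one way to confirm that the parent context really is present and well-formed each time.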

What unblocks the bug

Wiping `~/.claude/` between calls makes the next call nest correctly. I confirmed this experimentally: clear the directory, run call-A (behaves like a first call), then run call-B without clearing (breaks again). The pattern is fully reproducible.

The most likely culprit is something in `~/.claude.json`, specifically the `firstStartTime` marker or one of the migration flags, that the CLI checks and uses to skip an OTel init or interaction-span construction step that it performs only on a "true first run."

Workaround

Override `HOME` to a unique throwaway path per `query()` call (e.g. `HOME=/tmp/agent-cli-` in `ClaudeAgentOptions.env`). The CLI then always thinks it's running for the first time and always honors `TRACEPARENT`. Cost: ~20KB per call accumulated in `/tmp` until container restart. Functional behavior unchanged because I don't use `--continue` / `--resume` / `--session-id`.
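A minimal sketch of that workaround, assuming a per-call temp directory is acceptable. `throwaway_home` is a hypothetical helper, not an SDK function. Unlike the setup described above, this variant deletes the directory when the call finishes, which is an added assumption that is only safe because nothing here relies on `--continue` / `--resume` / `--session-id`.

```python
import contextlib
import shutil
import tempfile
from collections.abc import Iterator


@contextlib.contextmanager
def throwaway_home(prefix: str = "agent-cli-") -> Iterator[dict[str, str]]:
    """Yield an env mapping whose HOME points at a fresh, empty directory.

    The directory is created under the system temp dir and removed on exit,
    so the CLI never sees state left behind by a previous run.
    """
    home = tempfile.mkdtemp(prefix=prefix)
    try:
        yield {"HOME": home}
    finally:
        shutil.rmtree(home, ignore_errors=True)


# Intended use inside the reproducer's one_call(), shown as a comment
# because it needs the SDK and a working CLI to run:
#
#     with throwaway_home() as env:
#         async for _ in query(prompt="echo hello",
#                              options=ClaudeAgentOptions(env=env)):
#             pass
```

If you already pass other variables through `ClaudeAgentOptions.env`, merge them with the yielded mapping rather than replacing them.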

Impact

For anyone using the SDK in a long-running server or worker process that makes multiple `query()` calls (the dominant deployment shape per the "Hosting the Agent SDK" docs), every call after the first produces fragmented telemetry. Without the workaround above, the "Read agent traces" flow described in the observability docs is unusable past the first call.

Asks

  • Confirm whether the CLI is supposed to re-read `TRACEPARENT` and re-establish parent context on every subprocess invocation regardless of `~/.claude/` state
  • If yes, the regression is in whatever code path differs between "first start" and "subsequent start" — likely in the OTel SDK init or the interaction-span construction
  • Happy to test a fix or provide more diagnostic data
