Summary
When the bundled Claude Code CLI is invoked from `claude-agent-sdk-python`'s `query()` with a parent W3C trace context in the environment (`TRACEPARENT`), it correctly nests its `claude_code.*` spans under the caller's trace only on the first invocation in the process's lifetime. On the second and subsequent invocations in the same long-running Python process, the same valid `TRACEPARENT` is silently ignored: `claude_code.interaction` / `claude_code.llm_request` / `claude_code.tool` spans each emit with their own fresh trace IDs and no parent.
The trigger is the persistent state directory at `~/.claude/` (specifically `~/.claude.json`, created by the CLI on first run with a `firstStartTime` marker). Wiping that directory between calls restores correct nesting; leaving it in place reproduces the bug 100% of the time.
Environment
- `claude-agent-sdk-python` 0.1.x (Python 3.13)
- Bundled CLI as shipped with the above version
- Linux x86_64 container
- Telemetry envs at process level: `CLAUDE_CODE_ENABLE_TELEMETRY=1`, `CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1`, `OTEL_TRACES_EXPORTER=otlp`, `OTEL_EXPORTER_OTLP_ENDPOINT=...`, etc.
- Backend: Langfuse Cloud (but the issue is shape, not destination — it would manifest the same on any OTLP collector)
Reproducer
```python
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

from claude_agent_sdk import query, ClaudeAgentOptions

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())
)
tracer = trace.get_tracer("repro")


async def one_call(label: str):
    with tracer.start_as_current_span(label):
        async for _ in query(prompt="echo hello", options=ClaudeAgentOptions()):
            pass


asyncio.run(one_call("call-1"))  # CLI spans nest under call-1 ✓
asyncio.run(one_call("call-2"))  # CLI spans become orphan roots ✗
```
Tested both with the SDK's auto-injection and with explicit `TRACEPARENT` set in `ClaudeAgentOptions.env` — same behavior, ruling out auto-injection as the cause.
What I observed
For call-1: the `claude_code.interaction` span correctly carries the parent's trace_id from `TRACEPARENT`, and its `llm_request` / `tool` children inherit it. The whole interaction is one trace rooted at the caller's span.
For call-2 onward: each `claude_code.llm_request` and `claude_code.tool` emits with its own freshly-generated trace_id. There is no `claude_code.interaction` span at all in the second call's emission — only the children, each becoming its own trace root. So even the CLI's internal context propagation (interaction → its children) appears affected, not just the inbound TRACEPARENT.
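To make the "orphan root" symptom checkable on exported data, here is a minimal sketch that flags `claude_code.*` spans which did not land in the caller's trace. `SpanRecord` and `orphaned_cli_spans` are hypothetical names for illustration, not SDK or CLI types; the sketch assumes the spans have already been pulled from the backend and flattened into these fields.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SpanRecord:
    # Hypothetical flattened view of an exported span (not an SDK type).
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None when the span is a trace root


def orphaned_cli_spans(spans: Iterable[SpanRecord], caller_trace_id: str) -> List[SpanRecord]:
    """Return claude_code.* spans whose trace_id differs from the caller's trace."""
    return [
        s
        for s in spans
        if s.name.startswith("claude_code.") and s.trace_id != caller_trace_id
    ]
```

On a healthy call the result should be empty; on the broken calls described above, every `llm_request` and `tool` span would be expected to show up in it.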
I verified the Python side is byte-identical between calls 1 and 2: same `TracerProvider`, same `CompositePropagator`, valid sampled `TRACEPARENT` string with correct trace_id matching the active Python span, correctly landing in `options.env` and `process_env` for the spawned subprocess. The bug is on the CLI side.
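For anyone repeating that check, a stdlib-only sketch of validating a `TRACEPARENT` value against the W3C Trace Context version `00` layout (`version-traceid-parentid-flags`) might look like the following. `parse_traceparent` and `TraceParent` are illustrative names, not part of OpenTelemetry or the SDK, and the sketch deliberately rejects anything that is not the four-field lowercase-hex form.

```python
import re
from typing import NamedTuple

# Version 00 layout: 2 hex - 32 hex - 16 hex - 2 hex, lowercase only.
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    """Parse a traceparent string, raising ValueError if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

Comparing the returned `trace_id` with the active Python span's trace ID (formatted as 32 hex characters) is one way to confirm the value handed to the subprocess is the one the caller intended.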
What unblocks the bug
Wiping `~/.claude/` between calls makes the next call nest correctly. I confirmed this experimentally — clearing the dir, running call-A (works like a first call), running call-B without clearing (breaks again). The pattern is fully reproducible.
The most likely culprit is something in `~/.claude.json` — specifically the `firstStartTime` marker or one of the migration flags — that the CLI checks and uses to skip some OTel init or interaction-span construction step it does only on a "true first run."
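To help narrow down which persisted field matters, a small diagnostic along these lines could snapshot the state file before and after each call. This is only a sketch: `summarize_cli_state` is a hypothetical helper, and it assumes nothing about the file beyond it being a JSON object that may contain `firstStartTime`.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def summarize_cli_state(path: Union[str, Path]) -> Dict[str, Any]:
    """Report whether the state file exists, its firstStartTime, and its top-level keys."""
    state_path = Path(path)
    if not state_path.exists():
        return {"exists": False, "firstStartTime": None, "keys": []}
    data = json.loads(state_path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object in {state_path}")
    return {
        "exists": True,
        "firstStartTime": data.get("firstStartTime"),
        "keys": sorted(data),
    }
```

Calling it with `Path.home() / ".claude.json"` around call-1 and call-2 and diffing the `keys` lists would show which entries appear after the first run.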
Workaround
Override `HOME` to a unique throwaway path per `query()` call (e.g. `HOME=/tmp/agent-cli-` in `ClaudeAgentOptions.env`). The CLI then always thinks it's running for the first time and always honors `TRACEPARENT`. Cost: ~20KB per call accumulated in `/tmp` until container restart. Functional behavior unchanged because I don't use `--continue` / `--resume` / `--session-id`.
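A minimal sketch of that workaround, using only the standard library, is below. `throwaway_home` is a hypothetical helper name. It additionally deletes the directory once the call finishes, which assumes the CLI subprocess has exited and flushed its spans by then, and that authentication does not depend on files under the real `HOME`.

```python
import contextlib
import shutil
import tempfile
from typing import Dict, Iterator


@contextlib.contextmanager
def throwaway_home(prefix: str = "agent-cli-") -> Iterator[Dict[str, str]]:
    """Yield an env mapping whose HOME points at a fresh temporary directory."""
    home = tempfile.mkdtemp(prefix=prefix)
    try:
        yield {"HOME": home}
    finally:
        # Assumption: nothing in this directory is needed after the call.
        shutil.rmtree(home, ignore_errors=True)


# Intended use with the reproducer above (not executed here):
#
# async def one_call(label: str):
#     with tracer.start_as_current_span(label), throwaway_home() as env:
#         async for _ in query(prompt="echo hello", options=ClaudeAgentOptions(env=env)):
#             pass
```

Cleaning up inside the context manager avoids the per-call accumulation in `/tmp`; dropping the `finally` block would match the workaround exactly as described.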
Impact
For anyone using the SDK in a long-running server / worker process that makes multiple `query()` calls (the dominant deployment shape per the "Hosting the Agent SDK" docs), every call after the first produces fragmented telemetry. That makes the "Read agent traces" flow described in the observability docs unusable past the first call without the workaround above.
Asks
- Confirm whether the CLI is supposed to re-read `TRACEPARENT` and re-establish parent context on every subprocess invocation regardless of `~/.claude/` state
- If yes, the regression is in whatever code path differs between "first start" and "subsequent start" — likely in the OTel SDK init or the interaction-span construction
- Happy to test a fix or provide more diagnostic data