Problem
Sentinel tracing currently instruments the HTTP portions of the event path and the processor's local worker spans, but the trace is not continuous across Kafka.
Current path:
mcp-proxy -> POST /events -> ingest -> Kafka mcp.events -> processor -> ClickHouse
What works today:
- mcp-proxy uses OpenTelemetry HTTP instrumentation for inbound requests and outbound analytics calls.
- ingest wraps its HTTP server with otelhttp, so POST /events gets an HTTP server span.
- processor creates spans for kafka.consume and clickhouse.insert_batch.
- Processor spans include useful attributes such as Kafka topic/partition/offset and ClickHouse batch size.
Missing piece:
- ingest writes kafka.Message{Value: raw} without injecting W3C trace context into Kafka headers.
- processor consumes Kafka messages and starts spans from its background context instead of extracting trace context from message headers.
- As a result, Grafana/Tempo cannot show one connected trace from proxy analytics emission through ingest, Kafka, processor, and ClickHouse insert.
Proposed work
- Inject OpenTelemetry/W3C trace context into Kafka message headers when ingest publishes to mcp.events.
- Extract trace context from Kafka message headers in processor before starting kafka.consume spans.
- Make kafka.consume and clickhouse.insert_batch appear as part of the same distributed trace, or use span links if batching makes strict parent/child semantics misleading.
- Add stable event correlation, preferably an event_id or stored trace_id, so ClickHouse rows can be linked back to traces.
- Add processor span attributes for batch boundaries, such as topic, partition, first offset, last offset, and batch size.
- Add tests for Kafka header injection/extraction and context propagation behavior.
Acceptance criteria
- A single audit event emitted by mcp-proxy can be followed in Tempo from HTTP ingestion through Kafka consumption and ClickHouse insertion.
- Processor spans retain existing attributes and include enough batch metadata to debug lag or partial failures.
- Invalid messages still get handled safely and do not block the consumer.
- Existing health, metrics, batching, and offset-commit behavior remain unchanged.
Relevant code
- services/ingest/main.go: handleEvents publishes Kafka messages.
- services/processor/main.go: consumes Kafka messages and writes ClickHouse batches.
- k8s/01-config.yaml: sets OTEL_EXPORTER_OTLP_ENDPOINT.
- k8s/15-otel-collector.yaml and k8s/16-tempo.yaml: trace collection/storage pipeline.