Add end-to-end trace propagation for Sentinel event ingestion #116

@Agent-Hellboy

Description

Problem

Sentinel tracing currently instruments the HTTP portions of the event path and the processor's local worker spans, but the trace breaks at the Kafka boundary: context is neither injected by the producer nor extracted by the consumer, so it is not continuous across Kafka.

Current path:

mcp-proxy -> POST /events -> ingest -> Kafka mcp.events -> processor -> ClickHouse

What works today:

  • mcp-proxy uses OpenTelemetry HTTP instrumentation for inbound requests and outbound analytics calls.
  • ingest wraps its HTTP server with otelhttp, so POST /events gets an HTTP server span.
  • processor creates spans for kafka.consume and clickhouse.insert_batch.
  • Processor spans include useful attributes such as Kafka topic/partition/offset and ClickHouse batch size.

Missing piece:

  • ingest writes kafka.Message{Value: raw} without injecting W3C trace context into Kafka headers.
  • processor consumes Kafka messages and starts spans from its background context instead of extracting trace context from message headers.
  • As a result, Grafana/Tempo cannot show one connected trace from proxy analytics emission through ingest, Kafka, processor, and ClickHouse insert.

Proposed work

  • Inject OpenTelemetry/W3C trace context into Kafka message headers when ingest publishes to mcp.events.
  • Extract trace context from Kafka message headers in processor before starting kafka.consume spans.
  • Make kafka.consume and clickhouse.insert_batch appear as part of the same distributed trace, or use span links if batching makes strict parent/child semantics misleading.
  • Add stable event correlation, preferably an event_id or stored trace_id, so ClickHouse rows can be linked back to traces.
  • Add processor span attributes for batch boundaries, such as topic, partition, first offset, last offset, and batch size.
  • Add tests for Kafka header injection/extraction and context propagation behavior.

Acceptance criteria

  • A single audit event emitted by mcp-proxy can be followed in Tempo from HTTP ingestion through Kafka consumption and ClickHouse insertion.
  • Processor spans retain existing attributes and include enough batch metadata to debug lag or partial failures.
  • Invalid messages still get handled safely and do not block the consumer.
  • Existing health, metrics, batching, and offset-commit behavior remain unchanged.

Relevant code

  • services/ingest/main.go: handleEvents publishes Kafka messages.
  • services/processor/main.go: consumes Kafka messages and writes ClickHouse batches.
  • k8s/01-config.yaml: sets OTEL_EXPORTER_OTLP_ENDPOINT.
  • k8s/15-otel-collector.yaml and k8s/16-tempo.yaml: trace collection/storage pipeline.
