Distributed Tracing¶

Real-time trace correlation and analysis for MCP Mesh using Redis Streams

Overview¶

MCP Mesh implements a high-performance distributed tracing system built on Redis Streams that provides end-to-end visibility into request flows across multiple agents. Unlike traditional OpenTelemetry setups, this system is specifically designed for MCP's JSON-RPC protocol with automatic context propagation and real-time correlation.

Architecture Components¶

1. Python Agent Tracing (Publishers)¶

Python agents automatically publish trace events to Redis Streams when decorated with @mesh.tool():

@app.tool()
@mesh.tool(depends_on=["data-processor"])
async def generate_report(title: str) -> str:
    # Automatic trace context creation and propagation
    # publishes span_start -> calls dependency -> publishes span_end
    processor = await mesh.get_agent("data-processor")
    return await processor.process_data({"title": title})

Event Types Published: - span_start: Operation begins - span_end: Operation completes successfully
- error: Operation fails with error details

2. Redis Streams (Transport Layer)¶

Stream Name: mcp-mesh:traces Consumer Group: mcp-mesh-registry-processors

Events are published asynchronously without blocking agent operations:

# View recent trace events
redis-cli XREVRANGE mcp-mesh:traces + - COUNT 10

# Monitor stream length
redis-cli XLEN mcp-mesh:traces

3. Go Registry (Consumer & Correlator)¶

The registry consumes events and correlates them into complete traces:

Consumer: Reads from Redis Streams with automatic failover
Correlator: Builds complete traces from individual span events
Exporters: Output traces in multiple formats (console, JSON, stats)

Configuration¶

Environment Variables¶

Variable	Default	Description
`MCP_MESH_DISTRIBUTED_TRACING_ENABLED`	`false`	Enable tracing system
`TRACE_EXPORTER_TYPE`	`console`	Export format
`TRACE_PRETTY_OUTPUT`	`true`	Pretty console output
`TRACE_ENABLE_STATS`	`true`	Collect statistics
`TRACE_JSON_OUTPUT_DIR`	`/tmp`	JSON export directory
`TRACE_BATCH_SIZE`	`100`	Consumer batch size
`TRACE_TIMEOUT`	`5m`	Trace completion timeout

Enable Tracing¶

# Enable in registry
export MCP_MESH_DISTRIBUTED_TRACING_ENABLED=true
export TRACE_EXPORTER_TYPE=console
export TRACE_PRETTY_OUTPUT=true

meshctl start --registry-only

Python agents automatically detect when tracing is enabled and begin publishing events.

Trace Data Model¶

TraceEvent Structure¶

{
  "trace_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "span_id": "x1y2z3w4-a5b6-c789-def0-123456789abc", 
  "parent_span": "parent-span-id-if-exists",
  "agent_name": "weather-service",
  "agent_id": "weather-123",
  "ip_address": "192.168.1.100",
  "event_type": "span_start|span_end|error",
  "operation": "tool:get_weather",
  "timestamp": 1640995200.123,
  "duration_ms": 150,
  "success": true,
  "error_message": null,
  "capability": "get_weather",
  "target_agent": "data-processor",
  "runtime": "python-3.11"
}

CompletedTrace Structure¶

{
  "trace_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "spans": [
    {
      "span_id": "x1y2z3w4-a5b6-c789-def0-123456789abc",
      "agent_name": "weather-service", 
      "operation": "tool:get_weather",
      "start_time": "2024-01-01T10:00:00Z",
      "end_time": "2024-01-01T10:00:00.150Z",
      "duration_ms": 150,
      "success": true
    }
  ],
  "start_time": "2024-01-01T10:00:00Z",
  "end_time": "2024-01-01T10:00:00.300Z", 
  "duration": "300ms",
  "success": true,
  "span_count": 3,
  "agent_count": 2,
  "agents": ["weather-service", "data-processor"]
}

Trace Correlation Logic¶

1. Event Collection¶

Events are correlated by trace_id and individual spans by span_id:

span_start[trace_id=ABC, span_id=123] + span_end[trace_id=ABC, span_id=123] = Complete Span

2. Completion Detection¶

Traces are considered complete when: - All spans have both start and end events - No new events for 5 seconds - Contains at least one span

3. Export Triggers¶

Traces are exported when: - Immediately: When completion is detected during event processing - Cleanup: Every minute, completed traces are found and exported - Expiry: After 5 minutes of inactivity (incomplete traces)

Export Formats¶

Console Exporter¶

Real-time trace visualization in terminal:

export TRACE_EXPORTER_TYPE=console
export TRACE_PRETTY_OUTPUT=true

Output Example:

🔗 TRACE a1b2c3d4 (285ms) - SUCCESS (3 spans across 2 agents)
  📍 Agent: weather-service
    ✅ tool:get_weather [get_weather] (150ms)
  📍 Agent: data-processor  
    ✅ tool:process_data [process_data] (100ms)
    ✅ tool:validate_result [validate_result] (35ms)

JSON Exporter¶

Structured export for external systems:

export TRACE_EXPORTER_TYPE=json
export TRACE_JSON_OUTPUT_DIR=/var/log/traces

Output Files: - /var/log/traces/trace-{trace_id}.json - One file per completed trace

Statistics Exporter¶

Aggregate metrics collection:

export TRACE_EXPORTER_TYPE=multi  # Enables all exporters
export TRACE_ENABLE_STATS=true

Query API¶

Trace Status¶

GET /trace/status

Returns tracing configuration and runtime statistics:

{
  "enabled": true,
  "consumer": {
    "stream_name": "mcp-mesh:traces",
    "consumer_group": "mcp-mesh-registry-processors",
    "status": "running"
  },
  "correlator": {
    "active_traces": 5,
    "total_spans": 12,
    "oldest_trace_age": "45s"
  },
  "exporter": {
    "type": "console",
    "exported_traces": 147
  }
}

List Recent Traces¶

GET /trace/list?limit=20&offset=0

Returns paginated list of completed traces, newest first.

Get Specific Trace¶

GET /trace/{trace_id}

Retrieve complete trace details by ID.

Search Traces¶

GET /trace/search?agent_name=weather&success=true&min_duration_ms=100

Search Parameters:

Parameter	Type	Description
`parent_span_id`	string	Filter by parent span
`agent_name`	string	Filter by agent name
`operation`	string	Filter by operation (partial match)
`success`	boolean	Filter by success status
`start_time`	RFC3339	Filter by start time (after)
`end_time`	RFC3339	Filter by end time (before)
`min_duration_ms`	integer	Minimum duration filter
`max_duration_ms`	integer	Maximum duration filter
`limit`	integer	Result limit (max 100)

Trace Statistics¶

GET /trace/stats

Returns aggregate statistics:

{
  "total_traces": 1250,
  "success_traces": 1189,
  "failed_traces": 61,
  "success_rate": 95.12,
  "avg_duration_ms": 234.5,
  "avg_spans_per_trace": 2.8,
  "agents_involved": ["weather", "data-processor", "report-gen"],
  "top_operations": [
    {"operation": "tool:get_weather", "count": 456},
    {"operation": "tool:process_data", "count": 389}
  ]
}

Performance Analysis Examples¶

Find Slow Operations¶

# Operations taking longer than 1 second
curl "http://localhost:8000/trace/search?min_duration_ms=1000&limit=10" | jq '.traces[] | {trace_id, duration, agents}'

Debug Failed Operations¶

# Get recent failures with details
curl "http://localhost:8000/trace/search?success=false&limit=5" | jq '.traces[] | {trace_id, agents, spans: [.spans[] | select(.success == false)]}'

Agent Performance Analysis¶

# Analyze specific agent performance
curl "http://localhost:8000/trace/search?agent_name=weather-service&limit=50" | jq '[.traces[].duration] | add / length'

Time-based Analysis¶

# Get traces from last hour
HOUR_AGO=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
curl "http://localhost:8000/trace/search?start_time=$HOUR_AGO&limit=100" | jq '.traces | length'

Advanced Features¶

Context Propagation¶

Trace context automatically flows between agents:

# Parent agent
@mesh.tool(depends_on=["child-agent"])  
async def parent_operation():
    # trace_id and span_id automatically propagated
    child = await mesh.get_agent("child-agent")
    return await child.child_operation()

# Child agent  
@mesh.tool()
async def child_operation():
    # Inherits trace context from parent
    # New span created with parent span ID
    pass

Error Correlation¶

Failed operations are automatically correlated:

@mesh.tool()
async def failing_operation():
    try:
        # operation logic
        pass
    except Exception as e:
        # Error event automatically published with trace context
        raise  # Re-raise to maintain error handling

Multi-Agent Traces¶

Complex workflows spanning multiple agents are automatically traced:

User Request → Agent A → Agent B → Agent C
      ↓            ↓         ↓         ↓
   trace_id    same_id   same_id   same_id
   span_1      span_2    span_3    span_4
              parent=1  parent=2  parent=3

Storage and Retention¶

In-Memory Storage¶

Active Traces: Stored until completion or 5-minute timeout
Completed Traces: Last 1000 traces kept for querying
Automatic Cleanup: Oldest 20% removed when limit exceeded

Redis Stream Retention¶

# Configure Redis stream retention
redis-cli CONFIG SET stream-node-max-bytes 4096
redis-cli CONFIG SET stream-node-max-entries 100

# Manual stream cleanup (if needed)
redis-cli XTRIM mcp-mesh:traces MAXLEN ~ 10000

Troubleshooting¶

No Traces Appearing¶

Check tracing status:

curl http://localhost:8000/trace/status | jq .enabled

Verify Redis stream:

redis-cli XLEN mcp-mesh:traces
redis-cli XINFO GROUPS mcp-mesh:traces

Check agent connectivity:

# Python agents should log tracing status on startup
# Look for: "Tracing enabled, publishing to redis://..."

Incomplete Traces¶

Check for orphaned events:

redis-cli XREVRANGE mcp-mesh:traces + - COUNT 20

Monitor correlator status:

curl http://localhost:8000/trace/status | jq .correlator

Performance Issues¶

Monitor consumer lag:

redis-cli XINFO GROUPS mcp-mesh:traces
# Look for "lag" field in consumer info

Check memory usage:

curl http://localhost:8000/trace/stats | jq .
# Monitor active_traces count

Integration Examples¶

Prometheus Metrics¶

#!/bin/bash
# Export trace metrics to Prometheus

STATS=$(curl -s http://localhost:8000/trace/stats)
SUCCESS_RATE=$(echo $STATS | jq .success_rate)
AVG_DURATION=$(echo $STATS | jq .avg_duration_ms)

echo "mcp_mesh_trace_success_rate $SUCCESS_RATE" | curl -X POST --data-binary @- http://pushgateway:9091/metrics/job/mcp-mesh
echo "mcp_mesh_trace_avg_duration_ms $AVG_DURATION" | curl -X POST --data-binary @- http://pushgateway:9091/metrics/job/mcp-mesh

External APM Integration¶

#!/bin/bash
# Send traces to external APM (e.g., Datadog, New Relic)

curl -s "http://localhost:8000/trace/list?limit=100" | \
  jq -c '.traces[]' | \
  while read trace; do
    curl -X POST "https://api.datadoghq.com/api/v1/traces" \
      -H "DD-API-KEY: $DD_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$trace"
  done

Log Correlation¶

#!/bin/bash
# Correlate traces with application logs

# Extract trace IDs and search logs
curl -s "http://localhost:8000/trace/search?success=false&limit=10" | \
  jq -r '.traces[].trace_id' | \
  while read trace_id; do
    echo "=== Logs for trace $trace_id ==="
    grep "$trace_id" /var/log/mcp-mesh/*.log
  done

Best Practices¶

1. Monitoring¶

Set up alerts on trace export failures
Monitor trace completion rates
Track trace duration trends
Alert on error rate spikes

2. Performance¶

Use multi exporter for comprehensive observability
Configure appropriate Redis retention policies
Monitor correlator memory usage
Tune batch sizes for high throughput

3. Debugging¶

Use search API for targeted investigation
Correlate traces with application logs
Monitor Redis stream health
Check agent trace context propagation

4. Production Deployment¶

Configure JSON export for trace persistence
Set up external metrics collection
Implement trace sampling for high-volume systems
Monitor registry resource usage

Performance Characteristics¶

Throughput: 10,000+ spans/second sustained
Latency: <1ms trace event publishing (async)
Memory: ~1MB per 1000 completed traces
Storage: Configurable retention in Redis and memory
Correlation: Real-time span correlation and export
Availability: Registry failure doesn't impact agents

Next Steps¶

The distributed tracing system provides comprehensive observability out of the box. Consider extending with:

Custom Exporters: Implement organization-specific backends
Trace Sampling: Add intelligent sampling for high-volume scenarios
SLA Monitoring: Extract SLA metrics from trace data
Automated Alerting: Set up proactive monitoring based on trace patterns

💡 Tip: Use the trace search API with time windows to identify performance trends and system bottlenecks

📊 Performance: Monitor trace statistics regularly to ensure optimal system performance

🔗 Integration: Export traces to your existing observability stack using JSON exporter or custom exporters