Grafana Dashboards¶
Visualize and analyze MCP Mesh metrics with powerful, customizable dashboards
Overview¶
Grafana provides rich visualization capabilities for monitoring MCP Mesh deployments. This guide covers creating comprehensive dashboards, implementing drill-down navigation, setting up variables for dynamic filtering, and sharing dashboards across teams. You'll learn to build dashboards that provide actionable insights into agent performance, system health, and business metrics.
Well-designed Grafana dashboards transform raw metrics into meaningful visualizations that enable quick decision-making and proactive issue resolution.
Key Concepts¶
- Dashboard Organization: Logical grouping and navigation
- Panel Types: Time series, stat, gauge, heatmap, logs
- Variables: Dynamic filtering and drill-down
- Annotations: Correlating events with metrics
- Alerting: Visual alerts and notifications
Step-by-Step Guide¶
Step 1: Access and Configure Grafana¶
Connect to Grafana and configure data sources:
# Port forward to Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials (from kube-prometheus-stack)
# Username: admin
# Password: prom-operator
# Or get password from secret
kubectl get secret -n monitoring prometheus-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode
Configure Prometheus data source:
# datasource-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-operated:9090
        isDefault: true
        jsonData:
          timeInterval: 30s
          queryTimeout: 60s
          httpMethod: POST
          # Enable exemplars for trace correlation
          exemplarTraceIdDestinations:
            - name: traceID
              datasourceUid: tempo
Step 2: Create MCP Mesh Overview Dashboard¶
Create a comprehensive overview dashboard:
{
"dashboard": {
"title": "MCP Mesh Overview",
"uid": "mcp-mesh-overview",
"description": "High-level overview of MCP Mesh platform health and performance",
"tags": ["mcp-mesh", "overview"],
"timezone": "browser",
"schemaVersion": 30,
"version": 1,
"refresh": "30s",
"variables": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(mcp_mesh_requests_total, namespace)",
"refresh": 1,
"multi": false,
"includeAll": true,
"allValue": ".*"
},
{
"name": "agent",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(mcp_mesh_requests_total{namespace=~\"$namespace\"}, agent)",
"refresh": 1,
"multi": true,
"includeAll": true,
"allValue": ".*"
},
{
"name": "interval",
"type": "interval",
"options": [
{ "text": "1m", "value": "1m" },
{ "text": "5m", "value": "5m" },
{ "text": "10m", "value": "10m" },
{ "text": "30m", "value": "30m" },
{ "text": "1h", "value": "1h" }
],
"current": {
"text": "5m",
"value": "5m"
}
}
]
},
"panels": [
{
"title": "System Health Score",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
"targets": [
{
"expr": "avg(mcp_mesh:registry:health_score)",
"refId": "A"
}
],
"options": {
"reduceOptions": {
"calcs": ["lastNotNull"]
},
"colorMode": "background",
"graphMode": "none",
"orientation": "horizontal"
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 80 },
{ "color": "green", "value": 95 }
]
}
}
}
},
{
"title": "Active Agents",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
"targets": [
{
"expr": "count(up{job=~\"mcp-mesh.*\", namespace=~\"$namespace\"} == 1)",
"refId": "A"
}
],
"options": {
"colorMode": "value",
"graphMode": "area",
"orientation": "horizontal"
}
},
{
"title": "Total Request Rate",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
"targets": [
{
"expr": "sum(rate(mcp_mesh_requests_total{namespace=~\"$namespace\", agent=~\"$agent\"}[$interval]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"decimals": 2
}
}
},
{
"title": "Error Rate",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 },
"targets": [
{
"expr": "sum(rate(mcp_mesh_requests_total{namespace=~\"$namespace\", agent=~\"$agent\", status=\"error\"}[$interval])) / sum(rate(mcp_mesh_requests_total{namespace=~\"$namespace\", agent=~\"$agent\"}[$interval])) * 100",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"decimals": 2,
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": 0 },
{ "color": "yellow", "value": 1 },
{ "color": "red", "value": 5 }
]
}
}
}
},
{
"title": "Request Rate by Agent",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
"targets": [
{
"expr": "sum by (agent) (rate(mcp_mesh_requests_total{namespace=~\"$namespace\", agent=~\"$agent\"}[$interval]))",
"legendFormat": "{% raw %}{{agent}}{% endraw %}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"custom": {
"drawStyle": "line",
"lineInterpolation": "smooth",
"lineWidth": 2,
"fillOpacity": 10,
"gradientMode": "opacity",
"spanNulls": false,
"showPoints": "never",
"stacking": {
"mode": "none"
}
}
}
}
},
{
"title": "Response Time Heatmap",
"type": "heatmap",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
"targets": [
{
"expr": "sum by (le) (increase(mcp_mesh_request_duration_seconds_bucket{namespace=~\"$namespace\", agent=~\"$agent\"}[$interval]))",
"format": "heatmap",
"refId": "A"
}
],
"options": {
"calculate": false,
"yAxis": {
"unit": "s",
"decimals": 2
},
"cellGap": 1,
"colorScheme": "interpolateSpectral"
}
}
]
}
}
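Grafana places panels on a 24-column grid, and each gridPos gives a panel's width (w), height (h), and top-left corner (x, y). The sketch below is a hypothetical helper (find_layout_problems is not part of Grafana) that checks a panel list for panels that run past column 24 or overlap each other, which can be useful before committing hand-edited JSON.

```python
from itertools import combinations

GRID_COLUMNS = 24  # Grafana's dashboard grid is 24 columns wide


def overlaps(a, b):
    """Return True if two gridPos rectangles share any grid cell."""
    return (
        a["x"] < b["x"] + b["w"]
        and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"]
        and b["y"] < a["y"] + a["h"]
    )


def find_layout_problems(panels):
    """Return human-readable layout problems for a list of panels."""
    problems = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"{panel['title']}: extends past column {GRID_COLUMNS}")
    for first, second in combinations(panels, 2):
        if overlaps(first["gridPos"], second["gridPos"]):
            problems.append(f"{first['title']} overlaps {second['title']}")
    return problems


# The top row of the overview dashboard: four stat panels, 6 columns each.
top_row = [
    {"title": "System Health Score", "gridPos": {"h": 4, "w": 6, "x": 0, "y": 0}},
    {"title": "Active Agents", "gridPos": {"h": 4, "w": 6, "x": 6, "y": 0}},
    {"title": "Total Request Rate", "gridPos": {"h": 4, "w": 6, "x": 12, "y": 0}},
    {"title": "Error Rate", "gridPos": {"h": 4, "w": 6, "x": 18, "y": 0}},
]
print(find_layout_problems(top_row))  # prints []
```

The four stat panels fill the 24 columns exactly, so the check reports nothing for the overview layout.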
Step 3: Create Agent-Specific Dashboard¶
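Build detailed dashboards for individual agents. The panels reference $agent and $interval, so this dashboard needs the same variable definitions as the overview dashboard in Step 2. Because the queries use an exact match (agent="$agent"), keep the agent variable single-select here: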
{
"dashboard": {
"title": "MCP Mesh Agent Details",
"uid": "mcp-mesh-agent-details",
"description": "Detailed metrics for individual MCP Mesh agents",
"panels": [
{
"title": "Agent Info",
"type": "table",
"gridPos": { "h": 4, "w": 24, "x": 0, "y": 0 },
"targets": [
{
"expr": "mcp_mesh_agent{agent=\"$agent\"}",
"format": "table",
"instant": true,
"refId": "A"
}
],
"transformations": [
{
"id": "filterFieldsByName",
"options": {
"include": {
"names": ["agent_name", "version", "capabilities", "Value"]
}
}
}
]
},
{
"title": "Request Latency Percentiles",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
"targets": [
{
"expr": "histogram_quantile(0.50, sum by (le) (rate(mcp_mesh_request_duration_seconds_bucket{agent=\"$agent\"}[$interval])))",
"legendFormat": "p50",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(mcp_mesh_request_duration_seconds_bucket{agent=\"$agent\"}[$interval])))",
"legendFormat": "p95",
"refId": "B"
},
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(mcp_mesh_request_duration_seconds_bucket{agent=\"$agent\"}[$interval])))",
"legendFormat": "p99",
"refId": "C"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": {
"axisLabel": "Response Time",
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 0
}
},
"overrides": [
{
"matcher": { "id": "byName", "options": "p99" },
"properties": [
{
"id": "color",
"value": { "mode": "fixed", "fixedColor": "red" }
}
]
}
]
}
},
{
"title": "Request Types Distribution",
"type": "piechart",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
"targets": [
{
"expr": "sum by (method) (increase(mcp_mesh_requests_total{agent=\"$agent\"}[$interval]))",
"legendFormat": "{% raw %}{{method}}{% endraw %}",
"refId": "A"
}
],
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"pieType": "donut",
"displayLabels": ["name", "percent"],
"legendDisplayMode": "table",
"legendPlacement": "right"
}
},
{
"title": "Active Connections",
"type": "graph",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
"targets": [
{
"expr": "mcp_mesh_connections_active{agent=\"$agent\"}",
"legendFormat": "{% raw %}{{type}}{% endraw %}",
"refId": "A"
}
],
"yaxes": [
{
"label": "Connections",
"format": "short",
"min": 0
}
]
},
{
"title": "Resource Usage",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 12 },
"targets": [
{
"expr": "rate(container_cpu_usage_seconds_total{pod=~\"$agent.*\"}[$interval]) * 100",
"legendFormat": "CPU %",
"refId": "A"
},
{
"expr": "container_memory_working_set_bytes{pod=~\"$agent.*\"} / 1024 / 1024",
"legendFormat": "Memory (MB)",
"refId": "B"
}
],
"fieldConfig": {
"defaults": {
"custom": {
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 10
}
},
"overrides": [
{
"matcher": { "id": "byName", "options": "CPU %" },
"properties": [
{ "id": "unit", "value": "percent" },
{ "id": "custom.axisPlacement", "value": "left" }
]
},
{
"matcher": { "id": "byName", "options": "Memory (MB)" },
"properties": [
{ "id": "unit", "value": "decmbytes" },
{ "id": "custom.axisPlacement", "value": "right" }
]
}
]
}
}
]
}
}
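The latency panel relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The following is a simplified model of that calculation for classic histograms, not Prometheus's implementation. The bucket boundaries and counts are made up, and it assumes at least one finite bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    The buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank lands in the +Inf bucket, so the best available
                # answer is the highest finite bucket boundary.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative request counts per latency bucket (seconds).
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
for q in (0.50, 0.95, 0.99):
    print(f"p{round(q * 100)}: {histogram_quantile(q, buckets):.4f}s")
```

With these counts the sketch reports 0.1 s for p50, about 0.81 s for p95, and 1 s for p99. The p99 value is capped at the highest finite boundary, which is why bucket boundaries should extend past the latencies you care about.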
Step 4: Create Business Metrics Dashboard¶
Visualize business-specific KPIs:
{
"dashboard": {
"title": "MCP Mesh Business Metrics",
"uid": "mcp-mesh-business",
"description": "Business KPIs and analytics for MCP Mesh",
"panels": [
{
"title": "Revenue by Agent",
"type": "bargauge",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
"targets": [
{
"expr": "sum by (agent) (increase(mcp_mesh_business_revenue_total[$__range]))",
"legendFormat": "{% raw %}{{agent}}{% endraw %}",
"refId": "A"
}
],
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"fieldConfig": {
"defaults": {
"unit": "currencyUSD",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": 0 },
{ "color": "yellow", "value": 10000 },
{ "color": "red", "value": 50000 }
]
}
}
}
},
{
"title": "API Usage Costs",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
"targets": [
{
"expr": "sum by (api_provider) (increase(mcp_mesh_business_api_calls_total[$interval]) * 0.001)",
"legendFormat": "{% raw %}{{api_provider}}{% endraw %} ($0.001/call)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "currencyUSD",
"custom": {
"stacking": {
"mode": "normal"
},
"fillOpacity": 50
}
}
}
},
{
"title": "Cache Performance",
"type": "gauge",
"gridPos": { "h": 8, "w": 8, "x": 0, "y": 8 },
"targets": [
{
"expr": "avg(mcp_mesh_cache_hit_rate)",
"refId": "A"
}
],
"options": {
"orientation": "auto",
"showThresholdLabels": true,
"showThresholdMarkers": true
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 60 },
{ "color": "green", "value": 80 }
]
}
}
}
},
{
"title": "SLO Compliance",
"type": "stat",
"gridPos": { "h": 8, "w": 8, "x": 8, "y": 8 },
"targets": [
{
"expr": "avg(mcp_mesh:slo:availability) * 100",
"refId": "A"
}
],
"options": {
"colorMode": "background",
"graphMode": "none",
"orientation": "horizontal"
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"decimals": 3,
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 99 },
{ "color": "green", "value": 99.9 }
]
}
}
}
},
{
"title": "Cost per Transaction",
"type": "timeseries",
"gridPos": { "h": 8, "w": 8, "x": 16, "y": 8 },
"targets": [
{
"expr": "(sum(rate(container_cpu_usage_seconds_total{namespace=\"mcp-mesh\"}[$interval])) * 0.05 + sum(container_memory_working_set_bytes{namespace=\"mcp-mesh\"}) / 1024 / 1024 / 1024 * 0.01) / sum(rate(mcp_mesh_requests_total[$interval]))",
"legendFormat": "Cost per transaction",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "currencyUSD",
"decimals": 6,
"custom": {
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 20,
"gradientMode": "opacity"
}
}
}
}
]
}
}
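The Cost per Transaction query combines three measurements: CPU cores in use, memory in GiB, and request rate. As a worked example of the same arithmetic, the sketch below uses the 0.05 and 0.01 price factors from the query and made-up resource numbers. The function name is hypothetical.

```python
import math

CPU_COST_PER_CORE = 0.05    # price factor per CPU core, taken from the query
MEMORY_COST_PER_GIB = 0.01  # price factor per GiB of memory, taken from the query


def cost_per_transaction(cpu_cores, memory_bytes, requests_per_second):
    """Mirror the panel query: resource cost divided by request rate."""
    memory_gib = memory_bytes / 1024 / 1024 / 1024
    resource_cost = cpu_cores * CPU_COST_PER_CORE + memory_gib * MEMORY_COST_PER_GIB
    if requests_per_second == 0:
        # PromQL yields +Inf when a positive value is divided by zero.
        return math.inf
    return resource_cost / requests_per_second


# Made-up load: 2 cores busy, 4 GiB of memory, 100 requests per second.
cost = cost_per_transaction(2, 4 * 1024**3, 100)
print(f"{cost:.6f}")  # prints 0.001400
```

Because the denominator is a per-second rate, the result is a true cost per request only if the two price factors are per-second prices. If they are hourly prices, divide the resource cost by 3600 first.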
Step 5: Implement Dynamic Dashboards¶
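Create dashboards with advanced features such as a data source variable, a free-text regex filter, annotations, and drill-down links. The latency panel and the Jaeger link also use $interval and ${agent}, which this snippet does not define, so add those variables (see Step 2) before deploying it: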
# dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboards-dynamic
namespace: monitoring
data:
dynamic-dashboard.json: |
{
"dashboard": {
"title": "MCP Mesh Dynamic Analysis",
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"current": {
"text": "Prometheus",
"value": "Prometheus"
}
},
{
"name": "agent_regex",
"type": "textbox",
"current": {
"text": ".*",
"value": ".*"
},
"label": "Agent Filter (regex)"
},
{
"name": "percentile",
"type": "custom",
"current": {
"text": "0.95",
"value": "0.95"
},
"options": [
{"text": "p50", "value": "0.5"},
{"text": "p90", "value": "0.9"},
{"text": "p95", "value": "0.95"},
{"text": "p99", "value": "0.99"}
]
}
]
},
"annotations": {
"list": [
{
"datasource": "Prometheus",
"enable": true,
"expr": "changes(mcp_mesh_agent{agent=~\"$agent_regex\"}[5m]) > 0",
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Agent Restarts",
"tagKeys": "agent,version"
},
{
"datasource": "Prometheus",
"enable": true,
"expr": "ALERTS{alertstate=\"firing\",namespace=\"mcp-mesh\"}",
"iconColor": "rgba(255, 96, 96, 1)",
"name": "Active Alerts",
"tagKeys": "alertname,severity"
}
]
},
"links": [
{
"title": "Drill Down",
"type": "dashboards",
"tags": ["mcp-mesh", "agent"],
"includeVars": true,
"keepTime": true
},
{
"title": "View in Jaeger",
"type": "link",
"url": "http://jaeger:16686/search?service=${agent}&start=${__from}&end=${__to}",
"targetBlank": true
}
],
"panels": [
{
"title": "Dynamic Latency Analysis",
"type": "graph",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 0},
"targets": [
{
"expr": "histogram_quantile($percentile, sum by (agent, le) (rate(mcp_mesh_request_duration_seconds_bucket{agent=~\"$agent_regex\"}[$interval])))",
"legendFormat": "{% raw %}{{agent}}{% endraw %} - p${percentile:raw}",
"refId": "A"
}
],
"options": {
"dataLinks": [
{
"title": "View traces",
"url": "/explore?left={\"datasource\":\"Tempo\",\"queries\":[{\"query\":\"agent=${__series.labels.agent}\"}],\"range\":{\"from\":\"${__value.time}\",\"to\":\"${__value.time}\"}}"
}
]
}
},
{
"title": "Adaptive Thresholds",
"type": "timeseries",
"gridPos": {"h": 10, "w": 24, "x": 0, "y": 10},
"targets": [
{
"expr": "mcp_mesh:request_rate{agent=~\"$agent_regex\"}",
"legendFormat": "{% raw %}{{agent}}{% endraw %} - actual",
"refId": "A"
},
{
"expr": "predict_linear(mcp_mesh:request_rate{agent=~\"$agent_regex\"}[1h], 3600)",
"legendFormat": "{% raw %}{{agent}}{% endraw %} - predicted",
"refId": "B"
},
{
"expr": "mcp_mesh:request_rate{agent=~\"$agent_regex\"} + 2 * stddev_over_time(mcp_mesh:request_rate{agent=~\"$agent_regex\"}[1h])",
"legendFormat": "{% raw %}{{agent}}{% endraw %} - upper bound",
"refId": "C"
}
],
"fieldConfig": {
"overrides": [
{
"matcher": {"id": "byRegexp", "options": ".*predicted.*"},
"properties": [
{
"id": "custom.lineStyle",
"value": {"fill": "dash", "dash": [10, 10]}
}
]
},
{
"matcher": {"id": "byRegexp", "options": ".*upper bound.*"},
"properties": [
{
"id": "custom.lineStyle",
"value": {"fill": "dot", "dash": [2, 5]}
},
{
"id": "color",
"value": {"mode": "fixed", "fixedColor": "red"}
}
]
}
]
}
}
]
}
}
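The Adaptive Thresholds panel plots two derived series: a linear forecast (predict_linear) and an upper bound of the current value plus two standard deviations (stddev_over_time). The sketch below models both calculations on made-up samples. It is a simplified illustration, not Prometheus's code, and it assumes the samples have distinct timestamps.

```python
from statistics import pstdev


def predict_linear(samples, eval_time, seconds_ahead):
    """Least-squares forecast from (timestamp, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # Measure time relative to the evaluation time, so the intercept is the
    # fitted value "now" and the forecast is intercept + slope * seconds_ahead.
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * seconds_ahead


def upper_bound(samples):
    """Latest value plus two population standard deviations of the window."""
    ys = [v for _, v in samples]
    return ys[-1] + 2 * pstdev(ys)


# One hour of made-up request-rate samples, 15 minutes apart.
samples = [(0, 10.0), (900, 12.0), (1800, 14.0), (2700, 16.0), (3600, 18.0)]
print(round(predict_linear(samples, eval_time=3600, seconds_ahead=3600), 2))
print(round(upper_bound(samples), 2))
```

On this steadily rising series the forecast for one hour ahead is 26 requests per second, and the upper bound is about 23.66. A series that rises steadily therefore crosses its own two-sigma bound within the hour, which is worth keeping in mind before using the band as an anomaly signal.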
Step 6: Set Up Dashboard Provisioning¶
Automate dashboard deployment:
# dashboard-provisioning.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-provider
  namespace: monitoring
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
      - name: 'MCP Mesh Dashboards'
        orgId: 1
        folder: 'MCP Mesh'
        type: file
        disableDeletion: false
        updateIntervalSeconds: 10
        allowUiUpdates: true
        options:
          path: /var/lib/grafana/dashboards/mcp-mesh
      - name: 'MCP Mesh Generated'
        orgId: 1
        folder: 'MCP Mesh - Auto'
        type: file
        disableDeletion: true
        updateIntervalSeconds: 30
        options:
          path: /var/lib/grafana/dashboards/generated
---
# Script to generate dashboards dynamically
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboard-generator
  namespace: monitoring
data:
  generate.py: |
    #!/usr/bin/env python3
    import json
    import os

    from prometheus_api_client import PrometheusConnect

    OUTPUT_DIR = "/var/lib/grafana/dashboards/generated"


    # Defined before the loop below, which calls it at import time.
    def generate_panels_for_agent(agent):
        return [
            {
                "title": f"{agent} - Request Rate",
                "type": "graph",
                "targets": [
                    {
                        "expr": f'rate(mcp_mesh_requests_total{{agent="{agent}"}}[5m])',
                        "refId": "A"
                    }
                ]
            }
            # Add more panels...
        ]


    # Connect to Prometheus
    prom = PrometheusConnect(url="http://prometheus-operated:9090")

    # Get all agents
    agents = prom.custom_query('group by (agent) (mcp_mesh_requests_total)')

    # Generate dashboard for each agent
    for agent_data in agents:
        agent = agent_data['metric']['agent']
        # File provisioning reads the bare dashboard model, so the
        # {"dashboard": ...} envelope used by the HTTP API is omitted here.
        dashboard = {
            "title": f"MCP Mesh - {agent}",
            "uid": f"mcp-mesh-auto-{agent}",
            "tags": ["mcp-mesh", "auto-generated", agent],
            "panels": generate_panels_for_agent(agent)
        }

        # Save dashboard
        with open(os.path.join(OUTPUT_DIR, f"{agent}.json"), "w") as f:
            json.dump(dashboard, f, indent=2)
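The generator interpolates the agent label straight into the dashboard UID and file name. Agent names that are long or contain characters such as slashes or spaces may not produce a usable UID. The helper below is a hypothetical sketch (make_dashboard_uid is not part of the script above) and rests on two assumptions to verify against your Grafana version: UIDs are limited to 40 characters, and they may contain only letters, digits, hyphens, and underscores.

```python
import hashlib
import re

MAX_UID_LENGTH = 40  # assumed Grafana limit for dashboard UIDs
UID_PREFIX = "mcp-mesh-auto-"


def make_dashboard_uid(agent):
    """Build a UID from an agent name using only letters, digits, '-' and '_'."""
    slug = re.sub(r"[^a-zA-Z0-9_-]+", "-", agent).strip("-")
    uid = UID_PREFIX + slug
    if uid == UID_PREFIX + agent and len(uid) <= MAX_UID_LENGTH:
        return uid
    # The name was altered or is too long: append a short hash of the original
    # name so that distinct agents still get distinct UIDs.
    digest = hashlib.sha1(agent.encode("utf-8")).hexdigest()[:8]
    keep = MAX_UID_LENGTH - len(digest) - 1
    return f"{uid[:keep].rstrip('-')}-{digest}"


print(make_dashboard_uid("weather-agent"))  # prints mcp-mesh-auto-weather-agent
print(make_dashboard_uid("payments/eu west"))
print(len(make_dashboard_uid("a" * 60)))  # prints 40
```

Names that are already safe pass through unchanged, so existing dashboard links keep working. The same slug could be reused for the output file name.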
Configuration Options¶
Feature | Configuration | Purpose |
---|---|---|
Variables | templating.list | Dynamic filtering |
Annotations | annotations.list | Event markers |
Links | links | Navigation between dashboards |
Alerts | alert | Visual alert rules |
Transforms | transformations | Data manipulation |
Examples¶
Example 1: Multi-Cluster Dashboard¶
{
"dashboard": {
"title": "MCP Mesh Multi-Cluster View",
"panels": [
{
"title": "Cluster Comparison",
"type": "table",
"gridPos": { "h": 10, "w": 24, "x": 0, "y": 0 },
"targets": [
{
"expr": "sum by (cluster, agent) (rate(mcp_mesh_requests_total[5m]))",
"format": "table",
"instant": true,
"refId": "A"
}
],
"transformations": [
{
"id": "pivot",
"options": {
"pivotField": "cluster",
"valueField": "Value",
"groupByField": "agent"
}
}
],
"fieldConfig": {
"defaults": {
"custom": {
"displayMode": "color-background",
"colorMode": "value"
},
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": 0 },
{ "color": "yellow", "value": 10 },
{ "color": "red", "value": 50 }
]
}
}
}
}
]
}
}
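The transformation in this panel reshapes the instant query result from one row per (cluster, agent) pair into one row per agent with one column per cluster. A rough model of that reshaping, using made-up cluster and agent names and a hypothetical helper, looks like this:

```python
def group_to_matrix(rows, row_field, column_field, value_field):
    """Reshape flat rows into {row_value: {column_value: value}}."""
    matrix = {}
    for row in rows:
        matrix.setdefault(row[row_field], {})[row[column_field]] = row[value_field]
    return matrix


# Made-up instant-query result: one row per (cluster, agent) pair.
rows = [
    {"cluster": "us-east", "agent": "weather", "Value": 12.5},
    {"cluster": "eu-west", "agent": "weather", "Value": 8.0},
    {"cluster": "us-east", "agent": "billing", "Value": 3.2},
]

matrix = group_to_matrix(rows, "agent", "cluster", "Value")
clusters = sorted({row["cluster"] for row in rows})

# Print one line per agent with a column for every cluster.
print("agent".ljust(10) + "".join(c.ljust(10) for c in clusters))
for agent, values in matrix.items():
    cells = "".join(str(values.get(c, "")).ljust(10) for c in clusters)
    print(agent.ljust(10) + cells)
```

An agent that does not run in a cluster gets an empty cell, as billing does for eu-west here.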
Example 2: SLO Dashboard¶
{
"dashboard": {
"title": "MCP Mesh SLO Tracking",
"panels": [
{
"title": "Error Budget Remaining",
"type": "gauge",
"targets": [
{
"expr": "(1 - ((1 - avg(mcp_mesh:slo:availability)) / (1 - 0.999))) * 100",
"refId": "A"
}
],
"options": {
"showThresholdLabels": true,
"showThresholdMarkers": true,
"text": {
"titleSize": 24,
"valueSize": 48
}
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 25 },
{ "color": "green", "value": 50 }
]
}
}
}
},
{
"title": "SLO Burn Rate",
"type": "timeseries",
"targets": [
{
"expr": "1 - mcp_mesh:slo:availability",
"legendFormat": "Current burn rate",
"refId": "A"
},
{
"expr": "0.001",
"legendFormat": "SLO threshold (99.9%)",
"refId": "B"
}
]
}
]
}
}
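Both SLO panels derive from the same two numbers: the observed availability and the 99.9% target. The sketch below works through the arithmetic with made-up availability values. The function names are illustrative, and the burn-rate definition (observed error ratio divided by the allowed error ratio) is the common convention rather than something this dashboard computes.

```python
SLO_TARGET = 0.999  # the 99.9% target used in the panel queries


def error_budget_remaining(availability, slo_target=SLO_TARGET):
    """Percentage of the error budget still unspent (negative if overspent)."""
    allowed_errors = 1 - slo_target
    actual_errors = 1 - availability
    return (1 - actual_errors / allowed_errors) * 100


def burn_rate(availability, slo_target=SLO_TARGET):
    """Observed error ratio relative to the allowed one (1.0 = on budget)."""
    return (1 - availability) / (1 - slo_target)


for availability in (0.9995, 0.999, 0.998):
    remaining = round(error_budget_remaining(availability), 2)
    rate = round(burn_rate(availability), 2)
    print(f"availability={availability}: budget left {remaining}%, burn rate {rate}")
```

At 99.95% availability half the budget remains and the burn rate is 0.5. At 99.8% the budget is overspent by 100%. The gauge sets min to 0, so it displays an overspent budget as empty rather than negative. The timeseries panel plots the raw error ratio against the 0.001 budget line, which is the same comparison before dividing.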
Best Practices¶
- Organize Dashboards: Use folders and tags
- Use Variables: Enable dynamic filtering
- Add Documentation: Include panel descriptions
- Set Refresh Rates: Balance freshness vs load
- Export/Import: Version control dashboards
Common Pitfalls¶
Pitfall 1: Overwhelming Dashboards¶
Problem: Too many panels, hard to understand
Solution: Create focused dashboards:
{
"dashboard": {
"title": "MCP Mesh - Quick Health",
"description": "High-level health indicators only",
"panels": [
// Limit to 6-8 key metrics
]
}
}
Pitfall 2: Slow Queries¶
Problem: Dashboard takes forever to load
Solution: Use recording rules:
# Instead of complex query in dashboard
histogram_quantile(0.95,
sum by (le) (
rate(mcp_mesh_request_duration_seconds_bucket[5m])
)
)
# Use pre-computed recording rule
mcp_mesh:request_duration:p95
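The recording rule has to be defined in Prometheus before a dashboard can query it. As a sketch, the script below builds a PrometheusRule manifest for the p95 rule and prints it as JSON, which kubectl apply also accepts. It assumes the Prometheus Operator from kube-prometheus-stack is installed. The resource name, group name, 30s interval, and the release label are assumptions to adapt to your installation.

```python
import json

# The expensive dashboard query, evaluated once by Prometheus instead.
P95_EXPR = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(mcp_mesh_request_duration_seconds_bucket[5m])))"
)


def build_recording_rules():
    """Return a PrometheusRule manifest as a plain dictionary."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": "mcp-mesh-recording-rules",   # hypothetical name
            "namespace": "monitoring",
            "labels": {"release": "prometheus"},  # assumed rule selector label
        },
        "spec": {
            "groups": [
                {
                    "name": "mcp-mesh.latency",   # hypothetical group name
                    "interval": "30s",
                    "rules": [
                        {
                            "record": "mcp_mesh:request_duration:p95",
                            "expr": P95_EXPR,
                        }
                    ],
                }
            ]
        },
    }


print(json.dumps(build_recording_rules(), indent=2))
```

Save the output to a file and apply it with kubectl apply -f. Once the rule has been evaluated a few times, the dashboard panel can query the short rule name.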
Testing¶
Validate Dashboard JSON¶
# test_dashboards.py
import json
import glob


def test_dashboard_validity():
    """Validate all dashboard JSON files"""
    for dashboard_file in glob.glob("dashboards/*.json"):
        with open(dashboard_file) as f:
            dashboard = json.load(f)

        # Check required fields
        assert "dashboard" in dashboard
        assert "title" in dashboard["dashboard"]
        assert "panels" in dashboard["dashboard"]

        # Check panels
        for panel in dashboard["dashboard"]["panels"]:
            assert "type" in panel
            assert "gridPos" in panel
            assert "targets" in panel


def test_dashboard_queries():
    """Validate Prometheus queries in dashboards"""
    from prometheus_api_client import PrometheusConnect

    prom = PrometheusConnect(url="http://localhost:9090")
    for dashboard_file in glob.glob("dashboards/*.json"):
        with open(dashboard_file) as f:
            dashboard = json.load(f)

        for panel in dashboard["dashboard"]["panels"]:
            for target in panel.get("targets", []):
                if "expr" in target:
                    # Test query syntax. Prometheus does not understand
                    # Grafana variables such as $agent or $interval, so
                    # replace them with concrete values before this call.
                    try:
                        prom.custom_query(target["expr"])
                    except Exception as e:
                        raise AssertionError(
                            f"Invalid query in {dashboard_file}: {target['expr']}"
                        ) from e
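Another check worth automating is that every variable a panel query references is actually defined in the dashboard's templating.list. The sketch below is a hypothetical helper that takes the dashboard model itself (pass dashboard["dashboard"] when the file uses the API envelope). It treats any name that starts with two underscores as a Grafana built-in, which is a simplification.

```python
import re

# Matches $name and ${name} (including ${name:format}) and captures the name.
VARIABLE_PATTERN = re.compile(r"\$\{?(\w+)")


def undefined_variables(dashboard):
    """Return variables used in panel queries but missing from templating.list."""
    defined = {v["name"] for v in dashboard.get("templating", {}).get("list", [])}
    missing = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            for name in VARIABLE_PATTERN.findall(target.get("expr", "")):
                # Names such as __range or __rate_interval are built in.
                if not name.startswith("__") and name not in defined:
                    missing.add(name)
    return sorted(missing)


# A cut-down dashboard that defines $agent but forgets $interval.
agent_dashboard = {
    "templating": {"list": [{"name": "agent", "type": "query"}]},
    "panels": [
        {
            "title": "Request Types Distribution",
            "targets": [
                {
                    "expr": "sum by (method) (increase("
                    'mcp_mesh_requests_total{agent="$agent"}[$interval]))'
                }
            ],
        },
        {
            "title": "Total Rate",
            "targets": [
                {"expr": "sum(rate(mcp_mesh_requests_total[$__rate_interval]))"}
            ],
        },
    ],
}
print(undefined_variables(agent_dashboard))  # prints ['interval']
```

Asserting that this list is empty for each dashboard file catches snippets that were copied between dashboards without their variable definitions.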
Performance Testing¶
#!/bin/bash
# test-dashboard-performance.sh
GRAFANA_URL="http://localhost:3000"
DASHBOARD_UID="mcp-mesh-overview"
# Measure dashboard load time
time curl -s -o /dev/null \
-H "Authorization: Bearer $GRAFANA_API_KEY" \
"$GRAFANA_URL/api/dashboards/uid/$DASHBOARD_UID"
# Check panel query performance
curl -s \
-H "Authorization: Bearer $GRAFANA_API_KEY" \
"$GRAFANA_URL/api/dashboards/uid/$DASHBOARD_UID" | \
jq -r '.dashboard.panels[].targets[].expr' | \
while read -r query; do
echo "Testing query: $query"
time curl -s -o /dev/null --data-urlencode "query=$query" \
"http://prometheus:9090/api/v1/query"
done
Monitoring and Debugging¶
Monitor Grafana Performance¶
# grafana-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-monitoring-dashboard
namespace: monitoring
data:
grafana-performance.json: |
{
"dashboard": {
"title": "Grafana Performance",
"panels": [
{
"title": "Dashboard Load Time",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, grafana_api_dashboard_get_milliseconds_bucket)",
"legendFormat": "p95 load time"
}
]
},
{
"title": "Active Users",
"type": "stat",
"targets": [
{
"expr": "grafana_stat_active_users"
}
]
}
]
}
}
Debug Dashboard Issues¶
# Enable debug logging
kubectl set env deployment/prometheus-grafana -n monitoring \
GF_LOG_LEVEL=debug
# View Grafana logs
kubectl logs -n monitoring deployment/prometheus-grafana -f
# Check dashboard provisioning
kubectl exec -n monitoring deployment/prometheus-grafana -- \
ls -la /var/lib/grafana/dashboards/
# Test data source connection
kubectl exec -n monitoring deployment/prometheus-grafana -- \
curl -s 'http://prometheus-operated:9090/api/v1/query?query=up'
Troubleshooting¶
Issue 1: Dashboard Not Loading¶
Symptoms: Dashboard shows "No Data" or loading spinner
Cause: Data source misconfiguration or query errors
Solution:
# Check data source configuration
curl -s -H "Authorization: Bearer $API_KEY" \
http://localhost:3000/api/datasources
# Test query directly in Prometheus
curl -G http://localhost:9090/api/v1/query \
--data-urlencode 'query=mcp_mesh_requests_total'
# Check time range
# Ensure data exists for selected time range
Issue 2: Variables Not Working¶
Symptoms: Template variables show "None" or don't filter
Cause: Incorrect query or label values
Solution:
{
"templating": {
"list": [
{
"name": "agent",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{job=~\"mcp-mesh.*\"}, instance)",
"refresh": 2, // on time range change
"sort": 1 // alphabetical
}
]
}
}
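A frequent cause of variables that "don't filter" is a multi-value variable used with an exact matcher. When several values are selected, Grafana substitutes a regex alternation, which works only with the =~ operator. The sketch below is a simplified model of that substitution, not Grafana's implementation. The escaping rules and the helper names are assumptions for illustration.

```python
import re

REGEX_SPECIALS = re.compile(r"[\\^$.*+?()\[\]{}|]")


def escape_regex(value):
    """Backslash-escape characters that have a special meaning in a regex."""
    return REGEX_SPECIALS.sub(lambda m: "\\" + m.group(0), value)


def interpolate(expr, variables):
    """Replace $name in expr; list values become an escaped (a|b) alternation."""

    def render(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # leave unknown variables untouched
        value = variables[name]
        if isinstance(value, str):
            return value  # plain values (for example an allValue) are used as-is
        escaped = [escape_regex(v) for v in value]
        return escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"

    return re.sub(r"\$(\w+)", render, expr)


expr = 'rate(mcp_mesh_requests_total{namespace=~"$namespace", agent=~"$agent"}[$interval])'
selected = {"namespace": ".*", "agent": ["weather", "billing"], "interval": "5m"}
result = interpolate(expr, selected)
print(result)
```

The agent matcher becomes agent=~"(weather|billing)". With an exact matcher the same substitution would yield agent="(weather|billing)", which matches no series, so either switch to =~ or make the variable single-select.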
For more issues, see this section's troubleshooting guide.
Known Limitations¶
- Panel Limit: Performance degrades with >50 panels
- Query Timeout: Default 30s timeout for queries
- Variable Cardinality: High cardinality slows dropdown
- Annotation Limit: Too many annotations impact performance
TODO¶
- Add dashboard templating with Jsonnet
- Create mobile-responsive dashboards
- Implement dashboard versioning
- Add automated screenshot testing
- Create dashboard marketplace
Summary¶
You now have comprehensive Grafana dashboards for MCP Mesh.
Key takeaways:
- Multi-level dashboards from overview to details
- Dynamic filtering with variables
- Performance optimization techniques
- Automated provisioning and generation
Next Steps¶
Let's add distributed tracing to correlate with metrics.
Continue to Distributed Tracing →
Tip: Use Grafana's built-in Explore mode to test queries before adding them to dashboards
Reference: Grafana Best Practices
Try It: Create a custom dashboard for your specific use case using the examples as templates