Troubleshooting K8s Deployments
Comprehensive guide to diagnosing and fixing common MCP Mesh deployment issues on Kubernetes
Overview
This guide covers the most common issues encountered when deploying MCP Mesh on Kubernetes. Each issue lists symptoms, diagnostic steps, likely root causes, and solutions. The sections cover pod failures, networking issues, storage problems, and performance bottlenecks.
Quick Diagnostics
Run this diagnostic script. It takes the namespace as an optional first argument and defaults to mcp-mesh:
#!/bin/bash
# mcp-mesh-k8s-diagnostics.sh
NAMESPACE=${1:-mcp-mesh}
echo "MCP Mesh Kubernetes Diagnostics for namespace: $NAMESPACE"
echo "======================================================="
# Check namespace exists
echo -e "\n1. Checking namespace..."
kubectl get namespace "$NAMESPACE" || {
  echo "ERROR: Namespace $NAMESPACE not found"
  exit 1
}
# Check pods
echo -e "\n2. Pod Status:"
kubectl get pods -n $NAMESPACE -o wide
echo -e "\nProblematic pods:"
kubectl get pods -n $NAMESPACE --field-selector=status.phase!=Running,status.phase!=Succeeded
# Check services
echo -e "\n3. Service Status:"
kubectl get svc -n $NAMESPACE
echo -e "\nService endpoints:"
kubectl get endpoints -n $NAMESPACE
# Check registry
echo -e "\n4. Registry Status:"
kubectl get statefulset,pod,svc -n $NAMESPACE -l app.kubernetes.io/name=mcp-mesh-registry
# Check events
echo -e "\n5. Recent Events:"
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20
# Check resource usage (kubectl top requires metrics-server in the cluster)
echo -e "\n6. Resource Usage:"
kubectl top nodes
kubectl top pods -n $NAMESPACE
# Check persistent volumes
echo -e "\n7. Storage:"
kubectl get pvc -n $NAMESPACE
# Network connectivity test
echo -e "\n8. Network Test:"
kubectl run test-network --rm -it --image=busybox --restart=Never -n $NAMESPACE -- \
sh -c "nslookup mcp-mesh-registry && echo 'DNS OK' || echo 'DNS FAILED'"
Common Issues and Solutions
Issue 1: Pods Stuck in Pending State
Symptoms:
NAME READY STATUS RESTARTS AGE
mcp-mesh-registry-0 0/1 Pending 0 5m
weather-agent-abc123 0/1 Pending 0 3m
Diagnosis:
# Check pod events
kubectl describe pod <pod-name> -n mcp-mesh
# Check node resources
kubectl describe nodes
kubectl top nodes
# Check PVC status
kubectl get pvc -n mcp-mesh
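The scheduler records why it could not place a pod as a FailedScheduling event, so filtering events by pod name is often quicker than reading the full describe output. A minimal sketch, assuming a helper named pending_reason (hypothetical, not part of kubectl or MCP Mesh):

```shell
# pending_reason: hypothetical helper, not part of kubectl or MCP Mesh.
# Prints the scheduler's FailedScheduling messages for a single pod.
pending_reason() {
  local pod="$1" namespace="${2:-mcp-mesh}"
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}

# Usage:
#   pending_reason mcp-mesh-registry-0 mcp-mesh
```

A message mentioning insufficient CPU or memory points to the first cause below, an unbound PersistentVolumeClaim to the second, and a node selector or affinity mismatch to the third.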
Common Causes and Solutions:
- Insufficient Resources
# Check resource requests
kubectl describe pod <pod-name> -n mcp-mesh | grep -A10 Requests
# Solution: Scale down other pods or add nodes
kubectl scale deployment <other-deployment> --replicas=0 -n mcp-mesh
# Or reduce resource requests
kubectl patch deployment <deployment-name> -n mcp-mesh -p '
{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "agent",
          "resources": {
            "requests": {
              "cpu": "50m",
              "memory": "64Mi"
            }
          }
        }]
      }
    }
  }
}'
- PVC Not Bound
# Check PVC status
kubectl get pvc -n mcp-mesh
# Check available storage classes
kubectl get storageclass
# Create PVC with correct storage class
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-data
  namespace: mcp-mesh
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard # Use available class
  resources:
    requests:
      storage: 5Gi
EOF
- Node Selector/Affinity Not Satisfied
# Check node labels
kubectl get nodes --show-labels
# Remove node selector temporarily
kubectl patch deployment <deployment-name> -n mcp-mesh --type='json' -p='[
{"op": "remove", "path": "/spec/template/spec/nodeSelector"}
]'
Issue 2: Pods in CrashLoopBackOff
Symptoms:
- Pods restart repeatedly and kubectl get pods reports CrashLoopBackOff in the STATUS column
Diagnosis:
# Check logs from current run
kubectl logs <pod-name> -n mcp-mesh
# Check logs from previous run
kubectl logs <pod-name> -n mcp-mesh --previous
# Check container exit code
kubectl describe pod <pod-name> -n mcp-mesh | grep -A10 "Last State"
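The exit code in the "Last State" output narrows down the cause before you read any logs. The sketch below maps the conventional codes to a likely meaning; both helper names are hypothetical, and the hints are general Unix and Kubernetes conventions rather than anything specific to MCP Mesh.

```shell
# explain_exit_code: hypothetical helper that maps a container exit code to
# its conventional meaning. Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited without error; the main process is not staying in the foreground" ;;
    1)   echo "application error; read the logs from the previous run" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check the image and entrypoint" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, check memory limits" ;;
    139) echo "SIGSEGV (128+11); segmentation fault in the process" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code $code"
      fi
      ;;
  esac
}

# last_exit_code: reads the exit code of the previous run of the pod's
# first container.
last_exit_code() {
  local pod="$1" namespace="${2:-mcp-mesh}"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Usage:
#   explain_exit_code "$(last_exit_code <pod-name> mcp-mesh)"
```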
Common Causes and Solutions:
- Missing Environment Variables
# Check current env vars
kubectl exec <pod-name> -n mcp-mesh -- env
# Add missing variables
kubectl set env deployment/<deployment-name> \
MCP_MESH_REGISTRY_URL=http://mcp-mesh-registry:8000 \
-n mcp-mesh
- Registry Connection Failed
# Add init container to wait for registry
spec:
  initContainers:
    - name: wait-for-registry
      image: busybox:1.35
      command: ["sh", "-c"]
      args:
        - |
          until nc -z mcp-mesh-registry 8000; do
            echo "Waiting for registry..."
            sleep 2
          done
- Permission Errors
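One kind of permission error can be checked from outside a crashing pod: a missing RBAC permission on the pod's service account. The sketch below asks the API server directly with kubectl auth can-i. The helper names are hypothetical, and the verb and resource in the usage lines are placeholders, since this guide does not say which API objects an agent reads.

```shell
# pod_service_account: hypothetical helper that prints the service account
# a pod runs as.
pod_service_account() {
  local pod="$1" namespace="${2:-mcp-mesh}"
  kubectl get pod "$pod" -n "$namespace" -o jsonpath='{.spec.serviceAccountName}'
}

# check_sa_permission: hypothetical helper that asks the API server whether
# a service account may perform a verb on a resource. Prints "yes" or "no".
check_sa_permission() {
  local namespace="$1" service_account="$2" verb="$3" resource="$4"
  kubectl auth can-i "$verb" "$resource" \
    --as="system:serviceaccount:${namespace}:${service_account}" \
    -n "$namespace"
}

# Usage (verb and resource are placeholders for whatever the agent needs):
#   sa=$(pod_service_account <pod-name> mcp-mesh)
#   check_sa_permission mcp-mesh "$sa" get configmaps
```

If the answer is "no", the fix is a Role and RoleBinding for that service account. File permission errors inside the container are a separate case and usually show up in the logs of the previous run.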
Issue 3: Service Discovery Not Working
Symptoms:
- Agents can't find registry
- "connection refused" errors
- DNS resolution failures
Diagnosis:
# Test DNS from pod
kubectl exec -it <pod-name> -n mcp-mesh -- nslookup mcp-mesh-registry
# Check service endpoints
kubectl get endpoints mcp-mesh-registry -n mcp-mesh
# Test connectivity
kubectl exec -it <pod-name> -n mcp-mesh -- wget -O- http://mcp-mesh-registry:8000/health
Solutions:
- DNS Issues
- Service Selector Mismatch
# Verify that the service selector matches the pod labels
kubectl get svc mcp-mesh-registry -n mcp-mesh -o yaml | grep -A5 selector
kubectl get pods -n mcp-mesh -l app.kubernetes.io/name=mcp-mesh-registry --show-labels
- Network Policy Blocking
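# Check network policies
kubectl get networkpolicy -n mcp-mesh
# Temporarily disable (save the policies first so they can be restored with kubectl apply)
kubectl get networkpolicy -n mcp-mesh -o yaml > networkpolicies-backup.yaml
kubectl delete networkpolicy --all -n mcp-mesh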
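For the DNS case in the list above, it helps to check the cluster DNS pods themselves and to retry the lookup with the fully qualified service name. This is a sketch under one assumption: cluster DNS runs as CoreDNS in kube-system with the label k8s-app=kube-dns, which is the default on kubeadm and Minikube clusters but may differ on managed platforms. The helper names are hypothetical.

```shell
# Assumption: cluster DNS pods live in kube-system and carry this label.
DNS_NAMESPACE=kube-system
DNS_SELECTOR=k8s-app=kube-dns

# dns_health: hypothetical helper that shows DNS pod status and recent logs.
dns_health() {
  kubectl get pods -n "$DNS_NAMESPACE" -l "$DNS_SELECTOR" -o wide
  kubectl logs -n "$DNS_NAMESPACE" -l "$DNS_SELECTOR" --tail=20
}

# registry_fqdn: builds the fully qualified service name. Resolving it rules
# out search-domain problems when the short name fails.
registry_fqdn() {
  local namespace="${1:-mcp-mesh}" cluster_domain="${2:-cluster.local}"
  echo "mcp-mesh-registry.${namespace}.svc.${cluster_domain}"
}

# Usage:
#   dns_health
#   kubectl exec -it <pod-name> -n mcp-mesh -- nslookup "$(registry_fqdn)"
```

If the fully qualified name resolves but the short name does not, the agent is probably running in a different namespace than the registry.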
Issue 4: High Memory/CPU Usage
Symptoms:
- Pods getting OOMKilled
- Slow response times
- Node pressure
Diagnosis:
# Check resource usage
kubectl top pods -n mcp-mesh
kubectl describe pod <pod-name> -n mcp-mesh | grep -A20 Containers
# Check per-process memory inside the container (repeat over time; steadily growing RSS suggests a leak)
kubectl exec <pod-name> -n mcp-mesh -- ps aux
Solutions:
- Increase Resource Limits
- Enable Horizontal Pod Autoscaling
- Optimize Application
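The first two solutions can be applied with standard kubectl commands. The sketch below wraps them in two hypothetical helpers. The CPU and memory figures and the 70% CPU target are illustrative starting points, not values recommended by MCP Mesh, and autoscaling assumes metrics-server is installed and the pods declare CPU requests.

```shell
# raise_limits: hypothetical helper that sets requests and limits on every
# container of a deployment. The numbers are illustrative only.
raise_limits() {
  local deployment="$1" namespace="${2:-mcp-mesh}"
  kubectl set resources "deployment/$deployment" -n "$namespace" \
    --requests=cpu=100m,memory=128Mi \
    --limits=cpu=500m,memory=512Mi
}

# enable_hpa: hypothetical helper that creates a HorizontalPodAutoscaler
# scaling between 2 and 10 replicas at 70% average CPU utilization.
enable_hpa() {
  local deployment="$1" namespace="${2:-mcp-mesh}"
  kubectl autoscale deployment "$deployment" -n "$namespace" \
    --min=2 --max=10 --cpu-percent=70
}

# Usage:
#   raise_limits <deployment-name> mcp-mesh
#   enable_hpa <deployment-name> mcp-mesh
```

Raising limits helps when pods are OOMKilled under normal load. Autoscaling helps when load varies, but it will not fix a memory leak.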
Issue 5: Persistent Volume Issues
Symptoms:
- Data loss after pod restart
- Permission denied errors
- Disk full errors
Diagnosis:
# Check PVC status
kubectl get pvc -n mcp-mesh
kubectl describe pvc <pvc-name> -n mcp-mesh
# Check disk usage in pod
kubectl exec <pod-name> -n mcp-mesh -- df -h
Solutions:
- Expand PVC
# Works only if the storage class sets allowVolumeExpansion: true
kubectl patch pvc <pvc-name> -n mcp-mesh -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
- Fix Permissions
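"Permission denied" on a mounted volume usually means the container's user does not own the mount path. A common remedy is a pod-level fsGroup, which on most volume types makes Kubernetes change the group ownership of the volume at mount time. A minimal sketch: the helper name is hypothetical, GID 1000 is an assumption to replace with the group your container runs as, and /data is the registry data path used in the backup commands of this guide.

```shell
# fsgroup_patch: hypothetical helper that prints a strategic merge patch
# setting the pod-level fsGroup to the given GID.
fsgroup_patch() {
  local gid="$1"
  printf '{"spec":{"template":{"spec":{"securityContext":{"fsGroup":%d}}}}}' "$gid"
}

# Usage:
#   # 1. See which user the process runs as and who owns the mount
#   kubectl exec <pod-name> -n mcp-mesh -- sh -c 'id; ls -ld /data'
#   # 2. Apply the patch (GID 1000 is an assumption)
#   kubectl patch statefulset mcp-mesh-registry -n mcp-mesh -p "$(fsgroup_patch 1000)"
```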
Issue 6: Image Pull Errors
Symptoms:
Failed to pull image "mcpmesh/python-runtime:0.5": rpc error: code = Unknown desc = Error response from daemon: pull access denied
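To find every affected pod at once, you can list the pods whose containers are waiting with ErrImagePull or ImagePullBackOff, along with the image they reference. The helper below is a hypothetical sketch and, to stay short, inspects only the first container of each pod.

```shell
# image_pull_failures: hypothetical helper that prints "pod<TAB>image<TAB>reason"
# for pods whose first container is stuck pulling its image.
image_pull_failures() {
  local namespace="${1:-mcp-mesh}"
  kubectl get pods -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\t"}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}' |
    awk -F'\t' '$3 == "ErrImagePull" || $3 == "ImagePullBackOff"'
}

# Usage:
#   image_pull_failures mcp-mesh
```

A typo in the image name or tag appears in the second column. If the name is correct, the cause is usually missing registry credentials or a local image that was never pushed.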
Solutions:
- For Minikube Local Images
# Use Minikube's Docker
eval $(minikube docker-env)
docker build -t mcp-mesh/agent:0.2 .
# Set imagePullPolicy
kubectl patch deployment <deployment-name> -n mcp-mesh -p '
{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "agent",
          "imagePullPolicy": "Never"
        }]
      }
    }
  }
}'
- For Private Registry
# Create pull secret
kubectl create secret docker-registry regcred \
--docker-server=myregistry.io \
--docker-username=user \
--docker-password=pass \
--docker-email=email@example.com \
-n mcp-mesh
# Add to deployment
kubectl patch deployment <deployment-name> -n mcp-mesh -p '
{
  "spec": {
    "template": {
      "spec": {
        "imagePullSecrets": [{"name": "regcred"}]
      }
    }
  }
}'
Performance Troubleshooting
Slow Agent Startup
Diagnosis:
# Check startup time
kubectl logs <pod-name> -n mcp-mesh | grep -E "started|ready"
# Profile startup
kubectl exec <pod-name> -n mcp-mesh -- python -m cProfile -o profile.stats agent.py
Solutions:
- Add startup probe with longer timeout
- Optimize imports and initialization
- Use init containers for pre-warming
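For the startup probe suggestion, the time a container gets before the kubelet restarts it is failureThreshold multiplied by periodSeconds. The sketch below prints a patch that adds such a probe and computes the resulting budget. Both helpers are hypothetical, and the probe target is an assumption: a container named "agent" serving GET /health on port 8080.

```shell
# startup_probe_patch: hypothetical helper that prints a strategic merge patch
# adding a startupProbe. Assumes a container named "agent" that answers
# GET /health on port 8080.
startup_probe_patch() {
  local failure_threshold="$1" period_seconds="$2"
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"agent","startupProbe":{"httpGet":{"path":"/health","port":8080},"failureThreshold":%d,"periodSeconds":%d}}]}}}}' \
    "$failure_threshold" "$period_seconds"
}

# startup_budget: maximum startup time in seconds before a restart.
startup_budget() {
  echo $(( $1 * $2 ))
}

# Usage:
#   startup_budget 30 10    # prints 300, i.e. up to 5 minutes to start
#   kubectl patch deployment <deployment-name> -n mcp-mesh -p "$(startup_probe_patch 30 10)"
```

While the startup probe has not yet succeeded, liveness and readiness probes are paused, so a slow start no longer triggers liveness restarts.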
High Latency Between Agents
Diagnosis:
# Test network latency
kubectl exec -it <pod-name> -n mcp-mesh -- ping <other-pod-ip>
# Check service mesh metrics (if using Istio)
kubectl exec -it <pod-name> -c istio-proxy -n mcp-mesh -- curl localhost:15000/stats/prometheus
Solutions:
- Use pod affinity (or node affinity to a shared node pool) to colocate related agents
- Enable pod topology spread constraints
- Optimize serialization/deserialization
Debugging Tools and Commands
Essential kubectl Commands
# Get comprehensive pod info
kubectl get pod <pod-name> -n mcp-mesh -o yaml
# Watch pod status changes
kubectl get pods -n mcp-mesh -w
# Get all resources in namespace
kubectl get all -n mcp-mesh
# Describe a problematic resource (choose one resource type)
kubectl describe <pod|deployment|service> <name> -n mcp-mesh
# Check RBAC permissions
kubectl auth can-i --list --namespace=mcp-mesh
Advanced Debugging
# Enable verbose logging
kubectl set env deployment/<deployment-name> LOG_LEVEL=DEBUG -n mcp-mesh
# Port forward for direct access
kubectl port-forward pod/<pod-name> 8080:8080 -n mcp-mesh
# Copy files from pod
kubectl cp <pod-name>:/path/to/file ./local-file -n mcp-mesh
# Run debug container
kubectl debug <pod-name> -it --image=busybox -n mcp-mesh
Monitoring Commands
# Real-time resource monitoring
watch -n 2 'kubectl top pods -n mcp-mesh'
# Check cluster events
kubectl get events -n mcp-mesh --sort-by='.lastTimestamp' -w
# View API server logs (audit events appear here only if the audit log is written to stdout)
kubectl logs -n kube-system -l component=kube-apiserver | grep mcp-mesh
Recovery Procedures
Emergency Pod Recovery
#!/bin/bash
# emergency-recovery.sh
NAMESPACE=mcp-mesh
echo "Starting emergency recovery..."
# Delete stuck pods
kubectl delete pods --field-selector=status.phase=Failed -n $NAMESPACE
kubectl delete pods --field-selector=status.phase=Unknown -n $NAMESPACE
# Restart all deployments
kubectl rollout restart deployment -n $NAMESPACE
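# Force delete a stuck PVC (last resort: clearing finalizers bypasses the in-use protection)
# Pass the PVC name as the first argument to this script
PVC_NAME=${1:-}
if [ -n "$PVC_NAME" ]; then
  kubectl patch pvc "$PVC_NAME" -n $NAMESPACE -p '{"metadata":{"finalizers":null}}'
fi
# Reset failed jobs (note: this also matches jobs that are still running)
kubectl delete jobs --field-selector=status.successful=0 -n $NAMESPACE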
echo "Recovery complete. Checking status..."
kubectl get all -n $NAMESPACE
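The final kubectl get all shows state at a single moment, while restarted deployments may still be rolling out. A sketch of a follow-up step that waits for every deployment and reports the ones that fail; the helper name and the 120s default timeout are assumptions.

```shell
# wait_for_rollouts: hypothetical helper. Waits for every deployment in the
# namespace to finish rolling out; returns non-zero if any of them fails.
wait_for_rollouts() {
  local namespace="${1:-mcp-mesh}" timeout="${2:-120s}" failed=0 name
  for name in $(kubectl get deployments -n "$namespace" -o name); do
    if ! kubectl rollout status "$name" -n "$namespace" --timeout="$timeout"; then
      echo "Rollout failed: $name" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   wait_for_rollouts mcp-mesh 180s || echo "Some deployments did not recover"
```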
Data Recovery
# Backup registry data
kubectl exec mcp-mesh-registry-0 -n mcp-mesh -- \
tar czf /tmp/backup.tar.gz /data
kubectl cp mcp-mesh-registry-0:/tmp/backup.tar.gz ./registry-backup.tar.gz -n mcp-mesh
# Restore registry data
kubectl cp ./registry-backup.tar.gz mcp-mesh-registry-0:/tmp/backup.tar.gz -n mcp-mesh
kubectl exec mcp-mesh-registry-0 -n mcp-mesh -- \
tar xzf /tmp/backup.tar.gz -C /
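A backup is only useful if it can be read back, and kubectl cp can leave a truncated file if the connection drops. The sketch below checks the local archive before you rely on it; the helper name is hypothetical.

```shell
# verify_backup: hypothetical helper. Succeeds only if the archive exists,
# is a readable gzip tarball, and contains at least one entry.
verify_backup() {
  local archive="$1" listing entries
  [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
  listing=$(tar tzf "$archive" 2>/dev/null) || {
    echo "unreadable archive: $archive" >&2
    return 1
  }
  [ -n "$listing" ] || { echo "archive has no entries: $archive" >&2; return 1; }
  entries=$(printf '%s\n' "$listing" | wc -l | tr -d ' ')
  echo "$archive: $entries entries"
}

# Usage:
#   verify_backup ./registry-backup.tar.gz || echo "Do not restore from this file"
```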
Prevention Strategies
Resource Management
# Set up ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mcp-mesh-quota
  namespace: mcp-mesh
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    persistentvolumeclaims: "10"
    pods: "50"
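Pod Disruption Budgets
# Ensure availability during voluntary disruptions such as node drains
# Note: minAvailable: 2 needs at least 3 registry replicas; with 2 or fewer,
# the budget blocks every voluntary eviction
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: registry-pdb
  namespace: mcp-mesh
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: mcp-mesh-registry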
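With minAvailable, the number of pods that may be evicted voluntarily is the number of healthy pods minus minAvailable, and never less than zero. The hypothetical helper below does that arithmetic so you can check a budget against your replica count before applying it.

```shell
# allowed_disruptions: hypothetical helper. Computes how many pods a
# minAvailable budget lets the cluster evict at once.
allowed_disruptions() {
  local healthy="$1" min_available="$2" allowed
  allowed=$(( $healthy - $min_available ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# Usage:
#   allowed_disruptions 3 2    # prints 1: one registry pod may be evicted
#   allowed_disruptions 1 2    # prints 0: node drains will block
#   # Compare with what the cluster reports for the budget above:
#   kubectl get pdb registry-pdb -n mcp-mesh -o jsonpath='{.status.disruptionsAllowed}'
```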
Monitoring Setup
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-mesh-agents
  namespace: mcp-mesh
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Getting Help
If these solutions don't resolve your issue:
- Collect Diagnostics:
- Check MCP Mesh Logs:
- Community Resources:
- GitHub Issues: https://github.com/dhyansraj/mcp-mesh/issues
- Kubernetes Slack: #mcp-mesh channel
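For the first two items in the list above, it helps to capture resources, events, and registry logs in a single archive that can be attached to an issue. This is a sketch, not an official MCP Mesh tool: the helper name, file names, and the 200-line log tail are assumptions.

```shell
# collect_diagnostics: hypothetical helper that saves cluster state to files
# and packs them into one archive. Prints the archive name on success.
collect_diagnostics() {
  local namespace="${1:-mcp-mesh}" out_dir="${2:-mcp-mesh-diagnostics}"
  mkdir -p "$out_dir" || return 1
  kubectl get all -n "$namespace" -o wide > "$out_dir/resources.txt" 2>&1
  kubectl describe pods -n "$namespace" > "$out_dir/pods-describe.txt" 2>&1
  kubectl get events -n "$namespace" --sort-by='.lastTimestamp' > "$out_dir/events.txt" 2>&1
  kubectl logs -n "$namespace" -l app.kubernetes.io/name=mcp-mesh-registry \
    --tail=200 > "$out_dir/registry-logs.txt" 2>&1
  tar czf "$out_dir.tar.gz" "$out_dir" && echo "$out_dir.tar.gz"
}

# Usage:
#   collect_diagnostics mcp-mesh
```

Review the files before sharing them, since pod descriptions and logs can contain environment variables, hostnames, and other sensitive values.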
💡 Tip: Always check kubectl get events -n mcp-mesh first; most issues are explained there
📚 Reference: Kubernetes Troubleshooting Guide
🔍 Debug Mode: Set MCP_MESH_DEBUG=true in the pod environment for verbose logging