Troubleshooting K8s Deployments

Comprehensive guide to diagnosing and fixing common MCP Mesh deployment issues on Kubernetes

Overview

This troubleshooting guide addresses the most common issues encountered when deploying MCP Mesh on Kubernetes. Each issue includes symptoms, diagnostic steps, root cause analysis, and proven solutions. We'll cover pod failures, networking issues, storage problems, and performance bottlenecks.

Quick Diagnostics

Run this comprehensive diagnostic script:

#!/bin/bash
# mcp-mesh-k8s-diagnostics.sh

NAMESPACE=${1:-mcp-mesh}

echo "MCP Mesh Kubernetes Diagnostics for namespace: $NAMESPACE"
echo "======================================================="

# Check namespace exists
echo -e "\n1. Checking namespace..."
kubectl get namespace $NAMESPACE || {
    echo "ERROR: Namespace $NAMESPACE not found"
    exit 1
}

# Check pods
echo -e "\n2. Pod Status:"
kubectl get pods -n $NAMESPACE -o wide
echo -e "\nProblematic pods:"
kubectl get pods -n $NAMESPACE --field-selector=status.phase!=Running,status.phase!=Succeeded

# Check services
echo -e "\n3. Service Status:"
kubectl get svc -n $NAMESPACE
echo -e "\nService endpoints:"
kubectl get endpoints -n $NAMESPACE

# Check registry
echo -e "\n4. Registry Status:"
kubectl get statefulset,pod,svc -n $NAMESPACE -l app.kubernetes.io/name=mcp-mesh-registry

# Check events
echo -e "\n5. Recent Events:"
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20

# Check resource usage
echo -e "\n6. Resource Usage:"
kubectl top nodes 2>/dev/null || echo "(metrics-server not available)"
kubectl top pods -n $NAMESPACE 2>/dev/null

# Check persistent volumes
echo -e "\n7. Storage:"
kubectl get pvc -n $NAMESPACE

# Network connectivity test
echo -e "\n8. Network Test:"
kubectl run test-network --rm -it --image=busybox --restart=Never -n $NAMESPACE -- \
    sh -c "nslookup mcp-mesh-registry && echo 'DNS OK' || echo 'DNS FAILED'"
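
Make the script executable and point it at your namespace (the argument defaults to mcp-mesh):

chmod +x mcp-mesh-k8s-diagnostics.sh
./mcp-mesh-k8s-diagnostics.sh mcp-mesh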

Common Issues and Solutions

Issue 1: Pods Stuck in Pending State

Symptoms:

NAME                          READY   STATUS    RESTARTS   AGE
mcp-mesh-registry-0           0/1     Pending   0          5m
weather-agent-abc123          0/1     Pending   0          3m

Diagnosis:

# Check pod events
kubectl describe pod <pod-name> -n mcp-mesh

# Check node resources
kubectl describe nodes
kubectl top nodes

# Check PVC status
kubectl get pvc -n mcp-mesh

Common Causes and Solutions:

  1. Insufficient Resources
# Check resource requests
kubectl describe pod <pod-name> -n mcp-mesh | grep -A10 Requests

# Solution: Scale down other pods or add nodes
kubectl scale deployment <other-deployment> --replicas=0 -n mcp-mesh

# Or reduce resource requests
kubectl patch deployment <deployment-name> -n mcp-mesh -p '
{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "agent",
          "resources": {
            "requests": {
              "cpu": "50m",
              "memory": "64Mi"
            }
          }
        }]
      }
    }
  }
}'
  2. PVC Not Bound
# Check PVC status
kubectl get pvc -n mcp-mesh

# Check available storage classes
kubectl get storageclass

# Create PVC with correct storage class
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-data
  namespace: mcp-mesh
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard  # Use available class
  resources:
    requests:
      storage: 5Gi
EOF
  3. Node Selector/Affinity Not Satisfied
# Check node labels
kubectl get nodes --show-labels

# Remove node selector temporarily
kubectl patch deployment <deployment-name> -n mcp-mesh --type='json' -p='[
  {"op": "remove", "path": "/spec/template/spec/nodeSelector"}
]'

Issue 2: Pods in CrashLoopBackOff

Symptoms:

NAME                          READY   STATUS             RESTARTS   AGE
analytics-agent-xyz789        0/1     CrashLoopBackOff   5          10m

Diagnosis:

# Check logs from current run
kubectl logs <pod-name> -n mcp-mesh

# Check logs from previous run
kubectl logs <pod-name> -n mcp-mesh --previous

# Check container exit code
kubectl describe pod <pod-name> -n mcp-mesh | grep -A10 "Last State"

Common Causes and Solutions:

  1. Missing Environment Variables
# Check current env vars
kubectl exec <pod-name> -n mcp-mesh -- env

# Add missing variables
kubectl set env deployment/<deployment-name> \
  MCP_MESH_REGISTRY_URL=http://mcp-mesh-registry:8000 \
  -n mcp-mesh
  2. Registry Connection Failed
# Add init container to wait for registry
spec:
  initContainers:
    - name: wait-for-registry
      image: busybox:1.35
      command: ["sh", "-c"]
      args:
        - |
          until nc -z mcp-mesh-registry 8000; do
            echo "Waiting for registry..."
            sleep 2
          done
  3. Permission Errors
    # Fix file permissions
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
      containers:
        - name: agent
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
    

Issue 3: Service Discovery Not Working

Symptoms:

  • Agents can't find the registry
  • "connection refused" errors
  • DNS resolution failures

Diagnosis:

# Test DNS from pod
kubectl exec -it <pod-name> -n mcp-mesh -- nslookup mcp-mesh-registry

# Check service endpoints
kubectl get endpoints mcp-mesh-registry -n mcp-mesh

# Test connectivity
kubectl exec -it <pod-name> -n mcp-mesh -- wget -O- http://mcp-mesh-registry:8000/health

Solutions:

  1. DNS Issues
# Configure pod DNS
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
      - name: ndots
        value: "1"
  2. Service Selector Mismatch
# Verify labels match
kubectl get svc mcp-mesh-registry -n mcp-mesh -o yaml | grep -A5 selector
kubectl get pods -n mcp-mesh -l app.kubernetes.io/name=mcp-mesh-registry --show-labels
  3. Network Policy Blocking
# Check network policies
kubectl get networkpolicy -n mcp-mesh

# Temporarily disable all policies (destructive; prefer the allow rule below)
kubectl delete networkpolicy --all -n mcp-mesh
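
Rather than deleting policies outright, it is usually safer to add an explicit allow rule. A minimal sketch, assuming the registry listens on port 8000 and agents carry the app.kubernetes.io/component: agent label used elsewhere in this guide (adjust the selectors to your labels):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-agents-to-registry
  namespace: mcp-mesh
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: mcp-mesh-registry
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/component: agent
      ports:
        - protocol: TCP
          port: 8000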

Issue 4: High Memory/CPU Usage

Symptoms:

  • Pods getting OOMKilled
  • Slow response times
  • Node pressure

Diagnosis:

# Check resource usage
kubectl top pods -n mcp-mesh
kubectl describe pod <pod-name> -n mcp-mesh | grep -A20 Containers

# Check for memory leaks
kubectl exec <pod-name> -n mcp-mesh -- ps aux

Solutions:

  1. Increase Resource Limits
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
  2. Enable Horizontal Pod Autoscaling (a declarative manifest follows this list)
kubectl autoscale deployment <deployment-name> \
  --min=2 --max=10 \
  --cpu-percent=70 \
  -n mcp-mesh
  3. Optimize Application
    env:
      - name: GOGC
        value: "50" # More aggressive garbage collection
      - name: GOMEMLIMIT
        value: "900MiB" # Soft memory limit
    

Issue 5: Persistent Volume Issues

Symptoms:

  • Data loss after pod restart
  • Permission denied errors
  • Disk full errors

Diagnosis:

# Check PVC status
kubectl get pvc -n mcp-mesh
kubectl describe pvc <pvc-name> -n mcp-mesh

# Check disk usage in pod
kubectl exec <pod-name> -n mcp-mesh -- df -h

Solutions:

  1. Expand PVC
# For expandable storage classes
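# First confirm the storage class allows expansion (prints "true" if supported)
kubectl get storageclass <class-name> -o jsonpath='{.allowVolumeExpansion}'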
kubectl patch pvc <pvc-name> -n mcp-mesh -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
  2. Fix Permissions
    # Add init container to fix permissions
    initContainers:
      - name: fix-permissions
        image: busybox
        command: ["sh", "-c", "chown -R 1000:1000 /data"]
        volumeMounts:
          - name: data
            mountPath: /data
    

Issue 6: Image Pull Errors

Symptoms:

Failed to pull image "mcpmesh/python-runtime:0.5": rpc error: code = Unknown desc = Error response from daemon: pull access denied

Solutions:

  1. For Minikube Local Images
# Use Minikube's Docker
eval $(minikube docker-env)
docker build -t mcp-mesh/agent:0.2 .

# Set imagePullPolicy
kubectl patch deployment <deployment-name> -n mcp-mesh -p '
{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "agent",
          "imagePullPolicy": "Never"
        }]
      }
    }
  }
}'
  2. For Private Registry
# Create pull secret
kubectl create secret docker-registry regcred \
  --docker-server=myregistry.io \
  --docker-username=user \
  --docker-password=pass \
  --docker-email=email@example.com \
  -n mcp-mesh

# Add to deployment
kubectl patch deployment <deployment-name> -n mcp-mesh -p '
{
  "spec": {
    "template": {
      "spec": {
        "imagePullSecrets": [{"name": "regcred"}]
      }
    }
  }
}'

Performance Troubleshooting

Slow Agent Startup

Diagnosis:

# Check startup time
kubectl logs <pod-name> -n mcp-mesh | grep -E "started|ready"

# Profile startup
kubectl exec <pod-name> -n mcp-mesh -- python -m cProfile -o profile.stats agent.py

Solutions:

  1. Add a startup probe with a longer failure window (see the sample probe after this list)
  2. Optimize imports and initialization
  3. Use init containers for pre-warming
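
For solution 1, a startup probe sketch that tolerates slow initialization; it assumes the agent serves a /health endpoint on port 8080 (adjust the path and port to your agent):

spec:
  containers:
    - name: agent
      startupProbe:
        httpGet:
          path: /health
          port: 8080 # adjust to your agent's HTTP port
        failureThreshold: 30 # allows 30 x 10s = 5 minutes to start
        periodSeconds: 10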

High Latency Between Agents

Diagnosis:

# Test network latency
kubectl exec -it <pod-name> -n mcp-mesh -- ping <other-pod-ip>

# Check service mesh metrics (if using Istio)
kubectl exec -it <pod-name> -c istio-proxy -n mcp-mesh -- curl localhost:15000/stats/prometheus

Solutions:

  1. Use node affinity to colocate related agents (see the affinity sketch after this list)
  2. Enable pod topology spread constraints
  3. Optimize serialization/deserialization
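
For solution 1, a podAffinity sketch that prefers scheduling an agent onto the same node as the registry (labels taken from earlier examples in this guide):

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: mcp-mesh-registry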

Debugging Tools and Commands

Essential kubectl Commands

# Get comprehensive pod info
kubectl get pod <pod-name> -n mcp-mesh -o yaml

# Watch pod status changes
kubectl get pods -n mcp-mesh -w

# Get all resources in namespace
kubectl get all -n mcp-mesh

# Describe a problematic resource (pod, deployment, service, ...)
kubectl describe <resource-type> <name> -n mcp-mesh

# Check RBAC permissions
kubectl auth can-i --list --namespace=mcp-mesh

Advanced Debugging

# Enable verbose logging
kubectl set env deployment/<deployment-name> LOG_LEVEL=DEBUG -n mcp-mesh

# Port forward for direct access
kubectl port-forward pod/<pod-name> 8080:8080 -n mcp-mesh

# Copy files from pod
kubectl cp <pod-name>:/path/to/file ./local-file -n mcp-mesh

# Run debug container
kubectl debug <pod-name> -it --image=busybox -n mcp-mesh
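
To see the crashing container's processes from the debug shell, share its process namespace with --target (assuming the container is named agent, as in the patches above):

kubectl debug <pod-name> -it --image=busybox --target=agent -n mcp-mesh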

Monitoring Commands

# Real-time resource monitoring
watch -n 2 'kubectl top pods -n mcp-mesh'

# Check cluster events
kubectl get events -n mcp-mesh --sort-by='.lastTimestamp' -w

# Check API server logs for mcp-mesh activity (audit logging, if enabled, is configured separately)
kubectl logs -n kube-system -l component=kube-apiserver | grep mcp-mesh

Recovery Procedures

Emergency Pod Recovery

#!/bin/bash
# emergency-recovery.sh

NAMESPACE=mcp-mesh

echo "Starting emergency recovery..."

# Delete stuck pods
kubectl delete pods --field-selector=status.phase=Failed -n $NAMESPACE
kubectl delete pods --field-selector=status.phase=Unknown -n $NAMESPACE

# Restart all deployments
kubectl rollout restart deployment -n $NAMESPACE

# Force delete stuck PVCs (substitute the actual claim name)
kubectl patch pvc <pvc-name> -n $NAMESPACE -p '{"metadata":{"finalizers":null}}'

# Reset failed jobs
kubectl delete jobs --field-selector=status.successful=0 -n $NAMESPACE

echo "Recovery complete. Checking status..."
kubectl get all -n $NAMESPACE

Data Recovery

# Backup registry data
kubectl exec mcp-mesh-registry-0 -n mcp-mesh -- \
  tar czf /tmp/backup.tar.gz /data

kubectl cp mcp-mesh-registry-0:/tmp/backup.tar.gz ./registry-backup.tar.gz -n mcp-mesh

# Restore registry data
kubectl cp ./registry-backup.tar.gz mcp-mesh-registry-0:/tmp/backup.tar.gz -n mcp-mesh
kubectl exec mcp-mesh-registry-0 -n mcp-mesh -- \
  tar xzf /tmp/backup.tar.gz -C /

Prevention Strategies

Resource Management

# Set up ResourceQuota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mcp-mesh-quota
  namespace: mcp-mesh
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    persistentvolumeclaims: "10"
    pods: "50"

Pod Disruption Budgets

# Ensure availability during updates
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: registry-pdb
  namespace: mcp-mesh
spec:
  minAvailable: 2
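  # note: with fewer than 3 registry replicas, minAvailable: 2 blocks all voluntary disruptions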
  selector:
    matchLabels:
      app.kubernetes.io/name: mcp-mesh-registry

Monitoring Setup

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-mesh-agents
  namespace: mcp-mesh
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: agent
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Getting Help

If these solutions don't resolve your issue:

  1. Collect Diagnostics:
kubectl cluster-info dump --namespaces=mcp-mesh > cluster-dump.txt
  2. Check MCP Mesh Logs:
kubectl logs -l app.kubernetes.io/part-of=mcp-mesh -n mcp-mesh > mcp-mesh-logs.txt
  3. Community Resources:
  • GitHub Issues: https://github.com/dhyansraj/mcp-mesh/issues
  • Kubernetes Slack: #mcp-mesh channel

💡 Tip: Always check kubectl get events -n mcp-mesh first; most issues are explained there.

📚 Reference: Kubernetes Troubleshooting Guide

🔍 Debug Mode: Set MCP_MESH_DEBUG=true in pod environment for verbose logging