Docker Deployment Troubleshooting Guide
Quick solutions to common Docker deployment issues with MCP Mesh
Overview
This guide covers common issues encountered when deploying MCP Mesh agents with Docker. Each issue includes diagnostic steps, likely root causes, and solutions.
Quick Diagnostics
Run this comprehensive diagnostic script:
#!/bin/bash
echo "MCP Mesh Docker Diagnostics"
echo "==========================="
# Check Docker daemon
echo -n "Docker daemon: "
docker version > /dev/null 2>&1 && echo "RUNNING" || echo "NOT RUNNING"
# Check Docker Compose
echo -n "Docker Compose: "
docker-compose version > /dev/null 2>&1 && echo "INSTALLED" || echo "NOT FOUND"
# Check running containers
echo -e "\nRunning containers:"
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Check networks
echo -e "\nDocker networks:"
docker network ls --filter name=mesh
# Check volumes
echo -e "\nDocker volumes:"
docker volume ls --filter name=mesh
# Check resource usage
echo -e "\nResource usage:"
docker system df
# Check container health
echo -e "\nContainer health:"
docker ps --format "table {{.Names}}\t{{.Status}}" | grep -E "(healthy|unhealthy|starting)"
# Check registry connectivity
echo -e "\nRegistry status:"
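# -f makes curl fail on HTTP errors; -e makes jq exit non-zero when there is no result
curl -sf http://localhost:8000/health 2>/dev/null | jq -er '.status' 2>/dev/null || echo "NOT ACCESSIBLE"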
Common Issues and Solutions
Issue 1: Container Fails to Start
Symptoms:
Diagnosis:
# Check logs
docker-compose logs agent
# Inspect container
docker inspect $(docker-compose ps -q agent)
# Check events
docker events --since 10m --filter container=agent
Solutions:
- Image not found:
- Port already in use:
# Find process using port
sudo lsof -i :8000
# Change port in docker-compose.yml
ports:
  - "8001:8000" # Use different host port
- Permission issues:
Issue 2: Agent Can't Connect to Registry
Symptoms:
Failed to register with registry: connection refused
Registry at http://localhost:8000 not accessible
Diagnosis:
# Test from host
curl http://localhost:8000/health
# Test from container
docker-compose exec agent curl http://registry:8000/health
# Check DNS resolution
docker-compose exec agent nslookup registry
Solutions:
- Wrong hostname:
- Network isolation:
- Startup order:
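For the startup-order case, one option is to have the agent wait for the registry before it tries to register. The following is a minimal sketch, not MCP Mesh's own mechanism: it assumes the `/health` endpoint used in the diagnostics above, and the function names `registry_is_healthy` and `wait_for_registry` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def registry_is_healthy(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        return False


def wait_for_registry(url, attempts=10, delay=1.0,
                      probe=registry_is_healthy, sleep=time.sleep):
    """Poll `url` until it is healthy.

    Returns True as soon as a probe succeeds, or False after `attempts` tries.
    The wait between tries doubles each time, capped at 30 seconds.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay)
            delay = min(delay * 2, 30.0)
    return False
```

Inside a container the agent would call something like `wait_for_registry("http://registry:8000/health")`, using the Compose service name rather than `localhost`. The `probe` and `sleep` parameters exist only so the retry logic can be exercised without a network.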
Issue 3: Database Connection Errors
Symptoms:
FATAL: password authentication failed for user "postgres"
could not connect to server: Connection refused
Solutions:
- Environment variables not set:
# Use .env file
environment:
  POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
# Or pass the variable on the command line for a one-off run
docker-compose run -e POSTGRES_PASSWORD=secret postgres
- Database not initialized:
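# Remove old volume and reinitialize (WARNING: this deletes all database data)
docker-compose down -v
# Only needed if the volume is external or otherwise survived "down -v"
docker volume rm project_postgres_data
docker-compose up -d postgres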
- Health check timing:
Issue 4: Container Keeps Restarting
Symptoms:
Diagnosis:
# Check exit code
docker-compose ps
# View recent logs
docker-compose logs --tail=50 agent
# Check restart policy
docker inspect "$(docker-compose ps -q agent)" | jq '.[0].HostConfig.RestartPolicy'
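The exit code reported by `docker-compose ps` often narrows down the cause before you read any logs. By convention, codes above 128 mean the process was terminated by a signal (128 plus the signal number), and 126 and 127 are the shell's "cannot execute" and "command not found" codes. The helper below is an illustrative sketch of that mapping; `describe_exit_code` is a hypothetical name and the signal table covers only a few common Linux signals.

```python
# Standard Linux signal numbers for the signals most often seen in containers
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code):
    """Translate a container exit code into a likely cause."""
    if code == 0:
        return "exited cleanly"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if code > 128:
        number = code - 128
        name = SIGNAL_NAMES.get(number, f"signal {number}")
        # SIGKILL is what the kernel OOM killer and `docker kill` send
        hint = " (often the OOM killer or docker kill)" if number == 9 else ""
        return f"terminated by {name}{hint}"
    return "application exited with an error"
```

An exit code of 137 therefore points toward Issue 5 (memory), while 127 points toward an entrypoint or image problem.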
Solutions:
- Application crashes:
# Temporarily disable restart
restart: "no"
# Run interactively to debug
docker-compose run --rm agent bash
- Missing environment variables:
- Entrypoint issues:
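For the missing-environment-variables case, failing fast with a clear message is usually easier to debug than a crash loop deep inside the application. A minimal sketch follows; the function name `missing_env_vars` is hypothetical, and which variables an agent actually requires depends on your configuration.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

An agent entrypoint could call `missing_env_vars(["POSTGRES_PASSWORD"])` at startup and exit with a message listing the returned names, so the cause shows up in the first lines of `docker-compose logs`.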
Issue 5: Out of Memory Errors
Symptoms:
Solutions:
- Set memory limits:
- Optimize application:
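# In agent code
import gc

def process_large_data(data_chunks, process):
    # Process one chunk at a time so only a single chunk is held in memory
    for chunk in data_chunks:
        process(chunk)
        # Force a garbage collection pass to free reference cycles promptly
        gc.collect()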
- Monitor memory usage:
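To monitor memory from inside the agent itself, the container's own cgroup files can be read directly. This is a sketch under the assumption that the host uses cgroup v2, where usage and limit appear as `memory.current` and `memory.max` under `/sys/fs/cgroup`; on cgroup v1 hosts the file names differ and the function returns `None`. The function names are hypothetical.

```python
from pathlib import Path


def read_cgroup_memory(cgroup_dir="/sys/fs/cgroup"):
    """Return (used_bytes, limit_bytes) from cgroup v2 files.

    limit_bytes is None when no limit is set ("max").
    Returns None if the files are missing or unreadable.
    """
    base = Path(cgroup_dir)
    try:
        used = int((base / "memory.current").read_text().strip())
        raw_limit = (base / "memory.max").read_text().strip()
        limit = None if raw_limit == "max" else int(raw_limit)
    except (OSError, ValueError):
        return None
    return used, limit


def memory_usage_ratio(used, limit):
    """Return used/limit as a float, or None when there is no limit."""
    if not limit:
        return None
    return used / limit
```

Logging the ratio periodically makes it easier to see whether the process approaches its limit gradually (a leak) or in a spike (one large request).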
Issue 6: Volume Permission Issues
Symptoms:
Solutions:
- Fix ownership:
# Check current ownership
docker-compose exec agent ls -la /data
# Fix from host
sudo chown -R 1000:1000 ./data
# Or use init container
services:
  init-permissions:
    image: busybox
    volumes:
      - data:/data
    command: chown -R 1000:1000 /data
- Use proper user in container:
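When it is unclear which user the agent process actually runs as, it can help to compare the process identity with the ownership of the mounted path from inside the container. The following sketch uses only the standard library; `describe_write_access` is a hypothetical helper name, and `/data` in the usage note is the mount path from the examples above.

```python
import os
import stat


def describe_write_access(path):
    """Report who owns `path`, who the current process is, and whether it can write."""
    info = os.stat(path)
    return {
        "process_uid": os.getuid(),
        "process_gid": os.getgid(),
        "owner_uid": info.st_uid,
        "owner_gid": info.st_gid,
        "mode": stat.filemode(info.st_mode),  # e.g. a string like drwxr-xr-x
        "writable": os.access(path, os.W_OK),
    }
```

Calling `describe_write_access("/data")` inside the container shows at a glance whether the UID passed to `chown` matches the UID the process runs under.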
Issue 7: Slow Container Startup
Symptoms:
- Container takes minutes to become ready
- Health checks timing out
Solutions:
- Optimize image:
# Multi-stage build: build wheels in the full image, install them in the slim one
FROM python:3.11 AS builder
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt
FROM python:3.11-slim
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl
- Adjust health check timing:
- Pre-compile Python:
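For the pre-compilation step, Python's standard `compileall` module can byte-compile a source tree ahead of time so the first import in a fresh container does not pay the compile cost. A minimal sketch, with `precompile` as a hypothetical wrapper name:

```python
import compileall


def precompile(source_dir):
    """Byte-compile every .py file under `source_dir`.

    Returns True only if all files compiled without errors.
    """
    # quiet=1 suppresses the per-file listing but still prints errors
    return bool(compileall.compile_dir(source_dir, quiet=1))
```

At image build time the same module can be invoked directly as `python -m compileall` followed by the source directory. This only removes compilation from startup; slow imports or slow network calls during initialization need to be measured separately.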
Issue 8: Network Communication Issues
Symptoms:
- Containers can't reach each other
- DNS resolution failures
- Intermittent connection errors
Solutions:
- DNS debugging:
# Test DNS from container
docker-compose exec agent nslookup registry
docker-compose exec agent ping -c 3 registry
# Check resolv.conf
docker-compose exec agent cat /etc/resolv.conf
- Network inspection:
# List networks
docker network ls
# Inspect network
docker network inspect mesh-net
# Check container networks
docker inspect "$(docker-compose ps -q agent)" | jq '.[0].NetworkSettings.Networks'
- Fix network configuration:
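Some images do not ship `nslookup` or `ping`, so it can be useful to run the same two checks from Python inside the agent container. The sketch below separates a DNS failure from a connection failure; `check_service` is a hypothetical name, and `registry` and port 8000 in the usage note are the service name and port used elsewhere in this guide.

```python
import socket


def check_service(host, port, timeout=3.0):
    """Resolve `host`, then try a TCP connection to `port`.

    Returns a string starting with "ok", "dns failure", or "connect failure".
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"dns failure: {exc}"
    address = infos[0][4][0]  # first resolved IP address
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return f"ok: {host} resolved to {address}, port {port} reachable"
    except OSError as exc:
        return f"connect failure: {host} resolved to {address}, but {exc}"
```

A "dns failure" from `check_service("registry", 8000)` suggests the containers are not on the same Docker network, while a "connect failure" suggests the registry is not listening yet or is bound to a different port.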
Issue 9: Build Failures
Symptoms:
Solutions:
- Clear build cache:
# Remove all build cache
docker builder prune -a
# Build without cache
docker-compose build --no-cache agent
- Fix package sources:
# Update package lists
RUN apt-get update && apt-get install -y ...
# Use specific package versions
RUN pip install package==1.2.3
- Handle network issues:
Issue 10: Docker Compose Version Issues
Symptoms:
Solutions:
- Check Docker Compose version:
- Use compatible syntax:
# Use version 3.8 features carefully
version: '3.8'
# Or downgrade to a widely supported version
version: '3.3'
# Note: Docker Compose V2 treats the top-level version field as obsolete and ignores it
Performance Issues
High CPU Usage
# Find CPU-hungry containers
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"
# Limit CPU usage
services:
  agent:
    deploy:
      resources:
        limits:
          cpus: '0.5'
Disk Space Issues
# Check disk usage
docker system df
# Clean up (WARNING: removes stopped containers, unused networks, all unused images, and unused volumes)
docker system prune -a --volumes
# Remove specific items
docker container prune
docker image prune
docker volume prune
docker network prune
Emergency Recovery
Complete Reset
#!/bin/bash
# emergency-reset.sh
echo "WARNING: This will delete all Docker data!"
read -p "Continue? (y/N) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
  # Stop all containers
  docker-compose down
  # Remove all containers
  docker rm -f $(docker ps -aq) 2>/dev/null
  # Remove all images
  docker rmi -f $(docker images -q) 2>/dev/null
  # Remove all volumes
  docker volume rm $(docker volume ls -q) 2>/dev/null
  # Remove all networks
  docker network rm $(docker network ls -q) 2>/dev/null
  # Restart Docker
  sudo systemctl restart docker
  echo "Docker reset complete"
fi
Backup Before Troubleshooting
#!/bin/bash
# backup-docker-state.sh
BACKUP_DIR="docker-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Export compose configuration
docker-compose config > "$BACKUP_DIR/docker-compose.resolved.yml"
# Save running container state
docker ps -a > "$BACKUP_DIR/containers.txt"
# Export volumes
for volume in $(docker volume ls -q); do
  # Mount the volume read-only and archive its contents relative to /data
  docker run --rm -v "$volume":/data:ro -v "$(pwd)/$BACKUP_DIR":/backup \
    busybox tar czf "/backup/$volume.tar.gz" -C /data .
done
echo "Backup saved to $BACKUP_DIR"
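Before running a destructive reset, it is worth confirming that the volume archives in the backup directory can actually be read. The following is a sketch of one way to do that with Python's standard `tarfile` module; `check_backup_archives` is a hypothetical name, and it assumes the archives are the `.tar.gz` files written by the backup script above.

```python
import tarfile
from pathlib import Path


def check_backup_archives(backup_dir):
    """Map each *.tar.gz file in `backup_dir` to its member count.

    An archive that cannot be read is mapped to None.
    """
    results = {}
    for archive in sorted(Path(backup_dir).glob("*.tar.gz")):
        try:
            with tarfile.open(archive, "r:gz") as tar:
                results[archive.name] = len(tar.getmembers())
        except (tarfile.TarError, OSError, EOFError):
            # Corrupt, truncated, or not a gzip-compressed tar file
            results[archive.name] = None
    return results
```

Any entry mapped to `None` means that volume should be backed up again before continuing. A readable archive with very few members may also deserve a second look, since an empty volume and a failed mount look the same in the archive.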
Getting Help
If these solutions don't resolve your issue:
- Collect diagnostic information:
docker-compose logs > docker-logs.txt
docker-compose ps > docker-status.txt
docker-compose config > docker-config.txt
docker version > docker-version.txt
- Check GitHub issues:
  - https://github.com/dhyansraj/mcp-mesh/issues
- Community support:
  - MCP Discord: https://discord.gg/mcp
  - Stack Overflow: Tag with mcp-mesh and docker
💡 Tip: Always test solutions in a development environment first
📚 Reference: Docker Troubleshooting Guide
🔍 Debug Mode: Set COMPOSE_DEBUG=true for verbose Docker Compose output