Monitoring
Version: 2.5.0 | Last Updated: 2025-10-14 | Status: Production Ready
Table of Contents
- Overview
- Architecture
- Quick Start
- Metrics Reference
- Dashboards
- Alerting
- Logging
- Health Checks
- Troubleshooting
- Production Deployment
Overview
The AINative Studio monitoring infrastructure provides comprehensive observability for all operations, covering performance metrics, error tracking, and system health.
Key Features
- Real-time Metrics: Prometheus-based metrics collection with 15s granularity
- Rich Dashboards: 4 pre-built Grafana dashboards for different views
- Smart Alerting: Multi-tier alerting (Critical/Warning/Info) with PagerDuty, Slack, and Email
- Structured Logging: JSON-formatted logs with context propagation
- Health Monitoring: Comprehensive health checks for all components
- Performance Tracking: P50/P95/P99 latency tracking per operation
Monitoring Stack
| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection & storage | 9090 |
| Grafana | Visualization & dashboards | 3000 |
| AlertManager | Alert routing & notification | 9093 |
| Node Exporter | System metrics | 9100 |
| Postgres Exporter | Database metrics | 9187 |
| Redis Exporter | Cache metrics | 9121 |
| cAdvisor | Container metrics | 8080 |
| Loki | Log aggregation (optional) | 3100 |
Architecture
┌─────────────────────────────────────────────────────────────┐
│ AINative Studio API │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /mcp/metrics │ │ /mcp/health │ │
│ │ (Prometheus) │ │ (Health Check) │ │
│ └────────┬────────┘ └────────┬────────┘ │
└───────────┼────────────────────┼─────────────────────────────┘
│ │
│ scrape (15s) │ poll (10s)
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Prometheus │ │ Monitoring │
│ Metrics │ │ Scripts │
└───────┬───────┘ └───────────────┘
│
│ evaluate rules (30s)
▼
┌───────────────┐
│ AlertManager │
└───────┬───────┘
│
├─── Critical ──→ PagerDuty + Slack
├─── Warning ───→ Slack
└─── Info ──────→ Email
┌───────────────┐
│ Grafana │ ←── Query ─── Prometheus
│ Dashboards │ ←── Query ─── Loki (logs)
└───────────────┘
Quick Start
1. Start Monitoring Stack
cd /Users/aideveloper/core/monitoring
docker-compose up -d
2. Verify Services
# Check all services are running
docker-compose ps
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Check Grafana
open http://localhost:3000
# Login: admin / admin
3. Configure AINative Studio
Add to /Users/aideveloper/core/src/backend/app/main.py:
from prometheus_client import make_asgi_app
# Mount Prometheus metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
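Once the endpoint is mounted, the raw exposition output can be spot-checked with curl http://localhost:8000/metrics (assuming the API listens on port 8000, as in the health check step below).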
4. Test Health Endpoint
curl http://localhost:8000/api/v1/mcp/health
5. Access Dashboards
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- AlertManager: http://localhost:9093
Metrics Reference
Operation Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_operations_total | Counter | operation, status | Total operations by type and outcome |
| mcp_operation_duration_seconds | Histogram | operation | Operation duration distribution |
| mcp_operation_latency_seconds | Summary | operation | Operation latency summary (P50/P95/P99) |
Error Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_errors_total | Counter | error_category, operation | Errors by category and operation |
| mcp_rate_limit_hits_total | Counter | user_tier, operation | Rate limit violations |
Resource Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_active_connections | Gauge | - | Active database connections |
| mcp_storage_bytes | Gauge | storage_type | Storage usage in bytes |
| mcp_vector_dimensions | Gauge | namespace | Vector dimensions being processed |
Category-Specific Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_vector_operations_total | Counter | operation_type | Vector operations count |
| mcp_quantum_operations_total | Counter | operation_type | Quantum operations count |
| mcp_file_operations_total | Counter | operation_type | File operations count |
| mcp_quantum_compression_ratio | Histogram | - | Quantum compression efficiency |
Cache Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_cache_hits_total | Counter | cache_type | Cache hits |
| mcp_cache_misses_total | Counter | cache_type | Cache misses |
Database Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_database_query_duration_seconds | Histogram | query_type | Database query duration |
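These metrics are registered in the application with prometheus_client. As a hedged sketch of how a few of them could be defined and recorded (names and label sets follow the tables above; the track_operation helper is illustrative, not part of the actual codebase):
from time import perf_counter
from contextlib import contextmanager
from prometheus_client import Counter, Histogram, Gauge

# Names and label sets follow the Metrics Reference tables above.
MCP_OPERATIONS_TOTAL = Counter(
    "mcp_operations_total", "Total operations by type and outcome",
    ["operation", "status"],
)
MCP_OPERATION_DURATION = Histogram(
    "mcp_operation_duration_seconds", "Operation duration distribution",
    ["operation"],
)
MCP_ACTIVE_CONNECTIONS = Gauge(
    "mcp_active_connections", "Active database connections",
)

@contextmanager
def track_operation(operation: str):
    """Illustrative helper: time an operation and record its outcome."""
    start = perf_counter()
    try:
        yield
        MCP_OPERATIONS_TOTAL.labels(operation=operation, status="success").inc()
    except Exception:
        MCP_OPERATIONS_TOTAL.labels(operation=operation, status="error").inc()
        raise
    finally:
        MCP_OPERATION_DURATION.labels(operation=operation).observe(perf_counter() - start)

# Usage:
# with track_operation("upsert_vector"):
#     ...perform the vector upsert...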
Dashboards
1. MCP Overview Dashboard (mcp-overview)
Purpose: High-level system health and performance
Panels:
- Total Operations per Second (graph)
- Success Rate (stat - green/yellow/red thresholds)
- Error Rate (stat)
- Active Database Connections (gauge)
- Operation Latency P50/P95/P99 (graph)
- System CPU & Memory Usage
- Error Breakdown by Category (pie chart)
- Rate Limit Hits
- Cache Hit Rate
Use Cases:
- Daily operations monitoring
- Quick health assessment
- Incident detection
2. MCP Operations Dashboard (mcp-operations)
Purpose: Deep dive into operation-level metrics
Panels:
- Vector Operations Rate
- Quantum Operations Rate
- File Operations Rate
- Top 10 Slowest Operations (table)
- Operations by Status (timeseries)
- Database Query Duration by Type
- Operation Throughput
- Operation Duration Distribution (heatmap)
Use Cases:
- Performance optimization
- Operation troubleshooting
- Capacity planning
3. MCP Performance Dashboard (mcp-performance)
Purpose: Performance analysis and SLO tracking
Panels:
- Response Time Distribution (P50/P75/P90/P95/P99)
- Throughput (req/s)
- P99 Latency by Operation
- Database Connection Pool Utilization
- Storage Usage
- Quantum Compression Ratio
- Vector Dimensions Distribution
Alerts:
- P99 Latency > 1s
- Connection Pool > 85%
Use Cases:
- SLO monitoring
- Performance regression detection
- Resource optimization
4. MCP Errors Dashboard (mcp-errors)
Purpose: Error tracking and troubleshooting
Panels:
- Error Rate Over Time
- Current Error Rate (stat)
- Total Errors (24h)
- Error Rate Percentage (gauge)
- Errors by Category (pie chart)
- Errors by Operation (pie chart)
- Top 10 Error-Prone Operations (table)
- Error Trend (24h)
- Rate Limit Violations
- Failed vs Successful Operations
- Database/Validation/Timeout/Auth Errors (stats)
Alerts:
- Error rate > 0.5 errors/sec
Use Cases:
- Incident response
- Error pattern analysis
- Debugging
Alerting
Alert Tiers
Critical (PagerDuty + Slack)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| HighErrorRate | Error rate > 5% | 5 minutes | Immediate page |
| HighLatency | P99 > 1s | 10 minutes | Immediate page |
| DatabaseConnectionPoolExhausted | Connections > 95 | 2 minutes | Immediate page |
| MCPBridgeDown | Service unreachable | 2 minutes | Immediate page |
| DatabaseQueryTimeout | Timeouts > 10 | 5 minutes | Immediate page |
Warning (Slack)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| ElevatedErrorRate | Error rate > 1% | 10 minutes | Investigate |
| IncreasedLatency | P95 > 500ms | 15 minutes | Monitor |
| HighRateLimitHits | Rate limit hits > 50% | 15 minutes | Review quotas |
| HighMemoryUsage | Memory > 80% | 10 minutes | Check for leaks |
| HighCPUUsage | CPU > 80% | 10 minutes | Investigate |
| DatabaseConnectionsHigh | Connections > 80 | 10 minutes | Monitor |
Info (Email)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| HighThroughput | > 1000 ops/s | 5 minutes | Capacity planning |
| StorageGrowth | Growth > 10%/hour | 1 hour | Monitor |
| QuantumOperationsIncreasing | > 100 ops/s | 30 minutes | Note trend |
Alert Routing
Critical Alert Flow:
Alert Fires → Prometheus → AlertManager
→ PagerDuty (SMS/Phone)
→ Slack #mcp-alerts-critical
→ Include runbook URL
Warning Alert Flow:
Alert Fires → Prometheus → AlertManager
→ Slack #mcp-alerts-warning
→ Include dashboard URL
Info Alert Flow:
Alert Fires → Prometheus → AlertManager
→ Email to mcp-team@example.com
→ Daily digest format
Runbooks
Each critical alert includes a runbook URL. Create runbooks at:
/Users/aideveloper/core/docs/runbooks/
├── mcp-high-error-rate.md
├── mcp-high-latency.md
├── db-connection-pool.md
├── mcp-service-down.md
└── db-timeout.md
Runbook Template:
# Alert: [Alert Name]
## Severity
[Critical/Warning/Info]
## Description
[What this alert means]
## Impact
[User impact and business impact]
## Diagnosis
1. Check [specific dashboard]
2. Review logs: `tail -f /var/log/mcp_bridge/mcp_error.log`
3. Query metrics: [example PromQL query]
## Resolution
1. [Step-by-step resolution]
2. [Include rollback if needed]
## Escalation
If not resolved in [time], escalate to [team/person]
Logging
Log Levels
| Level | Purpose | Destination | Retention |
|---|---|---|---|
| DEBUG | Performance traces | performance.log | 5 days |
| INFO | Operation logs | mcp.log | 30 days |
| WARNING | Degraded performance | mcp.log + syslog | 30 days |
| ERROR | Operation failures | mcp_error.log | 90 days |
| CRITICAL | System failures | mcp_error.log + syslog | 90 days |
Log Structure
All logs use JSON format for machine parsing:
{
"timestamp": "2025-10-14T12:00:00.123Z",
"level": "INFO",
"logger": "mcp_bridge",
"operation": "upsert_vector",
"operation_id": "op_abc123",
"user_id": "user_123",
"project_id": "proj_456",
"duration_ms": 45.2,
"status": "success",
"message": "Vector upserted successfully"
}
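One way to emit records in this shape with the standard library alone (a hedged sketch; the real formatter lives in the application's logging setup, and the extra-field plumbing shown here is illustrative):
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    """Render log records as single-line JSON matching the structure above."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context fields (operation, operation_id, user_id, ...) are passed via `extra=`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("mcp_bridge")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Vector upserted successfully",
    extra={"context": {"operation": "upsert_vector", "duration_ms": 45.2, "status": "success"}},
)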
Log Files
/var/log/mcp_bridge/
├── mcp.log # All INFO+ logs (100MB, 10 files)
├── mcp_error.log # ERROR+ logs (50MB, 10 files)
├── audit.log # Audit trail (daily rotation, 90 days)
└── performance.log # Performance traces (100MB, 5 files)
Querying Logs
Find errors in last hour:
grep -E '"level":"ERROR"' /var/log/mcp_bridge/mcp_error.log | \
jq 'select(.timestamp > "'$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)'")'
Find slow operations (> 1s):
jq 'select(.duration_ms > 1000)' /var/log/mcp_bridge/performance.log
Count errors by operation:
grep -E '"level":"ERROR"' /var/log/mcp_bridge/mcp_error.log | \
jq -r '.operation' | sort | uniq -c | sort -rn
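The same error-by-operation count can be produced without jq, for example with a short Python one-off (a sketch assuming the log path above):
import json
from collections import Counter

counts = Counter()
with open("/var/log/mcp_bridge/mcp_error.log") as fh:
    for line in fh:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or non-JSON lines
        if record.get("level") == "ERROR":
            counts[record.get("operation", "unknown")] += 1

for operation, count in counts.most_common():
    print(f"{count:6d}  {operation}")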
Log Aggregation (Production)
For production, ship logs to centralized logging:
- Loki (included in docker-compose):
  - Grafana-native log aggregation
  - Query alongside metrics
- ELK Stack (alternative):
  - Configure logstash output in logging.yml
  - Ship to Elasticsearch
- CloudWatch Logs (AWS):
  - Use CloudWatch agent
  - Set retention policies
Health Checks
Endpoint: GET /api/v1/mcp/health
Response (Healthy):
{
"status": "healthy",
"timestamp": "2025-10-14T12:00:00Z",
"version": "2.5.0",
"service": "mcp_bridge",
"components": {
"database": {
"status": "healthy",
"type": "postgresql"
},
"rate_limiter": {
"status": "healthy",
"active_windows": 42
},
"services": {
"status": "healthy",
"operations_available": 60
}
},
"system": {
"cpu_percent": 23.4,
"memory_percent": 45.2,
"disk_percent": 67.8
}
}
Response (Degraded):
{
"status": "degraded",
"timestamp": "2025-10-14T12:00:00Z",
"version": "2.5.0",
"service": "mcp_bridge",
"components": {
"database": {
"status": "unhealthy",
"error": "Connection timeout"
},
"rate_limiter": {
"status": "healthy",
"active_windows": 42
}
}
}
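For reference, a minimal sketch of how such an endpoint could be implemented with FastAPI and psutil (the field names mirror the responses above; check_database and the omitted rate limiter/service checks are stand-ins for the real implementations):
from datetime import datetime, timezone
import psutil
from fastapi import APIRouter

router = APIRouter()

async def check_database() -> dict:
    """Stand-in for a real connectivity check (e.g. SELECT 1 against PostgreSQL)."""
    try:
        # await session.execute(text("SELECT 1"))
        return {"status": "healthy", "type": "postgresql"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}

@router.get("/api/v1/mcp/health")
async def health():
    components = {
        "database": await check_database(),
        # rate_limiter and services checks would be added the same way
    }
    degraded = any(c["status"] != "healthy" for c in components.values())
    return {
        "status": "degraded" if degraded else "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "version": "2.5.0",
        "service": "mcp_bridge",
        "components": components,
        "system": {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
        },
    }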
Health Check Monitoring
Add to monitoring:
# Check every 30 seconds
watch -n 30 'curl -s http://localhost:8000/api/v1/mcp/health | jq .status'
In Prometheus (caveat: Prometheus scrapes expect the exposition format, not JSON, so this endpoint is better probed via the Blackbox exporter or an external polling script such as the one sketched below):
scrape_configs:
  - job_name: 'mcp_health'
    metrics_path: '/api/v1/mcp/health'
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:8000']  # API host/port from the Quick Start
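The "Monitoring Scripts" box in the architecture diagram can be as simple as a polling loop; a hedged sketch (endpoint and interval as above, the alerting hook is left to you):
import sys
import time
import requests

HEALTH_URL = "http://localhost:8000/api/v1/mcp/health"

def poll(interval_seconds: int = 10) -> None:
    while True:
        try:
            response = requests.get(HEALTH_URL, timeout=5)
            status = response.json().get("status", "unknown")
        except requests.RequestException as exc:
            status = f"unreachable ({exc})"
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} mcp_bridge status: {status}")
        if status != "healthy":
            # Hook in paging/Slack here, or exit non-zero for cron-based checks.
            sys.stderr.write(f"health degraded: {status}\n")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    poll()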
Troubleshooting
High Error Rate
Symptoms: mcp_errors_total increasing rapidly
Diagnosis:
- Check error dashboard: http://localhost:3000/d/mcp-errors
- Review error logs:
  tail -f /var/log/mcp_bridge/mcp_error.log
- Check error categories:
  sum by (error_category) (rate(mcp_errors_total[5m]))
Common Causes:
- Database connection issues → Check pg_stat_activity
- Validation errors → Review recent API changes
- Timeout errors → Check database query performance
- Auth errors → Verify token/API key configuration
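The category breakdown from the PromQL above can also be pulled programmatically over the Prometheus HTTP API, for use in runbooks or chat tooling (a sketch assuming Prometheus on localhost:9090):
import requests

PROMETHEUS = "http://localhost:9090"
QUERY = "sum by (error_category) (rate(mcp_errors_total[5m]))"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    category = series["metric"].get("error_category", "unknown")
    rate = float(series["value"][1])
    print(f"{category:20s} {rate:.3f} errors/sec")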
High Latency
Symptoms: P99 latency > 1s
Diagnosis:
- Check performance dashboard: http://localhost:3000/d/mcp-performance
- Identify slow operations:
  topk(10, histogram_quantile(0.99, rate(mcp_operation_duration_seconds_bucket[5m])))
- Check database query performance:
  SELECT query, mean_exec_time, calls
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC LIMIT 10;
Common Causes:
- Database queries without indexes
- Large vector operations
- Connection pool exhaustion
- Quantum operations timeout
Memory Leak
Symptoms: Memory usage steadily increasing
Diagnosis:
- Check memory trend over 24h
- Review connection pool metrics
- Check for unclosed database connections:
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction';
Resolution:
- Restart service (immediate)
- Review code for connection leaks
- Enable connection pool debugging
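If the service uses SQLAlchemy for its database layer (an assumption), pool debugging can be enabled by raising the pool logger's level, which traces every checkout/checkin and helps confirm a leak before changing code:
import logging

logging.basicConfig(level=logging.INFO)
# Log every connection checkout/checkin from SQLAlchemy's pool (verbose; enable temporarily).
logging.getLogger("sqlalchemy.pool").setLevel(logging.DEBUG)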
Database Connection Pool Exhausted
Symptoms: mcp_active_connections > 95
Immediate Action:
# Restart application to reset connections
docker-compose restart mcp_bridge
# Or kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < now() - interval '5 minutes';
Root Cause Analysis:
- Check slow queries blocking connections
- Review connection pool configuration
- Check for connection leaks in code
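When reviewing the pool configuration (again assuming SQLAlchemy's async engine manages connections; the URL and values below are placeholders), these are the relevant knobs; pool_pre_ping and a bounded overflow keep stale or leaked connections from exhausting PostgreSQL's limit:
from sqlalchemy.ext.asyncio import create_async_engine

# Illustrative values; size the pool below PostgreSQL's max_connections,
# leaving headroom for migrations, exporters, and manual psql sessions.
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@localhost/ainative",  # placeholder DSN
    pool_size=20,          # steady-state connections held open
    max_overflow=10,       # extra connections allowed under burst load
    pool_timeout=30,       # seconds to wait for a free connection before erroring
    pool_recycle=1800,     # recycle connections older than 30 minutes
    pool_pre_ping=True,    # validate connections before use, dropping dead ones
)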
Production Deployment
Pre-Deployment Checklist
- Configure AlertManager with real PagerDuty/Slack credentials
- Set up persistent volumes for Prometheus data
- Configure log rotation policies
- Set up SSL/TLS for Grafana
- Create admin passwords (not default admin/admin)
- Configure backup strategy for metrics data
- Set up monitoring for the monitoring stack itself
- Document escalation procedures
- Create runbooks for all critical alerts
- Test alert routing end-to-end
- Configure network policies and firewall rules
- Set up authentication for Prometheus/Grafana
Production Configuration
1. Update AlertManager (/Users/aideveloper/core/monitoring/alertmanager.yml):
global:
slack_api_url: 'https://hooks.slack.com/services/YOUR/REAL/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
receivers:
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_REAL_PAGERDUTY_KEY'
2. Secure Grafana:
# docker-compose.yml
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
- GF_SERVER_ROOT_URL=https://grafana.yourdomain.com
- GF_SERVER_CERT_FILE=/etc/grafana/grafana.crt
- GF_SERVER_CERT_KEY=/etc/grafana/grafana.key
3. Configure Retention (retention is controlled by Prometheus startup flags rather than entries in prometheus.yml, e.g. in the container's command):
# docker-compose.yml (prometheus service)
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.retention.time=90d'    # Keep 90 days of data
  - '--storage.tsdb.retention.size=50GB'   # Or cap at 50GB, whichever limit is hit first
4. Set Up Remote Write (for long-term storage):
remote_write:
- url: "https://prometheus-remote-write.yourdomain.com/api/v1/write"
queue_config:
max_samples_per_send: 10000
Kubernetes Deployment
For Kubernetes, use Prometheus Operator:
# Install Prometheus Operator
helm install prometheus prometheus-community/kube-prometheus-stack
# Apply custom ServiceMonitor
kubectl apply -f k8s/servicemonitor-mcp-bridge.yaml
ServiceMonitor Example (k8s/servicemonitor-mcp-bridge.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: mcp-bridge
spec:
selector:
matchLabels:
app: mcp-bridge
endpoints:
- port: metrics
interval: 15s
High Availability
For HA monitoring:
- Prometheus: Run 2+ instances with Thanos for deduplication
- Grafana: Run behind load balancer with shared database
- AlertManager: Run 3+ instances in cluster mode
Best Practices
1. Metric Naming
✅ Good:
- mcp_operation_duration_seconds
- mcp_errors_total
- mcp_cache_hits_total
❌ Bad:
- operation_time (missing unit)
- errors (not descriptive)
- cache_hits (should be a counter with _total)
2. Alert Tuning
- Start conservative: Set thresholds higher, lower gradually
- Use percentiles: P95/P99 instead of max/avg
- Group alerts: Use for/group_wait to avoid alert storms
- Test alerts: Use amtool to test alert routing
3. Dashboard Design
- One dashboard, one purpose: Don't mix overview with deep-dive
- Use variables: Allow filtering by project/user/operation
- Show trends: Include historical data for context
- Link dashboards: Cross-link related dashboards
4. Log Management
- Structure logs: Always use JSON for production
- Add context: Include operation_id, user_id, etc.
- Sanitize sensitive data: Redact passwords, tokens
- Set retention: Balance cost vs compliance requirements
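For the "sanitize sensitive data" point, a logging filter is one place to enforce redaction centrally (a sketch; the key list is an assumption and should match whatever your handlers actually emit, e.g. the context dict used in the Log Structure example):
import logging

SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}

class RedactSensitiveData(logging.Filter):
    """Replace sensitive values in the structured context before records are emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            for key in context:
                if key.lower() in SENSITIVE_KEYS:
                    context[key] = "[REDACTED]"
        return True  # never drop the record, only scrub it

logging.getLogger("mcp_bridge").addFilter(RedactSensitiveData())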
Support
Resources
- Documentation: /Users/aideveloper/core/docs/
- Runbooks: /Users/aideveloper/core/docs/runbooks/
- Dashboards: http://localhost:3000
- Metrics: http://localhost:9090
Contact
- Critical Issues: PagerDuty will automatically page on-call engineer
- Questions: #mcp-monitoring Slack channel
- Feature Requests: Create a GitHub issue with the label monitoring
Changelog
Version 2.5.0 (2025-10-14)
- Initial comprehensive monitoring setup
- Added 4 Grafana dashboards
- Configured multi-tier alerting
- Implemented structured logging
- Added health check endpoint
- Created Prometheus metrics
END OF MONITORING GUIDE