
Monitoring

Version: 2.5.0 | Last Updated: 2025-10-14 | Status: Production Ready


Table of Contents

  1. Overview
  2. Architecture
  3. Quick Start
  4. Metrics Reference
  5. Dashboards
  6. Alerting
  7. Logging
  8. Health Checks
  9. Troubleshooting
  10. Production Deployment

Overview

The AINative Studio monitoring infrastructure provides comprehensive observability for all operations, performance metrics, error tracking, and system health monitoring.

Key Features

  • Real-time Metrics: Prometheus-based metrics collection with 15s granularity
  • Rich Dashboards: 4 pre-built Grafana dashboards for different views
  • Smart Alerting: Multi-tier alerting (Critical/Warning/Info) with PagerDuty, Slack, and Email
  • Structured Logging: JSON-formatted logs with context propagation
  • Health Monitoring: Comprehensive health checks for all components
  • Performance Tracking: P50/P95/P99 latency tracking per operation

Monitoring Stack

| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection & storage | 9090 |
| Grafana | Visualization & dashboards | 3000 |
| AlertManager | Alert routing & notification | 9093 |
| Node Exporter | System metrics | 9100 |
| Postgres Exporter | Database metrics | 9187 |
| Redis Exporter | Cache metrics | 9121 |
| cAdvisor | Container metrics | 8080 |
| Loki | Log aggregation (optional) | 3100 |

Architecture

┌────────────────────────────────────────────────────┐
│                AINative Studio API                  │
│  ┌─────────────────┐        ┌─────────────────┐     │
│  │  /mcp/metrics   │        │   /mcp/health   │     │
│  │  (Prometheus)   │        │ (Health Check)  │     │
│  └────────┬────────┘        └────────┬────────┘     │
└───────────┼──────────────────────────┼──────────────┘
            │ scrape (15s)             │ poll (10s)
            ▼                          ▼
    ┌───────────────┐          ┌───────────────┐
    │  Prometheus   │          │  Monitoring   │
    │   Metrics     │          │   Scripts     │
    └───────┬───────┘          └───────────────┘
            │ evaluate rules (30s)
            ▼
    ┌───────────────┐
    │ AlertManager  │
    └───────┬───────┘
            │
            ├─── Critical ──→ PagerDuty + Slack
            ├─── Warning ───→ Slack
            └─── Info ──────→ Email

    ┌───────────────┐
    │    Grafana    │ ←── Query ─── Prometheus
    │  Dashboards   │ ←── Query ─── Loki (logs)
    └───────────────┘

Quick Start

1. Start Monitoring Stack

cd /Users/aideveloper/core/monitoring
docker-compose up -d

2. Verify Services

# Check all services are running
docker-compose ps

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check Grafana
open http://localhost:3000
# Login: admin / admin

3. Configure AINative Studio

Add to /Users/aideveloper/core/src/backend/app/main.py:

from prometheus_client import make_asgi_app

# Mount Prometheus metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

4. Test Health Endpoint

curl http://localhost:8000/api/v1/mcp/health

5. Access Dashboards

Open Grafana at http://localhost:3000 and browse the four pre-built MCP dashboards: mcp-overview, mcp-operations, mcp-performance, and mcp-errors (described in the Dashboards section below).

Metrics Reference

Operation Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_operations_total | Counter | operation, status | Total operations by type and outcome |
| mcp_operation_duration_seconds | Histogram | operation | Operation duration distribution |
| mcp_operation_latency_seconds | Summary | operation | Operation latency summary (P50/P95/P99) |

Error Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_errors_total | Counter | error_category, operation | Errors by category and operation |
| mcp_rate_limit_hits_total | Counter | user_tier, operation | Rate limit violations |

Resource Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_active_connections | Gauge | - | Active database connections |
| mcp_storage_bytes | Gauge | storage_type | Storage usage in bytes |
| mcp_vector_dimensions | Gauge | namespace | Vector dimensions being processed |

Category-Specific Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_vector_operations_total | Counter | operation_type | Vector operations count |
| mcp_quantum_operations_total | Counter | operation_type | Quantum operations count |
| mcp_file_operations_total | Counter | operation_type | File operations count |
| mcp_quantum_compression_ratio | Histogram | - | Quantum compression efficiency |

Cache Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_cache_hits_total | Counter | cache_type | Cache hits |
| mcp_cache_misses_total | Counter | cache_type | Cache misses |

Database Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_database_query_duration_seconds | Histogram | query_type | Database query duration |
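
These metrics are collected with prometheus_client. The snippet below is a minimal sketch of how a counter and histogram of this shape can be defined and recorded around an operation; the `record_operation` helper is illustrative, not the actual bridge implementation.

from time import perf_counter

from prometheus_client import Counter, Histogram

# Metric definitions matching the names and labels listed above
MCP_OPERATIONS_TOTAL = Counter(
    "mcp_operations_total", "Total operations by type and outcome",
    ["operation", "status"],
)
MCP_OPERATION_DURATION = Histogram(
    "mcp_operation_duration_seconds", "Operation duration distribution",
    ["operation"],
)

def record_operation(operation, func, *args, **kwargs):
    """Run func, timing it and counting success/error for the given operation."""
    start = perf_counter()
    try:
        result = func(*args, **kwargs)
        MCP_OPERATIONS_TOTAL.labels(operation=operation, status="success").inc()
        return result
    except Exception:
        MCP_OPERATIONS_TOTAL.labels(operation=operation, status="error").inc()
        raise
    finally:
        MCP_OPERATION_DURATION.labels(operation=operation).observe(perf_counter() - start)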

Dashboards

1. MCP Overview Dashboard (mcp-overview)

Purpose: High-level system health and performance

Panels:

  • Total Operations per Second (graph)
  • Success Rate (stat - green/yellow/red thresholds)
  • Error Rate (stat)
  • Active Database Connections (gauge)
  • Operation Latency P50/P95/P99 (graph)
  • System CPU & Memory Usage
  • Error Breakdown by Category (pie chart)
  • Rate Limit Hits
  • Cache Hit Rate

Use Cases:

  • Daily operations monitoring
  • Quick health assessment
  • Incident detection

2. MCP Operations Dashboard (mcp-operations)

Purpose: Deep dive into operation-level metrics

Panels:

  • Vector Operations Rate
  • Quantum Operations Rate
  • File Operations Rate
  • Top 10 Slowest Operations (table)
  • Operations by Status (timeseries)
  • Database Query Duration by Type
  • Operation Throughput
  • Operation Duration Distribution (heatmap)

Use Cases:

  • Performance optimization
  • Operation troubleshooting
  • Capacity planning

3. MCP Performance Dashboard (mcp-performance)

Purpose: Performance analysis and SLO tracking

Panels:

  • Response Time Distribution (P50/P75/P90/P95/P99)
  • Throughput (req/s)
  • P99 Latency by Operation
  • Database Connection Pool Utilization
  • Storage Usage
  • Quantum Compression Ratio
  • Vector Dimensions Distribution

Alerts:

  • P99 Latency > 1s
  • Connection Pool > 85%

Use Cases:

  • SLO monitoring
  • Performance regression detection
  • Resource optimization

4. MCP Errors Dashboard (mcp-errors)

Purpose: Error tracking and troubleshooting

Panels:

  • Error Rate Over Time
  • Current Error Rate (stat)
  • Total Errors (24h)
  • Error Rate Percentage (gauge)
  • Errors by Category (pie chart)
  • Errors by Operation (pie chart)
  • Top 10 Error-Prone Operations (table)
  • Error Trend (24h)
  • Rate Limit Violations
  • Failed vs Successful Operations
  • Database/Validation/Timeout/Auth Errors (stats)

Alerts:

  • Error rate > 0.5 errors/sec

Use Cases:

  • Incident response
  • Error pattern analysis
  • Debugging

Alerting

Alert Tiers

Critical (PagerDuty + Slack)

| Alert | Condition | Duration | Action |
|---|---|---|---|
| HighErrorRate | Error rate > 5% | 5 minutes | Immediate page |
| HighLatency | P99 > 1s | 10 minutes | Immediate page |
| DatabaseConnectionPoolExhausted | Connections > 95 | 2 minutes | Immediate page |
| MCPBridgeDown | Service unreachable | 2 minutes | Immediate page |
| DatabaseQueryTimeout | Timeouts > 10 | 5 minutes | Immediate page |

Warning (Slack)

| Alert | Condition | Duration | Action |
|---|---|---|---|
| ElevatedErrorRate | Error rate > 1% | 10 minutes | Investigate |
| IncreasedLatency | P95 > 500ms | 15 minutes | Monitor |
| HighRateLimitHits | Rate limit hits > 50% | 15 minutes | Review quotas |
| HighMemoryUsage | Memory > 80% | 10 minutes | Check for leaks |
| HighCPUUsage | CPU > 80% | 10 minutes | Investigate |
| DatabaseConnectionsHigh | Connections > 80 | 10 minutes | Monitor |

Info (Email)

| Alert | Condition | Duration | Action |
|---|---|---|---|
| HighThroughput | > 1000 ops/s | 5 minutes | Capacity planning |
| StorageGrowth | Growth > 10%/hour | 1 hour | Monitor |
| QuantumOperationsIncreasing | > 100 ops/s | 30 minutes | Note trend |

Alert Routing

Critical Alert Flow:
Alert Fires → Prometheus → AlertManager
→ PagerDuty (SMS/Phone)
→ Slack #mcp-alerts-critical
→ Include runbook URL

Warning Alert Flow:
Alert Fires → Prometheus → AlertManager
→ Slack #mcp-alerts-warning
→ Include dashboard URL

Info Alert Flow:
Alert Fires → Prometheus → AlertManager
→ Email to mcp-team@example.com
→ Daily digest format
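
To verify routing end to end without waiting for a real incident, a synthetic alert can be posted directly to AlertManager's v2 API. A minimal sketch, assuming AlertManager on localhost:9093 and only the Python standard library; the label values are placeholders chosen to exercise the critical route, not the production label set.

import json
import urllib.request
from datetime import datetime, timedelta, timezone

def fire_test_alert(alertmanager_url="http://localhost:9093"):
    """Post a short-lived synthetic alert to exercise AlertManager routing."""
    now = datetime.now(timezone.utc)
    alerts = [{
        "labels": {"alertname": "RoutingTest", "severity": "critical", "service": "mcp_bridge"},
        "annotations": {"summary": "Synthetic alert to verify critical routing"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]
    req = urllib.request.Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print("AlertManager responded:", resp.status)

if __name__ == "__main__":
    fire_test_alert()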

Runbooks

Each critical alert includes a runbook URL. Create runbooks at:

/Users/aideveloper/core/docs/runbooks/
├── mcp-high-error-rate.md
├── mcp-high-latency.md
├── db-connection-pool.md
├── mcp-service-down.md
└── db-timeout.md

Runbook Template:

# Alert: [Alert Name]

## Severity
[Critical/Warning/Info]

## Description
[What this alert means]

## Impact
[User impact and business impact]

## Diagnosis
1. Check [specific dashboard]
2. Review logs: `tail -f /var/log/mcp_bridge/mcp_error.log`
3. Query metrics: [example PromQL query]

## Resolution
1. [Step-by-step resolution]
2. [Include rollback if needed]

## Escalation
If not resolved in [time], escalate to [team/person]

Logging

Log Levels

| Level | Purpose | Destination | Retention |
|---|---|---|---|
| DEBUG | Performance traces | performance.log | 5 days |
| INFO | Operation logs | mcp.log | 30 days |
| WARNING | Degraded performance | mcp.log + syslog | 30 days |
| ERROR | Operation failures | mcp_error.log | 90 days |
| CRITICAL | System failures | mcp_error.log + syslog | 90 days |

Log Structure

All logs use JSON format for machine parsing:

{
  "timestamp": "2025-10-14T12:00:00.123Z",
  "level": "INFO",
  "logger": "mcp_bridge",
  "operation": "upsert_vector",
  "operation_id": "op_abc123",
  "user_id": "user_123",
  "project_id": "proj_456",
  "duration_ms": 45.2,
  "status": "success",
  "message": "Vector upserted successfully"
}
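
A minimal sketch of how records in this shape can be emitted using only the standard library; the field names follow the example above, but the formatter itself is illustrative rather than the bridge's actual logging configuration.

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with optional context fields."""

    CONTEXT_FIELDS = ("operation", "operation_id", "user_id", "project_id",
                      "duration_ms", "status")

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

logger = logging.getLogger("mcp_bridge")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context fields are passed via `extra` and picked up by the formatter
logger.info("Vector upserted successfully",
            extra={"operation": "upsert_vector", "duration_ms": 45.2, "status": "success"})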

Log Files

/var/log/mcp_bridge/
├── mcp.log # All INFO+ logs (100MB, 10 files)
├── mcp_error.log # ERROR+ logs (50MB, 10 files)
├── audit.log # Audit trail (daily rotation, 90 days)
└── performance.log # Performance traces (100MB, 5 files)

Querying Logs

Find errors in last hour:

grep -E '"level":"ERROR"' /var/log/mcp_bridge/mcp_error.log | \
  jq 'select(.timestamp > "'$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)'")'
# GNU date shown; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%S

Find slow operations (> 1s):

jq 'select(.duration_ms > 1000)' /var/log/mcp_bridge/performance.log

Count errors by operation:

grep -E '"level":"ERROR"' /var/log/mcp_bridge/mcp_error.log | \
jq -r '.operation' | sort | uniq -c | sort -rn
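
Where jq is not available, the same error-by-operation count can be produced with a few lines of Python. A sketch, assuming one JSON object per line in the log file.

import json
from collections import Counter
from pathlib import Path

def errors_by_operation(log_path="/var/log/mcp_bridge/mcp_error.log"):
    """Count ERROR entries per operation in a JSON-lines log file."""
    counts = Counter()
    for line in Path(log_path).read_text().splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if entry.get("level") == "ERROR":
            counts[entry.get("operation", "unknown")] += 1
    return counts

if __name__ == "__main__":
    for operation, count in errors_by_operation().most_common():
        print(f"{count:6d}  {operation}")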

Log Aggregation (Production)

For production, ship logs to centralized logging:

  1. Loki (included in docker-compose):

    • Grafana-native log aggregation
    • Query alongside metrics
  2. ELK Stack (alternative):

    • Configure logstash output in logging.yml
    • Ship to Elasticsearch
  3. CloudWatch Logs (AWS):

    • Use CloudWatch agent
    • Set retention policies

Health Checks

Endpoint: GET /api/v1/mcp/health

Response (Healthy):

{
  "status": "healthy",
  "timestamp": "2025-10-14T12:00:00Z",
  "version": "2.5.0",
  "service": "mcp_bridge",
  "components": {
    "database": {
      "status": "healthy",
      "type": "postgresql"
    },
    "rate_limiter": {
      "status": "healthy",
      "active_windows": 42
    },
    "services": {
      "status": "healthy",
      "operations_available": 60
    }
  },
  "system": {
    "cpu_percent": 23.4,
    "memory_percent": 45.2,
    "disk_percent": 67.8
  }
}

Response (Degraded):

{
  "status": "degraded",
  "timestamp": "2025-10-14T12:00:00Z",
  "version": "2.5.0",
  "service": "mcp_bridge",
  "components": {
    "database": {
      "status": "unhealthy",
      "error": "Connection timeout"
    },
    "rate_limiter": {
      "status": "healthy",
      "active_windows": 42
    }
  }
}
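
The handler below is a simplified sketch of how such a response can be assembled, assuming FastAPI and psutil (neither is necessarily the production implementation): each component check contributes a status, and any unhealthy component downgrades the overall status to degraded.

from datetime import datetime, timezone

import psutil
from fastapi import FastAPI

app = FastAPI()

def check_database():
    # Placeholder: replace with a real connectivity check (e.g. SELECT 1)
    return {"status": "healthy", "type": "postgresql"}

@app.get("/api/v1/mcp/health")
def health():
    components = {"database": check_database()}
    overall = "healthy" if all(c["status"] == "healthy" for c in components.values()) else "degraded"
    return {
        "status": overall,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": "2.5.0",
        "service": "mcp_bridge",
        "components": components,
        "system": {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
        },
    }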

Health Check Monitoring

Add to monitoring:

# Check every 30 seconds
watch -n 30 'curl -s http://localhost:8000/api/v1/mcp/health | jq .status'

In Prometheus:

scrape_configs:
  - job_name: 'mcp_health'
    metrics_path: '/api/v1/mcp/health'
    scrape_interval: 30s
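
A polling script along the lines of the "Monitoring Scripts" box in the architecture diagram might look like the following sketch (standard library only; the URL and interval are assumptions).

import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8000/api/v1/mcp/health"

def poll_health(interval_seconds=10):
    """Poll the health endpoint and report any non-healthy status."""
    while True:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                status = json.load(resp).get("status", "unknown")
        except OSError as exc:
            status = f"unreachable ({exc})"
        if status != "healthy":
            print(f"[{time.strftime('%Y-%m-%dT%H:%M:%S')}] health status: {status}")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    poll_health()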

Troubleshooting

High Error Rate

Symptoms: mcp_errors_total increasing rapidly

Diagnosis:

  1. Check error dashboard: http://localhost:3000/d/mcp-errors
  2. Review error logs: tail -f /var/log/mcp_bridge/mcp_error.log
  3. Check error categories:
    sum by (error_category) (rate(mcp_errors_total[5m]))

Common Causes:

  • Database connection issues → Check pg_stat_activity
  • Validation errors → Review recent API changes
  • Timeout errors → Check database query performance
  • Auth errors → Verify token/API key configuration

High Latency

Symptoms: P99 latency > 1s

Diagnosis:

  1. Check performance dashboard: http://localhost:3000/d/mcp-performance
  2. Identify slow operations (a programmatic version follows this list):
    topk(10, histogram_quantile(0.99, sum by (operation, le) (rate(mcp_operation_duration_seconds_bucket[5m]))))
  3. Check database query performance:
    SELECT query, mean_exec_time, calls
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC LIMIT 10;
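
Step 2 can also be run against the Prometheus HTTP API (GET /api/v1/query) from a script. A sketch assuming Prometheus at localhost:9090 and the standard library.

import json
import urllib.parse
import urllib.request

PROMQL = ('topk(10, histogram_quantile(0.99, '
          'sum by (operation, le) (rate(mcp_operation_duration_seconds_bucket[5m]))))')

def slowest_operations(prometheus_url="http://localhost:9090"):
    """Print the operations with the highest P99 latency over the last 5 minutes."""
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": PROMQL})
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)["data"]["result"]
    for series in results:
        operation = series["metric"].get("operation", "<all>")
        p99_seconds = float(series["value"][1])
        print(f"{operation}: {p99_seconds:.3f}s")

if __name__ == "__main__":
    slowest_operations()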

Common Causes:

  • Database queries without indexes
  • Large vector operations
  • Connection pool exhaustion
  • Quantum operations timeout

Memory Leak

Symptoms: Memory usage steadily increasing

Diagnosis:

  1. Check memory trend over 24h
  2. Review connection pool metrics
  3. Check for unclosed database connections:
    SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction';

Resolution:

  1. Restart service (immediate)
  2. Review code for connection leaks
  3. Enable connection pool debugging

Database Connection Pool Exhausted

Symptoms: mcp_active_connections > 95

Immediate Action:

# Restart application to reset connections
docker-compose restart mcp_bridge

# Or kill idle connections (run the SQL below in psql)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < now() - interval '5 minutes';

Root Cause Analysis:

  1. Check slow queries blocking connections
  2. Review connection pool configuration
  3. Check for connection leaks in code

Production Deployment

Pre-Deployment Checklist

  • Configure AlertManager with real PagerDuty/Slack credentials
  • Set up persistent volumes for Prometheus data
  • Configure log rotation policies
  • Set up SSL/TLS for Grafana
  • Create admin passwords (not default admin/admin)
  • Configure backup strategy for metrics data
  • Set up monitoring for the monitoring stack itself
  • Document escalation procedures
  • Create runbooks for all critical alerts
  • Test alert routing end-to-end
  • Configure network policies and firewall rules
  • Set up authentication for Prometheus/Grafana

Production Configuration

1. Update AlertManager (/Users/aideveloper/core/monitoring/alertmanager.yml):

global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/REAL/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_REAL_PAGERDUTY_KEY'

2. Secure Grafana:

# docker-compose.yml
environment:
  - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
  - GF_SERVER_ROOT_URL=https://grafana.yourdomain.com
  - GF_SERVER_CERT_FILE=/etc/grafana/grafana.crt
  - GF_SERVER_CERT_KEY=/etc/grafana/grafana.key

3. Configure Retention:

Retention is set with Prometheus startup flags rather than in prometheus.yml, e.g. in the Prometheus service's command in docker-compose.yml:

# docker-compose.yml (prometheus service)
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.retention.time=90d'    # keep 90 days of data
  - '--storage.tsdb.retention.size=50GB'   # or 50GB, whichever limit is hit first

4. Set Up Remote Write (for long-term storage):

remote_write:
  - url: "https://prometheus-remote-write.yourdomain.com/api/v1/write"
    queue_config:
      max_samples_per_send: 10000

Kubernetes Deployment

For Kubernetes, use Prometheus Operator:

# Add the community chart repo and install Prometheus Operator
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack

# Apply custom ServiceMonitor
kubectl apply -f k8s/servicemonitor-mcp-bridge.yaml

ServiceMonitor Example (k8s/servicemonitor-mcp-bridge.yaml):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mcp-bridge
spec:
  selector:
    matchLabels:
      app: mcp-bridge
  endpoints:
    - port: metrics
      interval: 15s

High Availability

For HA monitoring:

  1. Prometheus: Run 2+ instances with Thanos for deduplication
  2. Grafana: Run behind load balancer with shared database
  3. AlertManager: Run 3+ instances in cluster mode

Best Practices

1. Metric Naming

Good:

  • mcp_operation_duration_seconds
  • mcp_errors_total
  • mcp_cache_hits_total

Bad:

  • operation_time (missing unit)
  • errors (not descriptive)
  • cache_hits (should be counter with _total)

2. Alert Tuning

  • Start conservative: Set thresholds higher, lower gradually
  • Use percentiles: P95/P99 instead of max/avg
  • Group alerts: Use for/group_wait to avoid alert storms
  • Test alerts: Use amtool to test alert routing

3. Dashboard Design

  • One dashboard, one purpose: Don't mix overview with deep-dive
  • Use variables: Allow filtering by project/user/operation
  • Show trends: Include historical data for context
  • Link dashboards: Cross-link related dashboards

4. Log Management

  • Structure logs: Always use JSON for production
  • Add context: Include operation_id, user_id, etc.
  • Sanitize sensitive data: Redact passwords, tokens, and API keys (see the sketch below)
  • Set retention: Balance cost vs compliance requirements
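
As an illustration of the redaction point above, a logging filter can scrub sensitive keys from context fields before records reach any handler. A minimal sketch; the field list is an assumption and should match whatever your services actually log.

import logging

SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization"}

class RedactionFilter(logging.Filter):
    """Replace values of sensitive record attributes with a placeholder."""

    def filter(self, record):
        for key in list(record.__dict__):
            if key.lower() in SENSITIVE_KEYS:
                setattr(record, key, "[REDACTED]")
        return True  # never drop the record, only sanitize it

logger = logging.getLogger("mcp_bridge")
logger.addFilter(RedactionFilter())

# The `token` field is redacted before any formatter serializes the record
logger.warning("Auth retry", extra={"operation": "auth", "token": "abc123"})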

Support

Resources

Contact

  • Critical Issues: PagerDuty will automatically page on-call engineer
  • Questions: #mcp-monitoring Slack channel
  • Feature Requests: Create GitHub issue with label monitoring

Changelog

Version 2.5.0 (2025-10-14)

  • Initial comprehensive monitoring setup
  • Added 4 Grafana dashboards
  • Configured multi-tier alerting
  • Implemented structured logging
  • Added health check endpoint
  • Created Prometheus metrics

END OF MONITORING GUIDE