Monitoring
Version: 2.5.0 | Last Updated: 2025-10-14 | Status: Production Ready
Table of Contents
- Overview
- Architecture
- Quick Start
- Metrics Reference
- Dashboards
- Alerting
- Logging
- Health Checks
- Troubleshooting
- Production Deployment
Overview
The AINative Studio monitoring infrastructure provides comprehensive observability for all operations, covering performance metrics, error tracking, and system health.
Key Features
- Real-time Metrics: Prometheus-based metrics collection with 15s granularity
- Rich Dashboards: 4 pre-built Grafana dashboards for different views
- Smart Alerting: Multi-tier alerting (Critical/Warning/Info) with PagerDuty, Slack, and Email
- Structured Logging: JSON-formatted logs with context propagation
- Health Monitoring: Comprehensive health checks for all components
- Performance Tracking: P50/P95/P99 latency tracking per operation
Monitoring Stack
| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection & storage | 9090 |
| Grafana | Visualization & dashboards | 3000 |
| AlertManager | Alert routing & notification | 9093 |
| Node Exporter | System metrics | 9100 |
| Postgres Exporter | Database metrics | 9187 |
| Redis Exporter | Cache metrics | 9121 |
| cAdvisor | Container metrics | 8080 |
| Loki | Log aggregation (optional) | 3100 |
Architecture
┌─────────────────────────────────────────────────────────────┐
│ AINative Studio API │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /mcp/metrics │ │ /mcp/health │ │
│ │ (Prometheus) │ │ (Health Check) │ │
│ └────────┬────────┘ └────────┬────────┘ │
└───────────┼────────────────────┼─────────────────────────────┘
│ │
│ scrape (15s) │ poll (10s)
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Prometheus │ │ Monitoring │
│ Metrics │ │ Scripts │
└───────┬───────┘ └───────────────┘
│
│ evaluate rules (30s)
▼
┌───────────────┐
│ AlertManager │
└───────┬───────┘
│
├─── Critical ──→ PagerDuty + Slack
├─── Warning ───→ Slack
└─── Info ──────→ Email
┌───────────────┐
│ Grafana │ ←── Query ─── Prometheus
│ Dashboards │ ←── Query ─── Loki (logs)
└───────────────┘
Quick Start
1. Start Monitoring Stack
cd /Users/aideveloper/core/monitoring
docker-compose up -d
2. Verify Services
# Check all services are running
docker-compose ps
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Check Grafana
open http://localhost:3000
# Login: admin / admin
3. Configure AINative Studio
Add to /Users/aideveloper/core/src/backend/app/main.py:
from prometheus_client import make_asgi_app
# Mount Prometheus metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
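Once the endpoint is mounted, the raw exposition output can be spot-checked with curl http://localhost:8000/metrics (assuming the API listens on port 8000, as in the health check step below).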
4. Test Health Endpoint
curl http://localhost:8000/api/v1/mcp/health
5. Access Dashboards
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- AlertManager: http://localhost:9093
Metrics Reference
Operation Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_operations_total | Counter | operation, status | Total operations by type and outcome |
| mcp_operation_duration_seconds | Histogram | operation | Operation duration distribution |
| mcp_operation_latency_seconds | Summary | operation | Operation latency summary (P50/P95/P99) |
Error Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_errors_total | Counter | error_category, operation | Errors by category and operation |
| mcp_rate_limit_hits_total | Counter | user_tier, operation | Rate limit violations |
Resource Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_active_connections | Gauge | - | Active database connections |
| mcp_storage_bytes | Gauge | storage_type | Storage usage in bytes |
| mcp_vector_dimensions | Gauge | namespace | Vector dimensions being processed |
Category-Specific Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_vector_operations_total | Counter | operation_type | Vector operations count |
| mcp_quantum_operations_total | Counter | operation_type | Quantum operations count |
| mcp_file_operations_total | Counter | operation_type | File operations count |
| mcp_quantum_compression_ratio | Histogram | - | Quantum compression efficiency |
Cache Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_cache_hits_total | Counter | cache_type | Cache hits |
| mcp_cache_misses_total | Counter | cache_type | Cache misses |
Database Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| mcp_database_query_duration_seconds | Histogram | query_type | Database query duration |
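These metrics are registered in the application with prometheus_client. As a hedged sketch of how a few of them could be defined and recorded (names and label sets follow the tables above; the track_operation helper is illustrative, not part of the actual codebase):
from time import perf_counter
from contextlib import contextmanager
from prometheus_client import Counter, Histogram, Gauge

# Names and label sets follow the Metrics Reference tables above.
MCP_OPERATIONS_TOTAL = Counter(
    "mcp_operations_total", "Total operations by type and outcome",
    ["operation", "status"],
)
MCP_OPERATION_DURATION = Histogram(
    "mcp_operation_duration_seconds", "Operation duration distribution",
    ["operation"],
)
MCP_ACTIVE_CONNECTIONS = Gauge(
    "mcp_active_connections", "Active database connections",
)

@contextmanager
def track_operation(operation: str):
    """Illustrative helper: time an operation and record its outcome."""
    start = perf_counter()
    try:
        yield
        MCP_OPERATIONS_TOTAL.labels(operation=operation, status="success").inc()
    except Exception:
        MCP_OPERATIONS_TOTAL.labels(operation=operation, status="error").inc()
        raise
    finally:
        MCP_OPERATION_DURATION.labels(operation=operation).observe(perf_counter() - start)

# Usage:
# with track_operation("upsert_vector"):
#     ...perform the vector upsert...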
Dashboards
1. MCP Overview Dashboard (mcp-overview)
Purpose: High-level system health and performance
Panels:
- Total Operations per Second (graph)
- Success Rate (stat - green/yellow/red thresholds)
- Error Rate (stat)
- Active Database Connections (gauge)
- Operation Latency P50/P95/P99 (graph)
- System CPU & Memory Usage
- Error Breakdown by Category (pie chart)
- Rate Limit Hits
- Cache Hit Rate
Use Cases:
- Daily operations monitoring
- Quick health assessment
- Incident detection
2. MCP Operations Dashboard (mcp-operations)
Purpose: Deep dive into operation-level metrics
Panels:
- Vector Operations Rate
- Quantum Operations Rate
- File Operations Rate
- Top 10 Slowest Operations (table)
- Operations by Status (timeseries)
- Database Query Duration by Type
- Operation Throughput
- Operation Duration Distribution (heatmap)
Use Cases:
- Performance optimization
- Operation troubleshooting
- Capacity planning
3. MCP Performance Dashboard (mcp-performance)
Purpose: Performance analysis and SLO tracking
Panels:
- Response Time Distribution (P50/P75/P90/P95/P99)
- Throughput (req/s)
- P99 Latency by Operation
- Database Connection Pool Utilization
- Storage Usage
- Quantum Compression Ratio
- Vector Dimensions Distribution
Alerts:
- P99 Latency > 1s
- Connection Pool > 85%
Use Cases:
- SLO monitoring
- Performance regression detection
- Resource optimization
4. MCP Errors Dashboard (mcp-errors)
Purpose: Error tracking and troubleshooting
Panels:
- Error Rate Over Time
- Current Error Rate (stat)
- Total Errors (24h)
- Error Rate Percentage (gauge)
- Errors by Category (pie chart)
- Errors by Operation (pie chart)
- Top 10 Error-Prone Operations (table)
- Error Trend (24h)
- Rate Limit Violations
- Failed vs Successful Operations
- Database/Validation/Timeout/Auth Errors (stats)
Alerts:
- Error rate > 0.5 errors/sec
Use Cases:
- Incident response
- Error pattern analysis
- Debugging
Alerting
Alert Tiers
Critical (PagerDuty + Slack)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| HighErrorRate | Error rate > 5% | 5 minutes | Immediate page |
| HighLatency | P99 > 1s | 10 minutes | Immediate page |
| DatabaseConnectionPoolExhausted | Connections > 95 | 2 minutes | Immediate page |
| MCPBridgeDown | Service unreachable | 2 minutes | Immediate page |
| DatabaseQueryTimeout | Timeouts > 10 | 5 minutes | Immediate page |
Warning (Slack)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| ElevatedErrorRate | Error rate > 1% | 10 minutes | Investigate |
| IncreasedLatency | P95 > 500ms | 15 minutes | Monitor |
| HighRateLimitHits | Rate limit hits > 50% | 15 minutes | Review quotas |
| HighMemoryUsage | Memory > 80% | 10 minutes | Check for leaks |
| HighCPUUsage | CPU > 80% | 10 minutes | Investigate |
| DatabaseConnectionsHigh | Connections > 80 | 10 minutes | Monitor |
Info (Email)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| HighThroughput | > 1000 ops/s | 5 minutes | Capacity planning |
| StorageGrowth | Growth > 10%/hour | 1 hour | Monitor |
| QuantumOperationsIncreasing | > 100 ops/s | 30 minutes | Note trend |
Alert Routing
Critical Alert Flow:
Alert Fires → Prometheus → AlertManager
→ PagerDuty (SMS/Phone)
→ Slack #mcp-alerts-critical
→ Include runbook URL
Warning Alert Flow:
Alert Fires → Prometheus → AlertManager
→ Slack #mcp-alerts-warning
→ Include dashboard URL
Info Alert Flow:
Alert Fires → Prometheus → AlertManager
→ Email to mcp-team@example.com
→ Daily digest format
Runbooks
Each critical alert includes a runbook URL. Create runbooks at:
/Users/aideveloper/core/docs/runbooks/
├── mcp-high-error-rate.md
├── mcp-high-latency.md
├── db-connection-pool.md
├── mcp-service-down.md
└── db-timeout.md
Runbook Template:
# Alert: [Alert Name]
## Severity
[Critical/Warning/Info]
## Description
[What this alert means]
## Impact
[User impact and business impact]
## Diagnosis
1. Check [specific dashboard]
2. Review logs: `tail -f /var/log/mcp_bridge/mcp_error.log`
3. Query metrics: [example PromQL query]
## Resolution
1. [Step-by-step resolution]
2. [Include rollback if needed]
## Escalation
If not resolved in [time], escalate to [team/person]
Logging
Log Levels
| Level | Purpose | Destination | Retention |
|---|---|---|---|
| DEBUG | Performance traces | performance.log | 5 days |
| INFO | Operation logs | mcp.log | 30 days |
| WARNING | Degraded performance | mcp.log + syslog | 30 days |
| ERROR | Operation failures | mcp_error.log | 90 days |
| CRITICAL | System failures | mcp_error.log + syslog | 90 days |
Log Structure
All logs use JSON format for machine parsing:
{
"timestamp": "2025-10-14T12:00:00.123Z",
"level": "INFO",
"logger": "mcp_bridge",
"operation": "upsert_vector",
"operation_id": "op_abc123",
"user_id": "user_123",
"project_id": "proj_456",
"duration_ms": 45.2,
"status": "success",
"message": "Vector upserted successfully"
}
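One way to emit records in this shape with the standard library alone (a hedged sketch; the real formatter lives in the application's logging setup, and the extra-field plumbing shown here is illustrative):
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    """Render log records as single-line JSON matching the structure above."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context fields (operation, operation_id, user_id, ...) are passed via `extra=`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("mcp_bridge")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Vector upserted successfully",
    extra={"context": {"operation": "upsert_vector", "duration_ms": 45.2, "status": "success"}},
)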
Log Files
/var/log/mcp_bridge/
├── mcp.log # All INFO+ logs (100MB, 10 files)
├── mcp_error.log # ERROR+ logs (50MB, 10 files)
├── audit.log # Audit trail (daily rotation, 90 days)
└── performance.log # Performance traces (100MB, 5 files)
Querying Logs
Find errors in last hour:
grep -E '"level":"ERROR"' /var/log/mcp_bridge/mcp_error.log | \
jq 'select(.timestamp > "'$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)'")'
Find slow operations (> 1s):
jq 'select(.duration_ms > 1000)' /var/log/mcp_bridge/performance.log
Count errors by operation:
grep -E '"level":"ERROR"' /var/log/mcp_bridge/mcp_error.log | \
jq -r '.operation' | sort | uniq -c | sort -rn
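The same error-by-operation count can be produced without jq, for example with a short Python one-off (a sketch assuming the log path above):
import json
from collections import Counter

counts = Counter()
with open("/var/log/mcp_bridge/mcp_error.log") as fh:
    for line in fh:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or non-JSON lines
        if record.get("level") == "ERROR":
            counts[record.get("operation", "unknown")] += 1

for operation, count in counts.most_common():
    print(f"{count:6d}  {operation}")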
Log Aggregation (Production)
For production, ship logs to centralized logging:
- Loki (included in docker-compose):
  - Grafana-native log aggregation
  - Query alongside metrics
- ELK Stack (alternative):
  - Configure logstash output in logging.yml
  - Ship to Elasticsearch
- CloudWatch Logs (AWS):
  - Use CloudWatch agent
  - Set retention policies
Health Checks
Endpoint: GET /api/v1/mcp/health
Response (Healthy):
{
"status": "healthy",
"timestamp": "2025-10-14T12:00:00Z",
"version": "2.5.0",
"service": "mcp_bridge",
"components": {
"database": {
"status": "healthy",
"type": "postgresql"
},
"rate_limiter": {
"status": "healthy",
"active_windows": 42
},
"services": {
"status": "healthy",
"operations_available": 60
}
},
"system": {
"cpu_percent": 23.4,
"memory_percent": 45.2,
"disk_percent": 67.8
}
}
Response (Degraded):
{
"status": "degraded",
"timestamp": "2025-10-14T12:00:00Z",
"version": "2.5.0",
"service": "mcp_bridge",
"components": {
"database": {
"status": "unhealthy",
"error": "Connection timeout"
},
"rate_limiter": {
"status": "healthy",
"active_windows": 42
}
}
}
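For reference, a minimal sketch of how such an endpoint could be implemented with FastAPI and psutil (the field names mirror the responses above; check_database and the omitted rate limiter/service checks are stand-ins for the real implementations):
from datetime import datetime, timezone
import psutil
from fastapi import APIRouter

router = APIRouter()

async def check_database() -> dict:
    """Stand-in for a real connectivity check (e.g. SELECT 1 against PostgreSQL)."""
    try:
        # await session.execute(text("SELECT 1"))
        return {"status": "healthy", "type": "postgresql"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}

@router.get("/api/v1/mcp/health")
async def health():
    components = {
        "database": await check_database(),
        # rate_limiter and services checks would be added the same way
    }
    degraded = any(c["status"] != "healthy" for c in components.values())
    return {
        "status": "degraded" if degraded else "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "version": "2.5.0",
        "service": "mcp_bridge",
        "components": components,
        "system": {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
        },
    }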
Health Check Monitoring
Add to monitoring:
# Check every 30 seconds
watch -n 30 'curl -s http://localhost:8000/api/v1/mcp/health | jq .status'
In Prometheus (caveat: Prometheus scrapes expect the exposition format, not JSON, so this endpoint is better probed via the Blackbox exporter or an external polling script such as the one sketched below):
scrape_configs:
  - job_name: 'mcp_health'
    metrics_path: '/api/v1/mcp/health'
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:8000']  # API host/port from the Quick Start
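The "Monitoring Scripts" box in the architecture diagram can be as simple as a polling loop; a hedged sketch (endpoint and interval as above, the alerting hook is left to you):
import sys
import time
import requests

HEALTH_URL = "http://localhost:8000/api/v1/mcp/health"

def poll(interval_seconds: int = 10) -> None:
    while True:
        try:
            response = requests.get(HEALTH_URL, timeout=5)
            status = response.json().get("status", "unknown")
        except requests.RequestException as exc:
            status = f"unreachable ({exc})"
        print(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} mcp_bridge status: {status}")
        if status != "healthy":
            # Hook in paging/Slack here, or exit non-zero for cron-based checks.
            sys.stderr.write(f"health degraded: {status}\n")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    poll()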
Troubleshooting
High Error Rate
Symptoms: mcp_errors_total increasing rapidly
Diagnosis:
- Check error dashboard: http://localhost:3000/d/mcp-errors
- Review error logs:
  tail -f /var/log/mcp_bridge/mcp_error.log
- Check error categories:
  sum by (error_category) (rate(mcp_errors_total[5m]))
Common Causes:
- Database connection issues → Check pg_stat_activity
- Validation errors → Review recent API changes
- Timeout errors → Check database query performance
- Auth errors → Verify token/API key configuration
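The category breakdown from the PromQL above can also be pulled programmatically over the Prometheus HTTP API, for use in runbooks or chat tooling (a sketch assuming Prometheus on localhost:9090):
import requests

PROMETHEUS = "http://localhost:9090"
QUERY = "sum by (error_category) (rate(mcp_errors_total[5m]))"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    category = series["metric"].get("error_category", "unknown")
    rate = float(series["value"][1])
    print(f"{category:20s} {rate:.3f} errors/sec")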
High Latency
Symptoms: P99 latency > 1s
Diagnosis:
- Check performance dashboard: http://localhost:3000/d/mcp-performance
- Identify slow operations:
  topk(10, histogram_quantile(0.99, rate(mcp_operation_duration_seconds_bucket[5m])))
- Check database query performance:
  SELECT query, mean_exec_time, calls
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC LIMIT 10;
Common Causes:
- Database queries without indexes
- Large vector operations
- Connection pool exhaustion
- Quantum operations timeout
Memory Leak
Symptoms: Memory usage steadily increasing
Diagnosis:
- Check memory trend over 24h
- Review connection pool metrics
- Check for unclosed database connections:
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction';
Resolution:
- Restart service (immediate)
- Review code for connection leaks
- Enable connection pool debugging
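If the service uses SQLAlchemy for its database layer (an assumption), pool debugging can be enabled by raising the pool logger's level, which traces every checkout/checkin and helps confirm a leak before changing code:
import logging

logging.basicConfig(level=logging.INFO)
# Log every connection checkout/checkin from SQLAlchemy's pool (verbose; enable temporarily).
logging.getLogger("sqlalchemy.pool").setLevel(logging.DEBUG)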
Database Connection Pool Exhausted
Symptoms: mcp_active_connections > 95
Immediate Action:
# Restart application to reset connections
docker-compose restart mcp_bridge
# Or kill idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < now() - interval '5 minutes';
Root Cause Analysis:
- Check slow queries blocking connections
- Review connection pool configuration
- Check for connection leaks in code
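When reviewing the pool configuration (again assuming SQLAlchemy's async engine manages connections; the URL and values below are placeholders), these are the relevant knobs; pool_pre_ping and a bounded overflow keep stale or leaked connections from exhausting PostgreSQL's limit:
from sqlalchemy.ext.asyncio import create_async_engine

# Illustrative values; size the pool below PostgreSQL's max_connections,
# leaving headroom for migrations, exporters, and manual psql sessions.
engine = create_async_engine(
    "postgresql+asyncpg://user:pass@localhost/ainative",  # placeholder DSN
    pool_size=20,          # steady-state connections held open
    max_overflow=10,       # extra connections allowed under burst load
    pool_timeout=30,       # seconds to wait for a free connection before erroring
    pool_recycle=1800,     # recycle connections older than 30 minutes
    pool_pre_ping=True,    # validate connections before use, dropping dead ones
)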
Production Deployment
Pre-Deployment Checklist
- Configure AlertManager with real PagerDuty/Slack credentials
- Set up persistent volumes for Prometheus data
- Configure log rotation policies
- Set up SSL/TLS for Grafana
- Create admin passwords (not default admin/admin)
- Configure backup strategy for metrics data
- Set up monitoring for the monitoring stack itself
- Document escalation procedures
- Create runbooks for all critical alerts
- Test alert routing end-to-end
- Configure network policies and firewall rules
- Set up authentication for Prometheus/Grafana
Production Configuration
1. Update AlertManager (/Users/aideveloper/core/monitoring/alertmanager.yml):
global:
slack_api_url: 'https://hooks.slack.com/services/YOUR/REAL/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
receivers:
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_REAL_PAGERDUTY_KEY'
2. Secure Grafana:
# docker-compose.yml
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
- GF_SERVER_ROOT_URL=https://grafana.yourdomain.com
- GF_SERVER_CERT_FILE=/etc/grafana/grafana.crt
- GF_SERVER_CERT_KEY=/etc/grafana/grafana.key
3. Configure Retention (retention is controlled by Prometheus startup flags rather than entries in prometheus.yml, e.g. in the container's command):
# docker-compose.yml (prometheus service)
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.retention.time=90d'    # Keep 90 days of data
  - '--storage.tsdb.retention.size=50GB'   # Or cap at 50GB, whichever limit is hit first
4. Set Up Remote Write (for long-term storage):
remote_write:
- url: "https://prometheus-remote-write.yourdomain.com/api/v1/write"
queue_config:
max_samples_per_send: 10000
Kubernetes Deployment
For Kubernetes, use Prometheus Operator:
# Install Prometheus Operator
helm install prometheus prometheus-community/kube-prometheus-stack
# Apply custom ServiceMonitor
kubectl apply -f k8s/servicemonitor-mcp-bridge.yaml
ServiceMonitor Example (k8s/servicemonitor-mcp-bridge.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: mcp-bridge
spec:
selector:
matchLabels:
app: mcp-bridge
endpoints:
- port: metrics
interval: 15s
High Availability
For HA monitoring:
- Prometheus: Run 2+ instances with Thanos for deduplication
- Grafana: Run behind load balancer with shared database
- AlertManager: Run 3+ instances in cluster mode
Best Practices
1. Metric Naming
✅ Good:
- mcp_operation_duration_seconds
- mcp_errors_total
- mcp_cache_hits_total
❌ Bad:
- operation_time (missing unit)
- errors (not descriptive)
- cache_hits (should be a counter with _total)
2. Alert Tuning
- Start conservative: Set thresholds higher, lower gradually
- Use percentiles: P95/P99 instead of max/avg
- Group alerts: Use for/group_wait to avoid alert storms
- Test alerts: Use amtool to test alert routing
3. Dashboard Design
- One dashboard, one purpose: Don't mix overview with deep-dive
- Use variables: Allow filtering by project/user/operation
- Show trends: Include historical data for context
- Link dashboards: Cross-link related dashboards
4. Log Management
- Structure logs: Always use JSON for production
- Add context: Include operation_id, user_id, etc.
- Sanitize sensitive data: Redact passwords, tokens
- Set retention: Balance cost vs compliance requirements
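For the "sanitize sensitive data" point, a logging filter is one place to enforce redaction centrally (a sketch; the key list is an assumption and should match whatever your handlers actually emit, e.g. the context dict used in the Log Structure example):
import logging

SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}

class RedactSensitiveData(logging.Filter):
    """Replace sensitive values in the structured context before records are emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        context = getattr(record, "context", None)
        if isinstance(context, dict):
            for key in context:
                if key.lower() in SENSITIVE_KEYS:
                    context[key] = "[REDACTED]"
        return True  # never drop the record, only scrub it

logging.getLogger("mcp_bridge").addFilter(RedactSensitiveData())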
Support
Resources
- Documentation: /Users/aideveloper/core/docs/
- Runbooks: /Users/aideveloper/core/docs/runbooks/
- Dashboards: http://localhost:3000
- Metrics: http://localhost:9090
Contact
- Critical Issues: PagerDuty will automatically page on-call engineer
- Questions: #mcp-monitoring Slack channel
- Feature Requests: Create a GitHub issue with the label monitoring
Changelog
Version 2.5.0 (2025-10-14)
- Initial comprehensive monitoring setup
- Added 4 Grafana dashboards
- Configured multi-tier alerting
- Implemented structured logging
- Added health check endpoint
- Created Prometheus metrics
END OF MONITORING GUIDE