[blog-api-server] Monitoring Dashboard and Alerting System

Overview

blog-api-server now exposes Prometheus metrics and sends alerts through Slack and email, so server status can be tracked in real time and problems are flagged as soon as an alert rule fires.

Monitoring Architecture

Prometheus Metrics

Basic HTTP Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| http_requests_total | Counter | method, endpoint, status | Total HTTP requests |
| http_request_duration_seconds | Histogram | method, endpoint | Request latency distribution |
| http_errors_total | Counter | method, endpoint, status | Total error count |
| active_requests | Gauge | - | Current active requests |

Business Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| git_operations_total | Counter | operation, status | Git operation count |
| git_operation_duration_seconds | Histogram | operation | Git operation duration |
| translation_requests_total | Counter | source_lang, target_lang, status | Translation request count |
| translation_duration_seconds | Histogram | - | Translation duration |
| post_operations_total | Counter | operation, language, status | Post operation count |

Metrics Collection Code

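# prometheus_exporter.py
import time

from prometheus_client import Counter, Histogram, Gauge
# BaseHTTPMiddleware and Request come from Starlette (also usable under FastAPI)
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request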

http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Auto-collection in middleware
class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        start_time = time.time()
        response = await call_next(request)
        duration = time.time() - start_time

        # The raw URL path is the label value, so paths with embedded IDs
        # each produce their own time series.
        http_requests_total.labels(
            method=request.method,
            endpoint=request.url.path,
            status=str(response.status_code)
        ).inc()

        http_request_duration_seconds.labels(
            method=request.method,
            endpoint=request.url.path
        ).observe(duration)

        # Hand the response back to the caller
        return response

Alerting System

Alert Rules

| Rule Name | Condition | Severity | Cooldown |
|---|---|---|---|
| High Error Rate | error_rate > 5% | WARNING | 5 min |
| Critical Error Rate | error_rate > 20% | CRITICAL | 1 min |
| Slow Response Time | avg_response_time > 2000ms | WARNING | 10 min |
| High Slow Request Rate | slow_request_rate > 10% | WARNING | 5 min |
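
Each rule in the table pairs a threshold with a cooldown. The following is a minimal sketch of how such a rule might be evaluated. The `AlertRule` fields, the `metric_key` lookup, and the injectable `now` argument are assumptions made for illustration and are not taken from alerting.py.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class AlertRule:
    """Hypothetical rule shape: fire when a metric exceeds its threshold,
    then stay quiet until the cooldown has elapsed."""
    name: str
    metric_key: str            # assumed key into the metrics dict
    threshold: float
    severity: str
    cooldown_seconds: float
    last_triggered: Optional[float] = None

    def should_trigger(self, metrics: Dict[str, Any],
                       now: Optional[float] = None) -> bool:
        # `now` can be injected so the cooldown logic is testable
        now = time.monotonic() if now is None else now
        value = metrics.get(self.metric_key, 0)
        if value <= self.threshold:
            return False
        # Suppress repeats while the rule is still cooling down
        if (self.last_triggered is not None
                and now - self.last_triggered < self.cooldown_seconds):
            return False
        self.last_triggered = now
        return True


# The two error-rate rows of the table, expressed with the sketch above
HIGH_ERROR_RATE = AlertRule(
    "High Error Rate", "error_rate_percent", 5.0, "WARNING", 5 * 60)
CRITICAL_ERROR_RATE = AlertRule(
    "Critical Error Rate", "error_rate_percent", 20.0, "CRITICAL", 60)
```

Recording `last_triggered` only when the rule actually fires means a condition that stays above the threshold produces one alert per cooldown window rather than one per check.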

Alert Channels

Slack Webhook

  • Supports all severity levels
  • Color-coded severity (INFO: green, WARNING: orange, ERROR: red, CRITICAL: dark red)

Email

  • CRITICAL level only
  • SMTP-based email delivery
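
The channel rules above could be expressed roughly as follows. This is a sketch under assumptions: the function names and the hex color values are hypothetical (the post names only the colors), and the payload uses the `attachments` form of a Slack incoming-webhook body.

```python
import json
from typing import Any, Dict, List

# Hex values are illustrative; the post only names the colors.
SEVERITY_COLORS = {
    "INFO": "#36a64f",      # green
    "WARNING": "#ff9900",   # orange
    "ERROR": "#e01e5a",     # red
    "CRITICAL": "#8b0000",  # dark red
}


def channels_for(severity: str) -> List[str]:
    """Slack receives every severity; email is added for CRITICAL only."""
    channels = ["slack"]
    if severity == "CRITICAL":
        channels.append("email")
    return channels


def build_slack_payload(title: str, message: str,
                        severity: str) -> Dict[str, Any]:
    """Build a webhook body whose attachment color reflects the severity."""
    return {
        "attachments": [
            {
                # Unknown severities fall back to the INFO color
                "color": SEVERITY_COLORS.get(severity, SEVERITY_COLORS["INFO"]),
                "title": title,
                "text": message,
            }
        ]
    }


payload = build_slack_payload(
    "Alert: High Error Rate", "Error Rate: 7.5%", "WARNING")
# This JSON string is what would be POSTed to SLACK_WEBHOOK_URL
body = json.dumps(payload)
```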

AlertManager Code

# alerting.py
from typing import Any, Dict

class AlertManager:
    def check_and_alert(self, metrics: Dict[str, Any]):
        for rule in self.rules:
            if rule.should_trigger(metrics):
                self._send_alert(rule, metrics)

    def _send_alert(self, rule: AlertRule, metrics: Dict[str, Any]):
        message = f"""
Alert Rule: {rule.name}
Condition: {rule.condition}

Current Metrics:
- Total Requests: {metrics.get('total_requests', 0)}
- Error Count: {metrics.get('error_count', 0)}
- Error Rate: {metrics.get('error_rate_percent', 0)}%
- Slow Requests: {metrics.get('slow_request_count', 0)}
"""
        self.slack.send(
            title=f"Alert: {rule.name}",
            message=message.strip(),
            severity=rule.severity
        )
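
The `metrics` dictionary passed to `check_and_alert` has to come from somewhere. One way it might be assembled from recent request samples is sketched below. The `summarize_requests` helper, the `(status_code, duration_ms)` sample shape, and the choice to count only 5xx responses as errors are assumptions; the key names follow the alert message above and the rule conditions.

```python
from typing import Any, Dict, List, Tuple


def summarize_requests(samples: List[Tuple[int, float]],
                       slow_threshold_ms: float = 1000.0) -> Dict[str, Any]:
    """Aggregate (status_code, duration_ms) samples into alert metrics."""
    total = len(samples)
    # Assumption: only server errors (5xx) count toward the error rate
    error_count = sum(1 for status, _ in samples if status >= 500)
    slow_count = sum(1 for _, ms in samples if ms > slow_threshold_ms)
    total_ms = sum(ms for _, ms in samples)

    def percent(part: int) -> float:
        # Guard against division by zero when no requests were seen
        return round(100.0 * part / total, 2) if total else 0.0

    return {
        "total_requests": total,
        "error_count": error_count,
        "error_rate_percent": percent(error_count),
        "slow_request_count": slow_count,
        "slow_request_rate": percent(slow_count),
        "avg_response_time": round(total_ms / total, 2) if total else 0.0,
    }


snapshot = summarize_requests(
    [(200, 120.0), (500, 80.0), (200, 1500.0), (404, 100.0)])
```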

Dashboard Configuration

Monitoring Endpoints

| Path | Auth | Description |
|---|---|---|
| /health | Not Required | Server health check |
| /metrics | Required | JSON format metrics |
| /metrics/prometheus | Not Required | Prometheus format metrics |
| /metrics/reset | Required | Reset metrics |
| /dashboard | Not Required | Web dashboard |
| /alerts/rules | Required | Alert rules list |
| /alerts/send | Required | Manual alert send |

Request Tracking Middleware

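# middleware.py
import time
import uuid

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class MonitoringMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # Unique ID so a single request can be traced through the logs
        request_id = str(uuid.uuid4())
        start_time = time.time()

        response = await call_next(request)
        # Elapsed time, converted from seconds to milliseconds
        process_time = (time.time() - start_time) * 1000

        # Add tracking info to response headers
        response.headers["X-Process-Time"] = f"{process_time:.2f}"
        response.headers["X-Request-ID"] = request_id

        return response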

Environment Variables

# Slack Alerts
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

# Email Alerts
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
SMTP_USERNAME=your-email@gmail.com
SMTP_PASSWORD=your-app-password
ALERT_FROM_EMAIL=your-email@gmail.com
ALERT_TO_EMAILS=admin@example.com,ops@example.com

# Threshold Settings
SLOW_REQUEST_THRESHOLD=1000  # milliseconds (1 second)
VERY_SLOW_THRESHOLD=3000     # milliseconds (3 seconds)
MAX_BODY_LOG_LENGTH=1000
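
A sketch of how these variables might be read at startup is shown below. The `MonitoringSettings` class, the `load_settings` and `classify_duration` helpers, and the fallback defaults are hypothetical; the defaults simply mirror the sample values above, and the thresholds are treated as milliseconds.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class MonitoringSettings:
    slack_webhook_url: str
    smtp_server: str
    smtp_port: int
    alert_to_emails: List[str]
    slow_request_threshold_ms: int
    very_slow_threshold_ms: int


def load_settings(env: Mapping[str, str] = os.environ) -> MonitoringSettings:
    """Parse monitoring settings from an environment-like mapping."""
    # ALERT_TO_EMAILS is a comma-separated list; drop blanks and whitespace
    recipients = [
        addr.strip()
        for addr in env.get("ALERT_TO_EMAILS", "").split(",")
        if addr.strip()
    ]
    return MonitoringSettings(
        slack_webhook_url=env.get("SLACK_WEBHOOK_URL", ""),
        smtp_server=env.get("SMTP_SERVER", ""),
        smtp_port=int(env.get("SMTP_PORT", "587")),
        alert_to_emails=recipients,
        slow_request_threshold_ms=int(env.get("SLOW_REQUEST_THRESHOLD", "1000")),
        very_slow_threshold_ms=int(env.get("VERY_SLOW_THRESHOLD", "3000")),
    )


def classify_duration(duration_ms: float, settings: MonitoringSettings) -> str:
    """Bucket a request duration using the two thresholds."""
    if duration_ms > settings.very_slow_threshold_ms:
        return "very_slow"
    if duration_ms > settings.slow_request_threshold_ms:
        return "slow"
    return "normal"
```

Passing the mapping in as an argument keeps the parser independent of the real process environment, which makes it easy to exercise with a plain dictionary.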

Future Plans

  1. Grafana Integration: Visualize Prometheus data in Grafana dashboard
  2. Dynamic Alert Rules: API to add/remove alert rules at runtime
  3. Metrics Retention: Long-term metrics storage and analysis
  4. Enhanced Health Checks: Dependency service (Git, LLM) status checks

Conclusion

With Prometheus metrics and Slack/Email alerting in place, blog-api-server now has the infrastructure for real-time monitoring and fast incident response.


Korean Version: 한국어 버전
