Kubernetes Monitoring - Setting Up Prometheus and Grafana

A detailed guide on implementing enterprise-grade monitoring in Kubernetes using Prometheus and Grafana, with practical examples and best practices.

Introduction

Setting up robust monitoring in Kubernetes is crucial for maintaining healthy clusters and applications. This guide walks through implementing Prometheus and Grafana, along with related cloud native tools, for comprehensive monitoring.

Cloud Native Monitoring Stack

1. Prometheus (CNCF Graduated)

Core monitoring and alerting toolkit:

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Stack (includes Grafana)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.scrapeInterval=30s

# Verify installation
kubectl get pods -n monitoring
kubectl get svc -n monitoring

# Port forward Prometheus
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090

# Check Prometheus targets
curl localhost:9090/api/v1/targets | jq .
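
With the port-forward above still running, a quick PromQL sanity check over the HTTP API confirms that targets are reporting in (count(up==1) is just an illustrative query):

# Count targets currently reporting up
curl -s 'http://localhost:9090/api/v1/query?query=count(up==1)' | jq -r '.data.result[0].value[1]'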

2. Grafana (Grafana Labs)

Visualization and dashboarding:

# Access Grafana (if installed with prometheus-stack)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Get Grafana admin password
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode

# Import dashboards via API
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"dashboard": {...}, "overwrite": true}' \
  http://admin:password@localhost:3000/api/dashboards/db

# Check Grafana health
kubectl exec -n monitoring deploy/prometheus-grafana -- curl localhost:3000/api/health
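
Later dashboard-management commands assume a $GRAFANA_API_KEY. On Grafana versions that still support legacy API keys, one can be minted as below (newer releases use service account tokens instead); substitute the admin password retrieved above:

# Mint an API key for automation (legacy endpoint; admin password is a placeholder)
GRAFANA_API_KEY=$(curl -s -X POST \
  -H "Content-Type: application/json" \
  -d '{"name":"automation","role":"Editor"}' \
  http://admin:<admin-password>@localhost:3000/api/auth/keys | jq -r .key)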

3. Thanos (CNCF Incubating)

For long-term storage and high availability:

# Install Thanos (object storage settings go in a values file; passing
# multi-line YAML through --set does not work reliably)
helm repo add bitnami https://charts.bitnami.com/bitnami
cat > thanos-values.yaml <<EOF
objstoreConfig: |
  type: S3
  config:
    bucket: thanos
    endpoint: s3.amazonaws.com
EOF
helm install thanos bitnami/thanos --namespace monitoring --values thanos-values.yaml

# Set up the Thanos sidecar on the Prometheus created by kube-prometheus-stack
kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type=merge \
  -p '{"spec":{"thanos":{"image":"quay.io/thanos/thanos:v0.31.0"}}}'

# Query metrics via Thanos Query (the store address assumes a Service exposing the sidecar's gRPC port 10901)
thanos query \
  --http-address=0.0.0.0:9090 \
  --store=thanos-sidecar.monitoring.svc.cluster.local:10901
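
The bucket.yaml referenced by Thanos CLI commands later in this guide is a plain objstore client config. A minimal S3 sketch, with credentials left as placeholders:

# bucket.yaml - Thanos object storage client config
type: S3
config:
  bucket: thanos
  endpoint: s3.amazonaws.com
  access_key: <access-key>
  secret_key: <secret-key>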

4. Loki (Grafana Labs)

Log aggregation system:

# Install Loki stack
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set grafana.enabled=false \
  --set prometheus.enabled=false

# Add Loki datasource to Grafana (the kube-prometheus-stack sidecar picks up
# ConfigMaps labeled grafana_datasource)
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  loki.yaml: |
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      version: 1
EOF

# Query logs
logcli query '{app="nginx"}' --addr="http://loki:3100"
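
LogQL also supports line filters and time bounds at query time; for example (assuming the same Loki address is reachable):

# Filter nginx logs for "error" over the last hour, capped at 100 lines
logcli query '{app="nginx"} |= "error"' --addr="http://loki:3100" --since=1h --limit=100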

Monitoring Setup Commands

Basic Monitoring Setup

# Create monitoring namespace
kubectl create namespace monitoring

# Label namespace for monitoring
kubectl label namespace monitoring monitoring=enabled

# Create ServiceMonitor for your app (kube-prometheus-stack only selects
# ServiceMonitors carrying the Helm release label by default)
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: your-app
  endpoints:
  - port: metrics
EOF

# Check ServiceMonitor status
kubectl get servicemonitor -n monitoring
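
A ServiceMonitor selects Services, not Pods, so the target Service must carry the app: your-app label and expose a named metrics port. A minimal sketch (the namespace and port number are assumptions; adjust for your app):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: your-app
  namespace: monitoring
  labels:
    app: your-app
spec:
  selector:
    app: your-app
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
EOF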

Alert Management

# Create AlertManager config
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  route:
    receiver: 'slack'
  receivers:
  - name: 'slack'
    slackConfigs:
    - apiURL:
        key: url
        name: slack-url
      channel: '#alerts'
EOF
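
The slackConfigs receiver above reads the webhook URL from a Secret named slack-url, which must exist in the same namespace (the URL below is a placeholder):

# Create the Secret the receiver references
kubectl create secret generic slack-url -n monitoring \
  --from-literal=url='https://hooks.slack.com/services/XXX/YYY/ZZZ'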

# Check AlertManager config
kubectl get alertmanagerconfig -n monitoring

# View active alerts (port-forward Alertmanager first; the v1 API was removed
# in recent Alertmanager releases, so use v2)
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq .

Recording Rules

# Create recording rules
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
spec:
  groups:
  - name: app.rules
    rules:
    - record: job:http_requests_total:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)
EOF

# Check rules
kubectl get prometheusrule -n monitoring
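
Once Prometheus has evaluated the group, the recorded series can be queried like any other metric:

# Query the recorded series (assumes the port-forward from earlier)
curl -s 'http://localhost:9090/api/v1/query?query=job:http_requests_total:rate5m' | jq .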

Dashboard Management

Grafana Dashboard Management

# Export dashboard
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  http://grafana:3000/api/dashboards/uid/$DASHBOARD_UID > dashboard.json

# Import dashboard
curl -X POST \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @dashboard.json \
  http://grafana:3000/api/dashboards/db

# Create dashboard folder
curl -X POST \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"title":"Production"}' \
  http://grafana:3000/api/folders

Dashboard Provisioning

# Create dashboard provisioning config
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
    - name: 'default'
      orgId: 1
      folder: ''
      type: file
      disableDeletion: false
      editable: true
      options:
        path: /var/lib/grafana/dashboards
EOF
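
With kube-prometheus-stack, Grafana's sidecar can also load dashboards straight from ConfigMaps labeled grafana_dashboard: "1", which avoids mounting a file provider by hand. A minimal sketch with the dashboard JSON abbreviated:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    {"title": "My Dashboard", "panels": []}
EOF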

Monitoring Best Practices

Resource Monitoring

# Monitor Prometheus resource usage
kubectl top pod -n monitoring -l app.kubernetes.io/name=prometheus

# Check Prometheus storage
kubectl get pvc -n monitoring

# Monitor Grafana resources
kubectl top pod -n monitoring -l app.kubernetes.io/name=grafana

# Check metrics scrape status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job:.labels.job, health:.health}'

Performance Optimization

# Check Prometheus query performance
curl -s 'http://localhost:9090/api/v1/query_range?query=rate(prometheus_http_requests_total[5m])&start=1625097600&end=1625184000&step=60' | jq .

# Monitor Prometheus WAL and storage usage
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- df -h /prometheus

# Check Prometheus memory usage
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- ps aux | grep prometheus

Backup and Restore

# Back up Prometheus data (copying a live data dir can be inconsistent; for
# clean backups use the TSDB snapshot API, which needs --web.enable-admin-api)
kubectl cp -n monitoring -c prometheus prometheus-prometheus-kube-prometheus-prometheus-0:/prometheus prometheus-backup

# Back up Grafana (Grafana runs as a Deployment, so resolve the pod name first)
GRAFANA_POD=$(kubectl get pod -n monitoring -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n monitoring $GRAFANA_POD -- tar czf /tmp/grafana-backup.tar.gz /var/lib/grafana

# Copy backup locally
kubectl cp -n monitoring $GRAFANA_POD:/tmp/grafana-backup.tar.gz grafana-backup.tar.gz
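
Restoring is roughly the reverse, reusing the GRAFANA_POD variable from the backup step (restart the pod afterwards so Grafana picks up the restored state):

# Copy the archive back and unpack it over /var/lib/grafana
kubectl cp -n monitoring grafana-backup.tar.gz $GRAFANA_POD:/tmp/grafana-backup.tar.gz
kubectl exec -n monitoring $GRAFANA_POD -- tar xzf /tmp/grafana-backup.tar.gz -C /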

Troubleshooting Commands

# Check Prometheus logs
kubectl logs -f -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus

# Check Grafana logs
kubectl logs -f -n monitoring deploy/prometheus-grafana

# Verify Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'

# Check AlertManager status
curl -s http://localhost:9093/api/v2/status | jq .

# Test recording rules
curl -s 'http://localhost:9090/api/v1/rules' | jq .

# Verify metrics collection
curl -s http://localhost:9090/api/v1/label/__name__/values | jq .

Integration with Other Tools

1. OpenTelemetry Integration

# Install OpenTelemetry Operator (requires cert-manager in the cluster)
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

# Create OpenTelemetry Collector
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: monitoring
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 1000
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [prometheus]
EOF
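
To spot-check that the collector is exporting metrics, port-forward its Prometheus exporter port; the Service name below assumes the operator's <name>-collector naming convention, so adjust if yours differs:

# Verify the exporter endpoint serves metrics
kubectl port-forward -n monitoring svc/otel-collector 8889:8889 &
curl -s localhost:8889/metrics | head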

2. Jaeger Integration

# Install Jaeger Operator (also requires cert-manager)
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.41.0/jaeger-operator.yaml -n observability

# Create Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: monitoring
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
EOF
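
A quick way to confirm the operator reconciled the instance (the production strategy runs separate collector and query pods):

# Verify the Jaeger instance and its components
kubectl get jaeger -n monitoring
kubectl get pods -n monitoring -l app.kubernetes.io/instance=jaeger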

Maintenance Tasks

Daily Checks

# Check monitoring stack health
kubectl get pods -n monitoring
kubectl top pods -n monitoring

# Verify data collection
curl -s "http://localhost:9090/api/v1/query?query=up" | jq .

# Check alert status
curl -s http://localhost:9093/api/v2/alerts | jq .

Weekly Tasks

# Clean up tombstones left by series deletions (retention cleanup happens
# automatically; this endpoint requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones

# Upgrade the monitoring stack (pulls in updated charts and bundled dashboards)
helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring

# Verify backup jobs
kubectl get cronjob -n monitoring

Monthly Tasks

# Audit monitoring coverage
kubectl get servicemonitor,podmonitor -A

# Review resource allocation
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check long-term storage
thanos tools bucket verify --objstore.config-file=bucket.yaml

Additional Resources