Kubernetes Monitoring - Setting Up Prometheus and Grafana
A detailed guide on implementing enterprise-grade monitoring in Kubernetes using Prometheus and Grafana, with practical examples and best practices.
Introduction
Setting up robust monitoring in Kubernetes is crucial for maintaining healthy clusters and applications. This guide covers the implementation of Prometheus and Grafana, along with related tools from the CNCF and Grafana ecosystems, for comprehensive monitoring.
The Monitoring Stack
1. Prometheus (CNCF Graduated)
Core monitoring and alerting toolkit:
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus Stack (includes Grafana)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=15d \
--set prometheus.prometheusSpec.scrapeInterval=30s
# Verify installation
kubectl get pods -n monitoring
kubectl get svc -n monitoring
# Port forward Prometheus (run in a separate terminal, or background it with &)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# Check Prometheus targets
curl localhost:9090/api/v1/targets | jq .
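The --set flags above work for a quick start, but the same settings are easier to maintain in a values file kept in version control. A minimal sketch using the chart's prometheusSpec fields (the resource requests, storage class, and size are assumptions to adapt):
# values.yaml for kube-prometheus-stack (sketch; adjust sizes and storageClassName)
prometheus:
  prometheusSpec:
    retention: 15d
    scrapeInterval: 30s
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard
          resources:
            requests:
              storage: 50Gi
# Apply the values file
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f values.yaml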
2. Grafana
Visualization and dashboarding (maintained by Grafana Labs rather than the CNCF, but the de facto frontend for Prometheus):
# Access Grafana (if installed with prometheus-stack)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Get Grafana admin password
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
# Import dashboards via API
curl -X POST \
-H "Content-Type: application/json" \
-d '{"dashboard": {...}, "overwrite": true}' \
http://admin:password@localhost:3000/api/dashboards/db
# Check Grafana health
kubectl exec -n monitoring deploy/prometheus-grafana -c grafana -- curl -s localhost:3000/api/health
3. Thanos (CNCF Incubating)
For long-term storage and high availability:
# Install Thanos
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install thanos bitnami/thanos \
  --namespace monitoring \
  --set-file objstoreConfig=objstore.yml   # object-store settings; see the objstore.yml sketch below
# Enable the Thanos sidecar on the operator-managed Prometheus
# (the CR name below corresponds to a kube-prometheus-stack release named "prometheus")
kubectl patch prometheus prometheus-kube-prometheus-prometheus -n monitoring --type=merge \
  -p '{"spec":{"thanos":{"image":"quay.io/thanos/thanos:v0.31.0"}}}'
# Query metrics via Thanos
thanos query \
--http-address=0.0.0.0:9090 \
--store=thanos-sidecar.monitoring.svc.cluster.local:10901
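The --set-file flag above reads the object-store settings from a local file. A minimal S3 sketch reusing the bucket and endpoint from the original example (the credentials are placeholders to replace):
# objstore.yml - Thanos object storage configuration (S3 sketch)
type: S3
config:
  bucket: thanos
  endpoint: s3.amazonaws.com
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>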
4. Loki
Log aggregation system (also a Grafana Labs project rather than CNCF, queried natively from Grafana):
# Install Loki stack
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace monitoring \
--set grafana.enabled=false \
--set prometheus.enabled=false
# Add Loki datasource to Grafana
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # picked up by the Grafana datasource sidecar in kube-prometheus-stack
data:
  loki.yaml: |
    apiVersion: 1
    datasources:
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      version: 1
EOF
# Query logs
logcli query '{app="nginx"}' --addr="http://loki:3100"
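logcli can also target a local port-forward, and LogQL supports line filters and range aggregations; a couple of hypothetical queries against the same nginx label:
# Port-forward Loki when querying from outside the cluster
kubectl port-forward -n monitoring svc/loki 3100:3100 &
# Error lines from nginx pods over the last hour
logcli query '{app="nginx"} |= "error"' --since=1h --addr="http://localhost:3100"
# Log volume per stream in 5-minute windows
logcli query 'count_over_time({app="nginx"}[5m])' --addr="http://localhost:3100"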
Monitoring Setup Commands
Basic Monitoring Setup
# Create monitoring namespace
kubectl create namespace monitoring
# Label namespace for monitoring
kubectl label namespace monitoring monitoring=enabled
# Create ServiceMonitor for your app
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  namespace: monitoring
  labels:
    release: prometheus   # required so the kube-prometheus-stack Prometheus selects this ServiceMonitor
spec:
  selector:
    matchLabels:
      app: your-app
  endpoints:
  - port: metrics
EOF
# Check ServiceMonitor status
kubectl get servicemonitor -n monitoring
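The ServiceMonitor only scrapes something if a Service in the same namespace carries the app: your-app label and exposes a port named metrics. A hypothetical matching Service (the name and port number are placeholders):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: your-app
  namespace: monitoring
  labels:
    app: your-app            # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: your-app            # pods backing this Service
  ports:
  - name: metrics            # must match the endpoint port name in the ServiceMonitor
    port: 8080
    targetPort: 8080
EOF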
Alert Management
# Create AlertManager config
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  route:
    receiver: 'slack'
  receivers:
  - name: 'slack'
    slackConfigs:
    - apiURL:
        name: slack-url   # Secret in this namespace holding the webhook URL
        key: url
      channel: '#alerts'
EOF
# Check AlertManager config
kubectl get alertmanagerconfig -n monitoring
# View active alerts (port-forward Alertmanager first, then use the v2 API)
kubectl port-forward -n monitoring svc/alertmanager-operated 9093:9093 &
curl -s http://localhost:9093/api/v2/alerts | jq .
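The AlertmanagerConfig only routes notifications; the alerts themselves come from PrometheusRule objects. A hypothetical rule that fires when a scrape target disappears:
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alert-rules
  namespace: monitoring
  labels:
    release: prometheus      # match the Prometheus ruleSelector
spec:
  groups:
  - name: app.alerts
    rules:
    - alert: TargetDown
      expr: up{job="your-app"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "The your-app scrape target has been down for 5 minutes"
EOF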
Recording Rules
# Create recording rules
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-rules
  namespace: monitoring
  labels:
    release: prometheus   # match the Prometheus ruleSelector from kube-prometheus-stack
spec:
  groups:
  - name: app.rules
    rules:
    - record: job:http_requests_total:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)
EOF
# Check rules
kubectl get prometheusrule -n monitoring
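Once the group has been evaluated, the recorded series can be queried like any other metric:
# Query the recorded series
curl -s 'http://localhost:9090/api/v1/query?query=job:http_requests_total:rate5m' | jq .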
Dashboard Management
Grafana Dashboard Management
# Export dashboard
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
http://grafana:3000/api/dashboards/uid/$DASHBOARD_UID > dashboard.json
# Import dashboard
curl -X POST \
-H "Authorization: Bearer $GRAFANA_API_KEY" \
-H "Content-Type: application/json" \
-d @dashboard.json \
http://grafana:3000/api/dashboards/db
# Create dashboard folder
curl -X POST \
-H "Authorization: Bearer $GRAFANA_API_KEY" \
-H "Content-Type: application/json" \
-d '{"title":"Production"}' \
http://grafana:3000/api/folders
Dashboard Provisioning
# Create dashboard provisioning config
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
data:
  dashboards.yaml: |
    apiVersion: 1
    providers:
    - name: 'default'
      orgId: 1
      folder: ''
      type: file
      disableDeletion: false
      editable: true
      options:
        path: /var/lib/grafana/dashboards
EOF
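With kube-prometheus-stack, Grafana also runs a sidecar that auto-loads any ConfigMap labeled grafana_dashboard, which avoids mounting provider configs by hand. A minimal hypothetical example (the dashboard JSON is a bare skeleton to replace with a real export):
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # default label watched by the Grafana dashboard sidecar
data:
  app-dashboard.json: |-
    {
      "title": "App Overview",
      "uid": "app-overview",
      "panels": []
    }
EOF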
Monitoring Best Practices
Resource Monitoring
# Monitor Prometheus resource usage
kubectl top pod -n monitoring -l app.kubernetes.io/name=prometheus
# Check Prometheus storage
kubectl get pvc -n monitoring
# Monitor Grafana resources
kubectl top pod -n monitoring -l app.kubernetes.io/name=grafana
# Check metrics scrape status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job:.labels.job, health:.health}'
Performance Optimization
# Check Prometheus query performance (range query over the last hour at 60s resolution; uses GNU date)
curl -s "http://localhost:9090/api/v1/query_range?query=rate(prometheus_http_requests_total[5m])&start=$(date -d '1 hour ago' +%s)&end=$(date +%s)&step=60" | jq .
# Monitor Prometheus WAL and local storage usage (pod name matches a release called "prometheus")
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- df -h /prometheus
# Check Prometheus memory usage
kubectl top pod -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0
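Prometheus also exports metrics about its own TSDB, which are usually more telling than process-level checks; two useful ones, queried over the existing port-forward:
# Active series in the head block (a key driver of memory usage)
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series' | jq .
# Ingestion rate: samples appended per second over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query?query=rate(prometheus_tsdb_head_samples_appended_total[5m])' | jq .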
Backup and Restore
# Backup Prometheus data (pod name matches a kube-prometheus-stack release called "prometheus")
kubectl cp -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0:/prometheus ./prometheus-backup -c prometheus
# Backup Grafana (Grafana runs as a Deployment, so resolve the pod name first)
GRAFANA_POD=$(kubectl get pod -n monitoring -l app.kubernetes.io/name=grafana -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n monitoring "$GRAFANA_POD" -c grafana -- tar czf /tmp/grafana-backup.tar.gz /var/lib/grafana
# Copy backup locally
kubectl cp -n monitoring "$GRAFANA_POD":/tmp/grafana-backup.tar.gz grafana-backup.tar.gz -c grafana
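For the Prometheus data specifically, copying the live directory can catch a block mid-write; if the admin API is enabled (prometheus.prometheusSpec.enableAdminAPI=true in the chart, which turns on --web.enable-admin-api), a TSDB snapshot is the safer path:
# Take a consistent TSDB snapshot (written under /prometheus/snapshots/ inside the pod)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Copy the snapshots out of the pod
kubectl cp -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0:/prometheus/snapshots ./prometheus-snapshots -c prometheus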
Troubleshooting Commands
# Check Prometheus logs
kubectl logs -f -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus
# Check Grafana logs
kubectl logs -f -n monitoring deploy/prometheus-grafana -c grafana
# Verify Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health=="down")'
# Check AlertManager status
curl -s http://localhost:9093/api/v2/status | jq .
# Test recording rules
curl -s 'http://localhost:9090/api/v1/rules' | jq .
# Verify metrics collection
curl -s http://localhost:9090/api/v1/label/__name__/values | jq .
Integration with Other Tools
1. OpenTelemetry Integration
# Install OpenTelemetry Operator (requires cert-manager to be present in the cluster)
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
# Create OpenTelemetry Collector
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: monitoring
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s   # required by the memory_limiter processor
        limit_mib: 1000
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [prometheus]
EOF
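Applications then ship telemetry to the collector over OTLP; with the standard OpenTelemetry SDK environment variables, that is just an endpoint setting on the workload. The Service name otel-collector is the one the operator derives from the CR named otel above:
# Fragment of an application Deployment spec: point the OTel SDK at the collector
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://otel-collector.monitoring.svc.cluster.local:4317"   # OTLP gRPC port
- name: OTEL_SERVICE_NAME
  value: "your-app"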
2. Jaeger Integration
# Install Jaeger Operator (also depends on cert-manager)
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.41.0/jaeger-operator.yaml -n observability
# Create Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: monitoring
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
EOF
Maintenance Tasks
Daily Checks
# Check monitoring stack health
kubectl get pods -n monitoring
kubectl top pods -n monitoring
# Verify data collection
curl -s "http://localhost:9090/api/v1/query?query=up" | jq .
# Check alert status
curl -s http://localhost:9093/api/v2/alerts | jq .
Weekly Tasks
# Old blocks are pruned automatically per the retention setting; tombstones left by manual
# deletions can be cleaned via the admin API (requires --web.enable-admin-api)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones
# Upgrade the monitoring stack (picks up updated charts and bundled dashboards)
helm upgrade prometheus prometheus-community/kube-prometheus-stack -n monitoring
# Verify backup jobs
kubectl get cronjob -n monitoring
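If no backup CronJob exists yet, a minimal sketch that triggers a nightly Prometheus snapshot via the admin API (assumes the admin API is enabled and the prometheus-operated Service from earlier; copying snapshots off-cluster is left out):
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prometheus-snapshot
  namespace: monitoring
spec:
  schedule: "0 2 * * *"              # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: snapshot
            image: curlimages/curl:8.8.0
            args: ["-X", "POST", "http://prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/admin/tsdb/snapshot"]
EOF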
Monthly Tasks
# Audit monitoring coverage
kubectl get servicemonitor,podmonitor -A
# Review resource allocation
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check long-term storage
thanos tools bucket verify --objstore.config-file=bucket.yaml