Prometheus High Availability in Kubernetes
Configure Prometheus for high availability and reliability in production environments
High Availability (HA) is crucial for maintaining continuous monitoring in production environments. This guide covers various strategies for implementing Prometheus HA in Kubernetes.
Architecture Overview
Components for HA Setup
- Multiple Prometheus instances
- Load balancer
- Remote storage
- Alertmanager cluster
- Service discovery
Implementation Methods
1. Basic HA Setup with Prometheus Operator
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 2
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100Gi
```
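Multiple replicas scrape the same targets, so each series is collected more than once. The Prometheus CRD exposes `replicaExternalLabelName` and `externalLabels` so that downstream systems (Thanos, Cortex, a federated Prometheus) can tell the replicas apart and deduplicate. A minimal sketch; the `cluster` label value is an illustrative assumption:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 2
  # Each replica is stamped with a unique value for this label,
  # which deduplicating readers strip when merging series.
  replicaExternalLabelName: prometheus_replica
  externalLabels:
    cluster: production-us-east-1  # illustrative value
```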
2. Advanced Configuration
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 3
  retention: 15d
  serviceAccountName: prometheus
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceMonitorSelector:
    matchLabels:
      prometheus: main
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast
        resources:
          requests:
            storage: 100Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - prometheus
          topologyKey: kubernetes.io/hostname
```
Remote Storage Configuration
1. Thanos Integration
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  thanos:
    baseImage: quay.io/thanos/thanos
    version: v0.31.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config
```
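The `objectStorageConfig` above points at a key in a Secret that holds the Thanos object storage client configuration. A minimal S3-style `thanos.yaml` might look like the following; the bucket, endpoint, and credential values are placeholders:

```yaml
# Contents of the thanos.yaml key in the thanos-objstore-config Secret.
# Bucket, endpoint, and credentials below are illustrative placeholders.
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
  access_key: <access-key>
  secret_key: <secret-key>
```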
2. Remote Write Configuration
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  remoteWrite:
    - url: "http://cortex-distributor:9090/api/v1/push"
      writeRelabelConfigs:
        # Relabel rules match label values, not PromQL selectors:
        # drop all series scraped by the high-cardinality job.
        - sourceLabels: [job]
          regex: 'high-cardinality-metrics'
          action: drop
```
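Remote write throughput is governed by a per-endpoint queue, which the CRD exposes as `queueConfig`. A sketch of the tuning knobs; the values below are illustrative starting points, not recommendations for any particular workload:

```yaml
# Sketch: remote-write queue tuning on the same endpoint.
# All values are illustrative, not tuned defaults.
spec:
  remoteWrite:
    - url: "http://cortex-distributor:9090/api/v1/push"
      queueConfig:
        capacity: 10000          # samples buffered per shard
        maxShards: 30            # upper bound on parallel senders
        maxSamplesPerSend: 3000  # batch size per request
        batchSendDeadline: 5s    # flush partial batches after this long
```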
Load Balancing
1. Service Configuration
```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-ha
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    app: prometheus
```
2. Ingress Configuration
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ha
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-ha
                port:
                  number: 9090
```
Alertmanager HA
1. Alertmanager Cluster
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager-ha
spec:
  replicas: 3
  alertmanagerConfigSelector:
    matchLabels:
      alertmanager: main
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
```
2. Alertmanager Configuration
```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-ha-config
  labels:
    alertmanager: main
spec:
  route:
    groupBy: ['job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'slack'
  receivers:
    - name: 'slack'
      slackConfigs:
        - channel: '#alerts'
          apiURL:
            key: slack-url
            name: slack-secret
```
Service Discovery
1. Kubernetes Service Discovery
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
    - port: http
      interval: 15s
      path: /metrics
```
2. Custom Service Discovery
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-sd-config
data:
  custom-sd.yaml: |
    - targets:
        - 10.0.0.1:9100
        - 10.0.0.2:9100
      labels:
        env: production
        datacenter: us-east-1
```
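The ConfigMap only stores the target list; Prometheus consumes it through a `file_sd_configs` scrape job, for example in the additional scrape configuration Secret shown earlier. A sketch, assuming the ConfigMap is mounted into the Prometheus pod at `/etc/prometheus/custom-sd` (the mount path and job name are illustrative):

```yaml
# Entry in prometheus-additional.yaml (see additionalScrapeConfigs above).
# The mount path of the ConfigMap is an assumption for illustration.
- job_name: custom-file-sd
  file_sd_configs:
    - files:
        - /etc/prometheus/custom-sd/custom-sd.yaml
      refresh_interval: 5m  # re-read the file periodically
```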
Backup and Recovery
1. Snapshot Configuration
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prometheus-backup
spec:
  schedule: "0 1 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: prometheus-backup
              image: curlimages/curl
              command:
                - /bin/sh
                - -c
                - |
                  # Requires Prometheus to run with --web.enable-admin-api
                  curl -X POST http://prometheus-ha:9090/api/v1/admin/tsdb/snapshot
                  # Copy snapshot to backup storage
```
2. Recovery Process

Prometheus has no restore API. To recover from a snapshot, stop the server, copy the snapshot contents over the TSDB data directory, and restart:

```shell
# Snapshots are written under <data-dir>/snapshots/<snapshot-name>/.
# With Prometheus stopped, replace the data directory contents:
cp -r /prometheus/snapshots/<snapshot-name>/. /prometheus/
# Then restart Prometheus so it reads the restored blocks.
```
Monitoring the HA Setup
1. Prometheus Self-monitoring
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-self
spec:
  selector:
    matchLabels:
      app: prometheus
  endpoints:
    - port: web
```
2. Grafana Dashboard
```json
{
  "dashboard": {
    "panels": [
      {
        "title": "Prometheus Replica Status",
        "targets": [
          { "expr": "up{job=\"prometheus\"}" }
        ]
      },
      {
        "title": "TSDB Head Series",
        "targets": [
          { "expr": "prometheus_tsdb_head_series" }
        ]
      }
    ]
  }
}
```
Best Practices
- Resource Management
  - Set appropriate resource requests and limits
  - Use pod anti-affinity to spread replicas across nodes
  - Choose a storage class suited to retention and write load
- Monitoring and Alerting
  - Monitor Prometheus itself
  - Set up alerts for replica failures
  - Track storage usage and growth
- Backup and Recovery
  - Take snapshots on a regular schedule
  - Verify that backups are restorable
  - Document the recovery procedure
- Security
  - Enable TLS encryption
  - Implement proper RBAC
  - Apply security updates regularly
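The "alerts for replica failures" practice can be expressed as a PrometheusRule, which the operator loads into the Prometheus instances it manages. A sketch; the threshold, labels, and rule names are illustrative:

```yaml
# Sketch: alert when fewer than 2 replicas are healthy.
# Rule names, labels, and the threshold are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-ha-rules
  labels:
    prometheus: main
spec:
  groups:
    - name: prometheus-ha
      rules:
        - alert: PrometheusReplicaDown
          expr: count(up{job="prometheus"} == 1) < 2
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Fewer than 2 Prometheus replicas are healthy
```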