Prometheus High Availability in Kubernetes

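High availability (HA) keeps monitoring running when an individual Prometheus instance or node fails, which matters most in production environments. This guide covers strategies for implementing Prometheus HA in Kubernetes: replicated instances, remote storage, load balancing, Alertmanager clustering, service discovery, and backup.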

Architecture Overview

Components for HA Setup

- Multiple Prometheus instances
- Load balancer
- Remote storage
- Alertmanager cluster
- Service discovery

Implementation Methods

1. Basic HA Setup with Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 2
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100Gi
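
With only two replicas, a node drain or cluster upgrade can evict both pods at the same time. A PodDisruptionBudget is one way to guard against that. The sketch below assumes the operator labels the pods with `app.kubernetes.io/name: prometheus` and `app.kubernetes.io/instance: prometheus-ha`; pod labels vary across operator versions, so check them before applying.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-ha
spec:
  # Keep at least one replica running during voluntary disruptions
  minAvailable: 1
  selector:
    matchLabels:
      # Assumed pod labels; verify with `kubectl get pods --show-labels`
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/instance: prometheus-ha
```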

2. Advanced Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 3
  retention: 15d
  serviceAccountName: prometheus
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceMonitorSelector:
    matchLabels:
      prometheus: main
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast
        resources:
          requests:
            storage: 100Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          # Recent operator versions label pods app.kubernetes.io/name rather than app
          - key: app.kubernetes.io/name
            operator: In
            values:
            - prometheus
        topologyKey: kubernetes.io/hostname

Remote Storage Configuration

1. Thanos Integration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  thanos:
    # baseImage is deprecated in the operator API; image takes the full reference
    image: quay.io/thanos/thanos:v0.31.0
    version: v0.31.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config
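
The `objectStorageConfig` selector above points at a Secret that has to exist in the same namespace as the Prometheus resource. A minimal sketch of that Secret for an S3 bucket might look like the following; the bucket name, endpoint, and region are placeholders, and credentials are assumed to come from the pod's IAM role rather than from the file.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
stringData:
  # The key must match objectStorageConfig.key in the Prometheus spec
  thanos.yaml: |
    type: S3
    config:
      bucket: example-thanos-metrics        # hypothetical bucket name
      endpoint: s3.us-east-1.amazonaws.com  # placeholder endpoint
      region: us-east-1
```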

2. Remote Write Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  remoteWrite:
    - url: "http://cortex-distributor:9090/api/v1/push"
      writeRelabelConfigs:
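        # Drop every series scraped from the high-cardinality job.
        # The regex is matched against the value of the job label.
        - sourceLabels: [job]
          regex: 'high-cardinality-metrics'
          action: drop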
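
When several replicas remote-write the same series, the receiving system needs a way to tell the replicas apart and keep only one copy. With Cortex this is typically done through its HA tracker, which by default looks for a `cluster` label and a `__replica__` label. The fragment below is a sketch under that assumption: the cluster name is a placeholder, and the HA tracker is assumed to be enabled on the Cortex side.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 2
  externalLabels:
    cluster: prod-cluster            # hypothetical cluster name
  # Rename the per-replica external label (default: prometheus_replica)
  # to the label name the Cortex HA tracker expects by default
  replicaExternalLabelName: __replica__
  remoteWrite:
    - url: "http://cortex-distributor:9090/api/v1/push"
```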

Load Balancing

1. Service Configuration

apiVersion: v1
kind: Service
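metadata:
  name: prometheus-ha
  labels:
    # Lets a ServiceMonitor select this Service
    app: prometheus
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  ports:
    - name: web
      port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    # Pod label set by recent operator versions; older versions used app: prometheus
    app.kubernetes.io/name: prometheus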

2. Ingress Configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ha
spec:
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-ha
                port:
                  number: 9090

Alertmanager HA

1. Alertmanager Cluster

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager-ha
spec:
  replicas: 3
  alertmanagerConfigSelector:
    matchLabels:
      alertmanager: main
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
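
Each Prometheus replica has to send its alerts to every Alertmanager instance so that the cluster can deduplicate them. With the operator, this is usually wired up through `spec.alerting` on the Prometheus resource, pointing at the `alertmanager-operated` Service that the operator creates. The sketch below assumes both resources live in a namespace called `monitoring`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  alerting:
    alertmanagers:
      - namespace: monitoring         # assumed namespace
        name: alertmanager-operated   # headless Service created by the operator
        port: web
```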

2. Alertmanager Configuration

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-ha-config
  labels:
    alertmanager: main
spec:
  route:
    groupBy: ['job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'slack'
  receivers:
  - name: 'slack'
    slackConfigs:
    - channel: '#alerts'
      apiURL:
        key: slack-url
        name: slack-secret

Service Discovery

1. Kubernetes Service Discovery

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
  - port: http
    interval: 15s
    path: /metrics

2. Custom Service Discovery

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-sd-config
data:
  custom-sd.yaml: |
    - targets:
      - 10.0.0.1:9100
      - 10.0.0.2:9100
      labels:
        env: production
        datacenter: us-east-1
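
The ConfigMap on its own is not read by Prometheus. One way to wire it in, sketched below, is to mount it through `spec.configMaps` (the operator mounts each entry under `/etc/prometheus/configmaps/<name>/`) and reference the file from a `file_sd_configs` job in the additional scrape configuration. The job name `custom-file-sd` is a placeholder.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  # Mounted at /etc/prometheus/configmaps/prometheus-sd-config/
  configMaps:
    - prometheus-sd-config
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs
stringData:
  prometheus-additional.yaml: |
    - job_name: custom-file-sd   # hypothetical job name
      file_sd_configs:
        - files:
            - /etc/prometheus/configmaps/prometheus-sd-config/custom-sd.yaml
```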

Backup and Recovery

1. Snapshot Configuration

apiVersion: batch/v1
kind: CronJob
metadata:
  name: prometheus-backup
spec:
  schedule: "0 1 * * *"
  jobTemplate:
    spec:
      template:
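        spec:
          # Job pods must use OnFailure or Never
          restartPolicy: OnFailure
          containers:
          - name: prometheus-backup
            image: curlimages/curl
            command:
            - /bin/sh
            - -c
            - |
              # Requires the TSDB admin API (--web.enable-admin-api)
              curl -X POST http://prometheus-ha:9090/api/v1/admin/tsdb/snapshot
              # Copy snapshot to backup storage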
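
The snapshot endpoint is part of the TSDB admin API, which is disabled by default. With the operator it can be switched on through `enableAdminAPI`, as in the fragment below. The admin API also allows deleting series, so access to it should be restricted. The snapshot itself is written under `snapshots/` inside the data directory of whichever replica served the request.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  # Equivalent to passing --web.enable-admin-api to Prometheus
  enableAdminAPI: true
```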

2. Recovery Process

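# Prometheus has no snapshot-restore API endpoint. To restore, stop the
# instance, copy the snapshot's block directories into its data directory
# (assumed here to be /prometheus), and start the instance again.
cp -r /path/to/snapshot/* /prometheus/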

Monitoring the HA Setup

1. Prometheus Self-monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-self
spec:
  selector:
    matchLabels:
      app: prometheus
  endpoints:
  - port: web

2. Grafana Dashboard

{
  "dashboard": {
    "panels": [
      {
        "title": "Prometheus Replica Status",
        "targets": [
          {
            "expr": "up{job=\"prometheus\"}"
          }
        ]
      },
      {
        "title": "TSDB Head Series",
        "targets": [
          {
            "expr": "prometheus_tsdb_head_series"
          }
        ]
      }
    ]
  }
}

Best Practices

Resource Management

- Set appropriate resource requests and limits
- Use pod anti-affinity to spread replicas across nodes
- Choose a storage class that matches the write load

Monitoring and Alerting

- Monitor Prometheus itself
- Set up alerts for replica failures
- Track storage usage and growth

Backup and Recovery

- Create snapshots regularly
- Verify backups
- Document recovery procedures

Security

- Enable TLS encryption
- Implement proper RBAC
- Apply security updates regularly
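
The alerting items above can be sketched as a PrometheusRule. This is an illustration under several assumptions: the `prometheus: main` label must match the `ruleSelector` of the Prometheus resource, the `job="prometheus"` matcher follows the dashboard query earlier in this guide, and the PVC name pattern and thresholds are placeholders to adjust.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-ha-rules
  labels:
    prometheus: main   # assumed to match spec.ruleSelector
spec:
  groups:
    - name: prometheus-ha
      rules:
        # Fires on the surviving replica when another replica stops responding
        - alert: PrometheusReplicaDown
          expr: up{job="prometheus"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Prometheus replica {{ $labels.instance }} is down"
        # Fires when less than 10% of a Prometheus volume is free
        - alert: PrometheusStorageAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*prometheus-ha.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*prometheus-ha.*"}
              < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Volume {{ $labels.persistentvolumeclaim }} is almost full"
```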
