Prometheus High Availability in Kubernetes

Configure Prometheus for high availability and reliability in production environments

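High availability (HA) keeps monitoring and alerting running when a Prometheus instance, a node, or a zone fails. This guide covers strategies for running Prometheus in an HA configuration on Kubernetes: replicated instances, remote storage, load balancing, Alertmanager clustering, service discovery, and backups.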

Architecture Overview

Components for HA Setup

  1. Multiple Prometheus instances
  2. Load balancer
  3. Remote storage
  4. Alertmanager cluster
  5. Service discovery

Implementation Methods

1. Basic HA Setup with Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 2
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100Gi
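
Replicas only help if they are not all evicted at the same time. As a minimal sketch, a PodDisruptionBudget can keep at least one replica running during voluntary disruptions such as node drains. The resource name and the pod label in the selector are assumptions, not part of the setup above; check the labels on your pods with `kubectl get pods --show-labels` and adjust the selector to match.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prometheus-ha-pdb   # hypothetical name
spec:
  minAvailable: 1           # keep at least one replica up during node drains
  selector:
    matchLabels:
      # Assumption: label carried by the Prometheus pods; verify before use.
      app.kubernetes.io/name: prometheus
```

With `replicas: 2`, this lets one pod be evicted at a time while the other keeps scraping.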

2. Advanced Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 3
  retention: 15d
  serviceAccountName: prometheus
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceMonitorSelector:
    matchLabels:
      prometheus: main
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast
        resources:
          requests:
            storage: 100Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
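          - key: app.kubernetes.io/name  # pod label set by recent Prometheus Operator releases; older releases used `app`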
            operator: In
            values:
            - prometheus
        topologyKey: kubernetes.io/hostname

Remote Storage Configuration

1. Thanos Integration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  thanos:
    image: quay.io/thanos/thanos:v0.31.0  # `baseImage` is deprecated in favor of `image`
    version: v0.31.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config
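
The `objectStorageConfig` field points at a key in a Secret that holds the Thanos object storage configuration. A minimal sketch of such a Secret for an S3-compatible bucket might look like the following. The bucket name, endpoint, and credential placeholders are assumptions for illustration only.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
stringData:
  # The key name must match `objectStorageConfig.key`.
  thanos.yaml: |
    type: S3
    config:
      bucket: prometheus-long-term          # hypothetical bucket name
      endpoint: s3.us-east-1.amazonaws.com  # hypothetical endpoint
      access_key: <ACCESS_KEY>              # placeholder
      secret_key: <SECRET_KEY>              # placeholder
```

The Secret must live in the same namespace as the Prometheus resource.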

2. Remote Write Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  remoteWrite:
    - url: "http://cortex-distributor:9090/api/v1/push"
      writeRelabelConfigs:
        # Relabel regexes match label values, not PromQL selectors.
        - sourceLabels: [job]
          regex: 'high-cardinality-metrics'
          action: drop
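
When several replicas write the same series to remote storage, the backend needs a way to tell the replicas apart and deduplicate their samples. The sketch below assumes a Cortex distributor with its HA tracker enabled and left at its default label names (`cluster` and `__replica__`). The cluster value is hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 2
  externalLabels:
    cluster: prod-us-east                  # hypothetical; same value on every replica of this HA group
  replicaExternalLabelName: __replica__    # the operator sets this label to the pod name on each replica
  remoteWrite:
    - url: "http://cortex-distributor:9090/api/v1/push"
```

If `replicaExternalLabelName` is left unset, the operator uses `prometheus_replica`, so the backend would have to be configured for that name instead.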

Load Balancing

1. Service Configuration

apiVersion: v1
kind: Service
metadata:
  name: prometheus-ha
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
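    app.kubernetes.io/name: prometheus  # pod label set by recent Prometheus Operator releases; older releases used `app`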
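
Each replica scrapes independently, so two replicas hold slightly different samples and a graph can shift when consecutive queries land on different replicas. One possible mitigation for in-cluster clients such as Grafana is session affinity, sketched below. The Service name and the selector label are assumptions; the selector must match the labels on your Prometheus pods.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-ha-sticky   # hypothetical name
spec:
  type: ClusterIP
  sessionAffinity: ClientIP    # route a given client IP to the same replica
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800    # Kubernetes default (3 hours)
  ports:
    - port: 9090
      targetPort: 9090
      protocol: TCP
  selector:
    # Assumption: label carried by the Prometheus pods; verify before use.
    app.kubernetes.io/name: prometheus
```

Session affinity only hides the differences between replicas. A deduplicating query layer such as Thanos Query is the more complete answer.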

2. Ingress Configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus-ha
spec:
  rules:
    - host: prometheus.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: prometheus-ha
                port:
                  number: 9090

Alertmanager HA

1. Alertmanager Cluster

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager-ha
spec:
  replicas: 3
  alertmanagerConfigSelector:
    matchLabels:
      alertmanager: main
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
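
Every Prometheus replica should send its alerts to every Alertmanager instance, and the Alertmanager cluster then deduplicates the notifications. A minimal sketch of the Prometheus side follows. It assumes both resources run in a namespace called `monitoring` (hypothetical) and uses the `alertmanager-operated` Service that the Prometheus Operator creates for Alertmanager pods.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  alerting:
    alertmanagers:
      - namespace: monitoring        # assumption: namespace of the Alertmanager resource
        name: alertmanager-operated  # Service created by the operator for Alertmanager pods
        port: web                    # named port on that Service
```

Prometheus discovers all endpoints behind the Service, so no load balancer should sit between Prometheus and Alertmanager.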

2. Alertmanager Configuration

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-ha-config
  labels:
    alertmanager: main
spec:
  route:
    groupBy: ['job']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'slack'
  receivers:
  - name: 'slack'
    slackConfigs:
    - channel: '#alerts'
      apiURL:
        key: slack-url
        name: slack-secret
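
The `apiURL` field refers to a key in a Secret rather than holding the webhook URL directly. A sketch of the matching Secret is shown here; the webhook URL is a placeholder, and the Secret is assumed to live in the same namespace as the AlertmanagerConfig.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: slack-secret
stringData:
  # Key name must match `apiURL.key`; replace the placeholder with a real webhook URL.
  slack-url: https://hooks.slack.com/services/<YOUR>/<WEBHOOK>/<PATH>
```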

Service Discovery

1. Kubernetes Service Discovery

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service-monitor
  labels:
    prometheus: main  # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
  - port: http
    interval: 15s
    path: /metrics
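
A ServiceMonitor selects Services by label and endpoints by port name, so the target Service needs both. A hypothetical Service that this ServiceMonitor would pick up could look like the following; the Service name, port number, and pod label are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api             # hypothetical name
  labels:
    app: api            # matched by the ServiceMonitor's `selector.matchLabels`
spec:
  selector:
    app: api            # assumption: the application pods carry this label
  ports:
    - name: http        # matched by the endpoint's `port: http`
      port: 8080        # hypothetical port
      targetPort: 8080
```

By default a ServiceMonitor only looks at Services in its own namespace unless `namespaceSelector` is set.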

2. Custom Service Discovery

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-sd-config
data:
  custom-sd.yaml: |
    - targets:
      - 10.0.0.1:9100
      - 10.0.0.2:9100
      labels:
        env: production
        datacenter: us-east-1
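
The ConfigMap alone does nothing until Prometheus mounts it and a scrape job reads it. The sketch below assumes the Prometheus Operator, which mounts ConfigMaps listed under `spec.configMaps` at `/etc/prometheus/configmaps/<name>`, and reuses the `additional-scrape-configs` Secret referenced in the advanced configuration. The job name is hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  configMaps:
    - prometheus-sd-config   # mounted at /etc/prometheus/configmaps/prometheus-sd-config
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs
stringData:
  prometheus-additional.yaml: |
    - job_name: custom-file-sd   # hypothetical job name
      file_sd_configs:
        - files:
            - /etc/prometheus/configmaps/prometheus-sd-config/custom-sd.yaml
          refresh_interval: 5m   # how often the file is re-read
```

Updating the ConfigMap changes the target list without restarting Prometheus, because file-based discovery re-reads the file.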

Backup and Recovery

1. Snapshot Configuration

apiVersion: batch/v1
kind: CronJob
metadata:
  name: prometheus-backup
spec:
  schedule: "0 1 * * *"
  jobTemplate:
    spec:
      template:
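        spec:
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never
          containers:
          - name: prometheus-backup
            image: curlimages/curl
            command:
            - /bin/sh
            - -c
            - |
              # Requires the TSDB admin API (--web.enable-admin-api).
              # The snapshot is written under <data-dir>/snapshots/ on the replica that serves the request.
              curl -fsS -X POST http://prometheus-ha:9090/api/v1/admin/tsdb/snapshot
              # Copy snapshot to backup storage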
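
The snapshot endpoint belongs to Prometheus's TSDB admin API, which is disabled by default. With the Prometheus Operator it can be switched on in the Prometheus resource, as in this minimal sketch:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  enableAdminAPI: true   # also exposes delete endpoints, so restrict network access to port 9090
```

Because the admin API can delete data, it is worth pairing this setting with a NetworkPolicy or an authenticating proxy.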

2. Recovery Process

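# Restore from snapshot
# Prometheus has no restore API. Stop Prometheus, copy the snapshot's contents
# into an empty data directory (--storage.tsdb.path), then start Prometheus again.
cp -r /path/to/snapshot/. /path/to/prometheus-data/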

Monitoring the HA Setup

1. Prometheus Self-monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-self
  labels:
    prometheus: main  # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: prometheus
  endpoints:
  - port: web

2. Grafana Dashboard

{
  "dashboard": {
    "panels": [
      {
        "title": "Prometheus Replica Status",
        "targets": [
          {
            "expr": "up{job=\"prometheus\"}"
          }
        ]
      },
      {
        "title": "TSDB Head Series",
        "targets": [
          {
            "expr": "prometheus_tsdb_head_series"
          }
        ]
      }
    ]
  }
}
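
The expressions used in the dashboard can also drive alerts. As a sketch, a PrometheusRule that fires when a replica stops responding might look like this. The rule name, the `prometheus: main` label (which must match the Prometheus resource's `ruleSelector`), the severity, and the `for` duration are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-ha-rules   # hypothetical name
  labels:
    prometheus: main          # assumption: must match the Prometheus `ruleSelector`
spec:
  groups:
    - name: prometheus-ha
      rules:
        - alert: PrometheusReplicaDown
          expr: up{job="prometheus"} == 0
          for: 5m             # tolerate short restarts before firing
          labels:
            severity: critical
          annotations:
            summary: "Prometheus replica {{ $labels.instance }} is down"
```

A replica that is down cannot evaluate its own rule, so this alert relies on the surviving replicas scraping each other.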

Best Practices

  1. Resource Management

    • Set appropriate resource requests and limits
    • Use pod anti-affinity to spread replicas across nodes
    • Choose a storage class that meets Prometheus's disk throughput and latency needs
  2. Monitoring and Alerting

    • Monitor Prometheus itself
    • Set up alerts for replica failures
    • Track storage usage and growth
  3. Backup and Recovery

    • Regular snapshot creation
    • Backup verification
    • Documented recovery procedures
  4. Security

    • Enable TLS encryption
    • Implement proper RBAC
    • Regular security updates
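
As an illustration of the resource management advice, requests and limits can be set directly on the Prometheus resource. The values below are placeholders, not recommendations; size them from observed memory use and series count.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-ha
spec:
  replicas: 2
  resources:
    requests:
      cpu: "1"        # placeholder value
      memory: 4Gi     # placeholder value
    limits:
      memory: 8Gi     # placeholder value; exceeding it gets the container OOM-killed
```

Leaving out a CPU limit avoids throttling during heavy queries, at the cost of less predictable node load.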
