Prometheus AlertManager in Kubernetes

This guide provides detailed instructions for setting up and configuring AlertManager in Kubernetes, enabling sophisticated alert handling and notification routing.

What is AlertManager?

AlertManager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or Slack.
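
For Prometheus to hand alerts off to AlertManager, its configuration must point at the AlertManager service. A minimal sketch, assuming AlertManager is reachable as a Service named alertmanager in the monitoring namespace on port 9093 (with the Prometheus Operator, the same thing is expressed through the Prometheus resource's spec.alerting field):

# prometheus.yml (snippet)
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager.monitoring.svc:9093']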

Installation

1. Using Helm

# Add Prometheus repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install AlertManager
helm install alertmanager prometheus-community/alertmanager \
  --namespace monitoring \
  --create-namespace
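
To confirm the release came up, check the pods and open the UI; the label selector and Service name below follow the chart's defaults and may differ if you change the release name:

# Verify the deployment
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager

# Reach the UI locally
kubectl port-forward -n monitoring svc/alertmanager 9093:9093
# then browse to http://localhost:9093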

2. Using the Prometheus Operator

The manifest below defines an Alertmanager custom resource, so the Prometheus Operator and its CRDs must already be installed in the cluster.

# alertmanager.yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3
  configSecret: alertmanager-config
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'password'
      slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'

    route:
      group_by: ['job', 'alertname', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
      - match:
          severity: critical
        receiver: 'pagerduty-critical'
      - match:
          severity: warning
        receiver: 'slack-notifications'
      - match:
          severity: info
        receiver: 'email-notifications'

    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            *Severity:* {{ .Labels.severity }}
            *Start:* {{ .StartsAt }}
          {{ end }}

    - name: 'pagerduty-critical'
      pagerduty_configs:
      - service_key: '<your-pagerduty-service-key>'
        send_resolved: true

    - name: 'email-notifications'
      email_configs:
      - to: 'team@example.com'
        send_resolved: true

    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'instance']
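
Apply the manifest and let the operator reconcile it; the commands below assume the resource names used above:

kubectl apply -f alertmanager.yaml

# The operator creates a StatefulSet named alertmanager-<name>
kubectl get alertmanagers,statefulsets,pods -n monitoring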

Alert Rules Configuration

1. Basic Alert Rules

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubernetes
    rules:
    - alert: HighCPUUsage
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: High CPU usage on {{ $labels.instance }}
        description: CPU usage is above 80% for 5 minutes

    - alert: HighMemoryUsage
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: High memory usage on {{ $labels.instance }}
        description: Memory usage is above 85% for 5 minutes

    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping
        description: Pod has restarted several times in the last 15 minutes
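
The PromQL in a PrometheusRule can be linted before it ever reaches Prometheus. A rough sketch, assuming yq (v4) is installed; the .spec of the resource is already in the plain rule-file format that promtool expects:

# Extract the rule groups and lint them
kubectl get prometheusrule kubernetes-alerts -n monitoring -o yaml | yq '.spec' > kubernetes-alerts-rules.yaml
promtool check rules kubernetes-alerts-rules.yaml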

2. Advanced Alert Rules

# advanced-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: advanced-alerts
  namespace: monitoring
spec:
  groups:
  - name: advanced
    rules:
    - alert: AbnormalLatency
      expr: |
        histogram_quantile(0.95, 
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
        ) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: High latency in {{ $labels.service }}
        description: 95th percentile latency is above 2 seconds

    - alert: ErrorRateHigh
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) 
        / 
        sum(rate(http_requests_total[5m])) * 100 > 5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: High error rate detected
        description: Error rate is above 5% for 5 minutes

    - alert: CertificateExpiringSoon
      expr: |
        (
          ssl_certificate_expiry_seconds 
          - 
          time()
        ) / 86400 < 30
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: SSL certificate expiring soon
        description: Certificate will expire in less than 30 days
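
Expressions like ErrorRateHigh can also be unit-tested with promtool before deployment. A minimal sketch, assuming the rule group above has been exported to a plain advanced-rules-spec.yaml file; the input series and values are synthetic:

# error-rate-test.yaml -- run with: promtool test rules error-rate-test.yaml
rule_files:
  - advanced-rules-spec.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{status="500", service="api"}'
        values: '0+10x20'   # 10 errors per minute
      - series: 'http_requests_total{status="200", service="api"}'
        values: '0+10x20'   # 10 successes per minute, so roughly a 50% error rate
    alert_rule_test:
      - eval_time: 10m
        alertname: ErrorRateHigh
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: High error rate detected
              description: Error rate is above 5% for 5 minutes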

Notification Templates

1. Slack Templates

slack_configs:
- channel: '#alerts'
  send_resolved: true
  title_link: '{{ template "slack.default.titlelink" . }}'
  title: |
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
  text: |
    {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }}
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }}
        • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      *Start:* {{ .StartsAt.Format "2006-01-02T15:04:05Z07:00" }}
    {{ end }}

2. Email Templates

templates:
- '/etc/alertmanager/templates/*.tmpl'

email_configs:
- to: 'team@example.com'
  html: '{{ template "email.custom.html" . }}'
  headers:
    subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}'
<!-- email.tmpl -->
{{ define "email.custom.html" }}
<!DOCTYPE html>
<html>
<head>
  <style>
    body { font-family: Arial, sans-serif; }
    .alert { margin: 20px; padding: 15px; border: 1px solid #ddd; }
    .firing { background-color: #ffebee; }
    .resolved { background-color: #e8f5e9; }
  </style>
</head>
<body>
  <h2>
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] 
    {{ .GroupLabels.alertname }}
  </h2>
  
  {{ range .Alerts }}
  <div class="alert {{ .Status }}">
    <h3>{{ .Annotations.summary }}</h3>
    <p>{{ .Annotations.description }}</p>
    <h4>Labels:</h4>
    <ul>
    {{ range .Labels.SortedPairs }}
      <li><strong>{{ .Name }}:</strong> {{ .Value }}</li>
    {{ end }}
    </ul>
    <p><strong>Start:</strong> {{ .StartsAt }}</p>
    {{ if eq .Status "resolved" }}
    <p><strong>End:</strong> {{ .EndsAt }}</p>
    {{ end }}
  </div>
  {{ end }}
</body>
</html>
{{ end }}
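
When AlertManager is managed by the Prometheus Operator (as in the manifests above), the template file has to travel with the configuration. One common approach is to add it as an extra key of the config Secret and point templates at the directory where the operator mounts that Secret; the /etc/alertmanager/config path below is an assumption, so adjust it to wherever your deployment actually mounts the Secret:

# alertmanager-config Secret (snippet)
stringData:
  alertmanager.yaml: |
    # ... global, route and receivers as above ...
    templates:
    - '/etc/alertmanager/config/*.tmpl'
  email.tmpl: |
    {{ define "email.custom.html" }}
    <!-- template body shown above -->
    {{ end }}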

Integration Examples

1. PagerDuty Integration

receivers:
- name: 'pagerduty'
  pagerduty_configs:
  - routing_key: '<your-pagerduty-routing-key>'
    send_resolved: true
    severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
    description: '{{ .CommonAnnotations.description }}'
    client: 'AlertManager'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
      num_firing: '{{ .Alerts.Firing | len }}'
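
Once the receiver is wired up, a synthetic alert can be injected by hand to exercise the route end to end. A sketch using amtool against a port-forwarded instance; the alert name and label set are purely illustrative:

kubectl port-forward -n monitoring alertmanager-main-0 9093:9093

amtool alert add TestPagerDutyAlert severity=critical \
  --annotation=summary='Test alert for the PagerDuty receiver' \
  --alertmanager.url=http://localhost:9093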

2. Microsoft Teams Integration

AlertManager v0.26+ ships a native Teams receiver (msteams_configs); on older versions, use a webhook adapter such as prometheus-msteams instead.

receivers:
- name: 'msteams'
  msteams_configs:
  - webhook_url: 'https://outlook.office.com/webhook/...'
    send_resolved: true
    title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
        **Alert:** {{ .Annotations.summary }}
        **Description:** {{ .Annotations.description }}
        **Severity:** {{ .Labels.severity }}
        **Start:** {{ .StartsAt }}
      {{ end }}

Additional Alert Templates

1. Opsgenie Template

receivers:
- name: 'opsgenie'
  opsgenie_configs:
  - api_key: '<your-api-key>'
    message: '{{ template "opsgenie.default.message" . }}'
    description: '{{ template "opsgenie.default.description" . }}'
    source: 'Prometheus'
    responders:
    - name: 'DevOps'
      type: 'team'
    - name: 'SRE'
      type: 'team'
    tags: '{{ .GroupLabels.severity }},{{ .GroupLabels.cluster }}'
    note: '{{ .CommonAnnotations.description }}'
    priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else if eq .GroupLabels.severity "warning" }}P2{{ else }}P3{{ end }}'

2. VictorOps Template

receivers:
- name: 'victorops'
  victorops_configs:
  - api_key: '<your-api-key>'
    routing_key: 'monitoring'
    message_type: '{{ if eq .Status "firing" }}CRITICAL{{ else }}RECOVERY{{ end }}'
    entity_display_name: '{{ .GroupLabels.alertname }}'
    state_message: |-
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Severity: {{ .Labels.severity }}
      Start: {{ .StartsAt }}
      {{ end }}
    custom_fields:
      alert_url: '{{ .ExternalURL }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'

3. Discord Template

AlertManager v0.25+ includes a native Discord receiver (discord_configs); older versions need a webhook adapter.

receivers:
- name: 'discord'
  discord_configs:
  - webhook_url: 'https://discord.com/api/webhooks/...'
    send_resolved: true
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}'
    message: |-
      {{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} **Alert Status: {{ .Status | toUpper }}**

      {{ range .Alerts }}
      **Alert:** {{ .Annotations.summary }}
      **Description:** {{ .Annotations.description }}
      **Severity:** {{ .Labels.severity }}
      **Start:** {{ .StartsAt }}

      **Labels:**
      {{ range .Labels.SortedPairs }}
      > **{{ .Name }}:** {{ .Value }}
      {{ end }}

      ---
      {{ end }}

4. Telegram Template

receivers:
- name: 'telegram'
  telegram_configs:
  - bot_token: '<your-bot-token>'
    chat_id: <your-chat-id>
    parse_mode: 'HTML'
    message: |-
      {{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} <b>{{ .Status | toUpper }}</b>
      
      <b>Alert:</b> {{ .GroupLabels.alertname }}
      {{ range .Alerts }}
      <b>Summary:</b> {{ .Annotations.summary }}
      <b>Description:</b> {{ .Annotations.description }}
      <b>Severity:</b> {{ .Labels.severity }}
      <b>Start:</b> {{ .StartsAt }}
      
      <b>Labels:</b>
      {{ range .Labels.SortedPairs }}
      • <b>{{ .Name }}:</b> <code>{{ .Value }}</code>
      {{ end }}
      
      ---
      {{ end }}

5. Twilio SMS Template

AlertManager has no native Twilio integration, and webhook_configs has no message field; it always POSTs its standard JSON payload. The SMS text therefore has to be composed by the receiving adapter (the adapter URL below is a placeholder).

receivers:
- name: 'twilio'
  webhook_configs:
  - url: 'https://twilio-alertmanager-adapter.example.com/alert'
    http_config:
      basic_auth:
        username: '<account-sid>'
        password: '<auth-token>'
    send_resolved: true
    max_alerts: 5

# SMS body the adapter is expected to render from the webhook payload:
#   "ALERT <alertname>: <summary of each alert>"  (or "RESOLVED ..." once the group resolves)

6. ServiceNow Template

As with the Twilio example, webhook_configs cannot carry custom fields; the adapter builds the ServiceNow incident from the standard webhook payload (the adapter URL below is a placeholder).

receivers:
- name: 'servicenow'
  webhook_configs:
  - url: 'https://servicenow-adapter.example.com/alert'
    http_config:
      basic_auth:
        username: '<username>'
        password: '<password>'
    send_resolved: true

# Suggested incident mapping for the adapter:
#   impact / urgency:    severity critical -> 1, warning -> 2, otherwise 3
#   short_description:   "[<STATUS>] <alertname>"
#   description:         per-alert summary, description, severity and start time

7. WeChat Template

receivers:
- name: 'wechat'
  wechat_configs:
  - corp_id: '<your-corp-id>'
    api_secret: '<your-api-secret>'
    agent_id: '<your-agent-id>'
    to_party: '<party-id>'
    message: |-
      {{ if eq .Status "firing" }}【告警】{{ else }}【恢复】{{ end }}
      
      告警名称:{{ .GroupLabels.alertname }}
      {{ range .Alerts }}
      概述:{{ .Annotations.summary }}
      描述:{{ .Annotations.description }}
      级别:{{ .Labels.severity }}
      开始时间:{{ .StartsAt }}
      
      标签:
      {{ range .Labels.SortedPairs }}
      - {{ .Name }}: {{ .Value }}
      {{ end }}
      
      ---
      {{ end }}

8. Matrix Template

AlertManager has no native Matrix receiver, so the webhook posts its standard payload to an adapter that relays the message to a room (the adapter URL below is a placeholder).

receivers:
- name: 'matrix'
  webhook_configs:
  - url: 'https://matrix-alertmanager-adapter.example.com/alert'
    http_config:
      authorization:
        credentials: '<access-token>'
    send_resolved: true

# Message the adapter is expected to render from the webhook payload:
#   🔥/✅ <STATUS>, the alert name, then per-alert summary, description,
#   severity, start time and the sorted label pairs.

9. Pushover Template

receivers:
- name: 'pushover'
  pushover_configs:
  - user_key: '<user-key>'
    token: '<api-token>'
    priority: '{{ if eq .GroupLabels.severity "critical" }}2{{ else if eq .GroupLabels.severity "warning" }}1{{ else }}0{{ end }}'
    title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
    message: |-
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      Severity: {{ .Labels.severity }}
      Start: {{ .StartsAt }}
      {{ end }}
    retry: 30s
    expire: 1h

10. Custom Webhook Template

webhook_configs has no custom_data field; the payload is fixed. AlertManager always POSTs the JSON document sketched below, and the receiving service parses whatever it needs from it.

receivers:
- name: 'custom-webhook'
  webhook_configs:
  - url: 'https://custom-webhook.example.com/alert'
    send_resolved: true
    http_config:
      authorization:
        credentials: '<your-token>'
    max_alerts: 10

# Standard webhook payload (version 4) sent by AlertManager:
# {
#   "version": "4",
#   "groupKey": "<key identifying the alert group>",
#   "truncatedAlerts": 0,
#   "status": "firing",
#   "receiver": "custom-webhook",
#   "groupLabels": { ... },
#   "commonLabels": { ... },
#   "commonAnnotations": { ... },
#   "externalURL": "<AlertManager URL>",
#   "alerts": [
#     {
#       "status": "firing",
#       "labels": { ... },
#       "annotations": { ... },
#       "startsAt": "<RFC3339>",
#       "endsAt": "<RFC3339>",
#       "generatorURL": "<Prometheus URL>",
#       "fingerprint": "<hash>"
#     }
#   ]
# }

Route Configuration Examples

1. Multi-Tenant Routing

route:
  receiver: 'default'
  group_by: ['tenant', 'alertname']
  routes:
  - match:
      tenant: 'team-a'
    receiver: 'team-a-slack'
    continue: true
  - match:
      tenant: 'team-b'
    receiver: 'team-b-pagerduty'
    continue: true
  - match:
      tenant: 'team-c'
    receiver: 'team-c-email'
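
Routing trees are easy to get subtly wrong, so it is worth checking them offline. amtool can render the tree and show which receivers a given label set would reach; alertmanager.yaml here is whatever your local copy of the configuration is called:

# Visualise the routing tree
amtool config routes show --config.file=alertmanager.yaml

# Test a label set against it (prints the matching receiver(s), e.g. team-a-slack)
amtool config routes test --config.file=alertmanager.yaml tenant=team-a severity=critical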

2. Time-Based Routing

route:
  receiver: 'default'
  group_by: ['alertname']
  routes:
  - match:
      severity: 'critical'
    receiver: 'pagerduty'
    continue: true
  - match:
      severity: 'warning'
    receiver: 'slack'
    continue: true
    mute_time_intervals:
    - 'non-working-hours'
  - match:
      severity: 'info'
    receiver: 'email'
    mute_time_intervals:
    - 'non-working-hours'

mute_time_intervals:
- name: 'non-working-hours'
  time_intervals:
  - weekdays: ['saturday', 'sunday']
  - times:
    - start_time: '17:00'
      end_time: '24:00'
    - start_time: '00:00'
      end_time: '09:00'

3. Region-Based Routing

route:
  receiver: 'default'
  group_by: ['region', 'alertname']
  routes:
  - match:
      region: 'us-east'
    receiver: 'us-team'
    routes:
    - match:
        severity: 'critical'
      receiver: 'us-pagerduty'
    - match:
        severity: 'warning'
      receiver: 'us-slack'
  - match:
      region: 'eu-west'
    receiver: 'eu-team'
    routes:
    - match:
        severity: 'critical'
      receiver: 'eu-pagerduty'
    - match:
        severity: 'warning'
      receiver: 'eu-slack'

High Availability Setup

# alertmanager-ha.yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: ha-alertmanager
  namespace: monitoring
spec:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - alertmanager
        topologyKey: kubernetes.io/hostname
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 50Gi
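
The operator wires the replicas into a gossip cluster automatically (it sets the --cluster.peer flags for you), so the things worth verifying are that the anti-affinity rule actually spread the pods and that the peers see each other. A quick check, assuming the resource name above:

# Pods should be scheduled on three different nodes
kubectl get pods -n monitoring -l alertmanager=ha-alertmanager -o wide

# Each replica reports its peers on the v2 status endpoint
kubectl port-forward -n monitoring alertmanager-ha-alertmanager-0 9093:9093
curl -s http://localhost:9093/api/v2/status | jq '.cluster.peers'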

Best Practices

1. Alert Grouping

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

2. Inhibition Rules

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'cluster', 'service']

3. Resource Management

resources:
  requests:
    memory: "200Mi"
    cpu: "100m"
  limits:
    memory: "500Mi"
    cpu: "300m"

Troubleshooting

Common Issues and Solutions

1. Alert Not Firing

# Check AlertManager configuration
kubectl get secret alertmanager-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Check AlertManager logs
kubectl logs -n monitoring alertmanager-main-0

# Check Prometheus rules
kubectl get prometheusrules -n monitoring
2. Notification Issues

# Check AlertManager status
kubectl port-forward -n monitoring alertmanager-main-0 9093:9093
# Visit http://localhost:9093/#/status

# Test alert delivery
curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"TestAlert"}}]' http://localhost:9093/api/v2/alerts
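
amtool, which ships with AlertManager, is also useful here; the commands below assume the configuration has been copied locally and the port-forward above is still running:

# Validate the configuration file before loading it
amtool check-config alertmanager.yaml

# Inspect the alerts and silences AlertManager currently knows about
amtool alert query --alertmanager.url=http://localhost:9093
amtool silence query --alertmanager.url=http://localhost:9093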
