Prometheus AlertManager in Kubernetes
Complete guide for setting up and configuring AlertManager with Prometheus in Kubernetes
This guide provides detailed instructions for setting up and configuring AlertManager in Kubernetes, enabling sophisticated alert handling and notification routing.
What is AlertManager?
AlertManager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing alerts to the correct receiver integration such as email, PagerDuty, or Slack, and it also handles silencing and inhibition of alerts.
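Prometheus does not notify anyone by itself: it evaluates alerting rules and forwards firing alerts to AlertManager, which you point it at in Prometheus' own configuration. A minimal sketch is shown below; the Service name and port are assumptions for an in-cluster AlertManager in the monitoring namespace, so adjust them to whatever your installation creates.
# prometheus.yml (sketch)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-operated.monitoring.svc:9093'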
Installation
1. Using Helm
# Add Prometheus repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install AlertManager
helm install alertmanager prometheus-community/alertmanager \
--namespace monitoring \
--create-namespace
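Once the release is installed, it is worth confirming that a pod is running and a Service was created before moving on. The label selector below is typical for this chart but may vary between chart versions:
# Verify the Helm release
helm status alertmanager -n monitoring
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
kubectl get svc -n monitoring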
2. Using Manifests
# alertmanager.yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3
  configSecret: alertmanager-config
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'password'
      slack_api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    route:
      group_by: ['job', 'alertname', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
        - match:
            severity: warning
          receiver: 'slack-notifications'
        - match:
            severity: info
          receiver: 'email-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
            title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
            text: >-
              {{ range .Alerts }}
              *Alert:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Severity:* {{ .Labels.severity }}
              *Start:* {{ .StartsAt }}
              {{ end }}
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: '<your-pagerduty-service-key>'
            send_resolved: true
      - name: 'email-notifications'
        email_configs:
          - to: 'team@example.com'
            send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'instance']
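Apply the manifest and let the Prometheus Operator reconcile the Alertmanager resource into a StatefulSet. If you have amtool available locally, you can also lint the embedded configuration before creating the Secret; the file name below is simply the alertmanager.yaml key saved to a local file for the check.
# Optional: validate the embedded Alertmanager configuration
amtool check-config alertmanager-config.yaml

# Create the resources and watch the pods come up
kubectl apply -f alertmanager.yaml
kubectl -n monitoring get alertmanager main
kubectl -n monitoring get pods -w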
Alert Rules Configuration
1. Basic Alert Rules
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes
      rules:
        - alert: HighCPUUsage
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High CPU usage on {{ $labels.instance }}
            description: CPU usage is above 80% for 5 minutes
        - alert: HighMemoryUsage
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: High memory usage on {{ $labels.instance }}
            description: Memory usage is above 85% for 5 minutes
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping
            description: Pod has restarted several times in the last 15 minutes
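Apply the rules and confirm that Prometheus actually loaded them. An operator-managed Prometheus only picks up PrometheusRule objects that match its ruleSelector, so depending on your setup you may need an extra label (for example release: <your-helm-release>) on the metadata. The Service name below assumes an operator-created Prometheus:
kubectl apply -f prometheus-rules.yaml
kubectl -n monitoring get prometheusrules kubernetes-alerts

# Check that the rules show up in the Prometheus UI
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/rules and http://localhost:9090/alerts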
2. Advanced Alert Rules
# advanced-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: advanced-alerts
  namespace: monitoring
spec:
  groups:
    - name: advanced
      rules:
        - alert: AbnormalLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: High latency in {{ $labels.service }}
            description: 95th percentile latency is above 2 seconds
        - alert: ErrorRateHigh
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m])) * 100 > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: High error rate detected
            description: Error rate is above 5% for 5 minutes
        - alert: CertificateExpiringSoon
          expr: |
            (ssl_certificate_expiry_seconds - time()) / 86400 < 30
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: SSL certificate expiring soon
            description: Certificate will expire in less than 30 days
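Expressions like ErrorRateHigh are easy to get subtly wrong, so it helps to unit-test them with promtool. promtool works on plain rule files rather than PrometheusRule objects, so the sketch below assumes the advanced group has been copied into a local file named advanced-rules-plain.yaml; the series names and values are illustrative.
# error-rate.test.yaml -- run with: promtool test rules error-rate.test.yaml
rule_files:
  - advanced-rules-plain.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # half of all requests fail, which should trip the 5% threshold
      - series: 'http_requests_total{status="500", service="api"}'
        values: '0+10x20'
      - series: 'http_requests_total{status="200", service="api"}'
        values: '0+10x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: ErrorRateHigh
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: High error rate detected
              description: Error rate is above 5% for 5 minutes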
Notification Templates
1. Slack Templates
slack_configs:
  - channel: '#alerts'
    send_resolved: true
    title_link: '{{ template "slack.default.titlelink" . }}'
    title: |
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}
    text: |
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }}
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }}
      • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      *Start:* {{ .StartsAt | date "2006-01-02T15:04:05Z07:00" }}
      {{ end }}
2. Email Templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# ...inside a receiver definition:
email_configs:
  - to: 'team@example.com'
    html: '{{ template "email.custom.html" . }}'
    headers:
      subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.alertname }}'
<!-- email.tmpl -->
{{ define "email.custom.html" }}
<!DOCTYPE html>
<html>
<head>
  <style>
    body { font-family: Arial, sans-serif; }
    .alert { margin: 20px; padding: 15px; border: 1px solid #ddd; }
    .firing { background-color: #ffebee; }
    .resolved { background-color: #e8f5e9; }
  </style>
</head>
<body>
  <h2>
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
    {{ .GroupLabels.alertname }}
  </h2>
  {{ range .Alerts }}
  <div class="alert {{ .Status }}">
    <h3>{{ .Annotations.summary }}</h3>
    <p>{{ .Annotations.description }}</p>
    <h4>Labels:</h4>
    <ul>
      {{ range .Labels.SortedPairs }}
      <li><strong>{{ .Name }}:</strong> {{ .Value }}</li>
      {{ end }}
    </ul>
    <p><strong>Start:</strong> {{ .StartsAt }}</p>
    {{ if .EndsAt }}
    <p><strong>End:</strong> {{ .EndsAt }}</p>
    {{ end }}
  </div>
  {{ end }}
</body>
</html>
{{ end }}
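For the operator-based installation above, the template file has to reach the container somehow. One common approach is to add it as an extra key in the same Secret that holds alertmanager.yaml and reference it from templates; the mount path below assumes the operator's default of /etc/alertmanager/config, so verify the actual path inside your pod before relying on it.
# In the alertmanager-config Secret (sketch)
stringData:
  alertmanager.yaml: |
    templates:
      - '/etc/alertmanager/config/*.tmpl'
    # ...rest of the configuration...
  email.tmpl: |
    {{ define "email.custom.html" }}
    <!-- template body from above -->
    {{ end }}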
Integration Examples
1. PagerDuty Integration
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<your-pagerduty-routing-key>'
        send_resolved: true
        severity: '{{ if eq .GroupLabels.severity "critical" }}critical{{ else }}warning{{ end }}'
        description: '{{ .CommonAnnotations.description }}'
        client: 'AlertManager'
        client_url: '{{ template "pagerduty.default.clientURL" . }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
          resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
          num_firing: '{{ .Alerts.Firing | len }}'
2. Microsoft Teams Integration
receivers:
  - name: 'msteams'
    # native Microsoft Teams support requires Alertmanager v0.26+
    msteams_configs:
      - webhook_url: 'https://outlook.office.com/webhook/...'
        send_resolved: true
        title: '{{ template "msteams.default.title" . }}'
        text: |
          {{ range .Alerts }}
          **Alert:** {{ .Annotations.summary }}
          **Description:** {{ .Annotations.description }}
          **Severity:** {{ .Labels.severity }}
          **Start:** {{ .StartsAt }}
          {{ end }}
Additional Alert Templates
1. Opsgenie Template
receivers:
  - name: 'opsgenie'
    opsgenie_configs:
      - api_key: '<your-api-key>'
        message: '{{ template "opsgenie.default.message" . }}'
        description: '{{ template "opsgenie.default.description" . }}'
        source: 'Prometheus'
        responders:
          - name: 'DevOps'
            type: 'team'
          - name: 'SRE'
            type: 'team'
        # tags is a comma-separated template string, not a YAML list
        tags: '{{ .GroupLabels.severity }},{{ .GroupLabels.cluster }}'
        note: '{{ template "opsgenie.default.note" . }}'
        priority: '{{ if eq .GroupLabels.severity "critical" }}P1{{ else if eq .GroupLabels.severity "warning" }}P2{{ else }}P3{{ end }}'
2. VictorOps Template
receivers:
  - name: 'victorops'
    victorops_configs:
      - api_key: '<your-api-key>'
        routing_key: 'monitoring'
        message_type: '{{ if eq .Status "firing" }}CRITICAL{{ else }}RECOVERY{{ end }}'
        entity_display_name: '{{ .GroupLabels.alertname }}'
        state_message: |-
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Start: {{ .StartsAt }}
          {{ end }}
        custom_fields:
          alert_url: '{{ template "victorops.default.alertURL" . }}'
          num_firing: '{{ .Alerts.Firing | len }}'
          num_resolved: '{{ .Alerts.Resolved | len }}'
3. Discord Template
receivers:
  - name: 'discord'
    # native Discord support requires Alertmanager v0.25+
    discord_configs:
      - webhook_url: 'https://discord.com/api/webhooks/...'
        send_resolved: true
        title: |-
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
          {{ .GroupLabels.alertname }}
        message: |-
          {{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} **Alert Status: {{ .Status | toUpper }}**
          {{ range .Alerts }}
          **Alert:** {{ .Annotations.summary }}
          **Description:** {{ .Annotations.description }}
          **Severity:** {{ .Labels.severity }}
          **Start:** {{ .StartsAt }}
          **Labels:**
          {{ range .Labels.SortedPairs }}
          > **{{ .Name }}:** {{ .Value }}
          {{ end }}
          ---
          {{ end }}
4. Telegram Template
receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: '<your-bot-token>'
        chat_id: <your-chat-id>
        parse_mode: 'HTML'
        message: |-
          {{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} <b>{{ .Status | toUpper }}</b>
          <b>Alert:</b> {{ .GroupLabels.alertname }}
          {{ range .Alerts }}
          <b>Summary:</b> {{ .Annotations.summary }}
          <b>Description:</b> {{ .Annotations.description }}
          <b>Severity:</b> {{ .Labels.severity }}
          <b>Start:</b> {{ .StartsAt }}
          <b>Labels:</b>
          {{ range .Labels.SortedPairs }}
          • <b>{{ .Name }}:</b> <code>{{ .Value }}</code>
          {{ end }}
          ---
          {{ end }}
5. Twilio SMS Template
receivers:
  - name: 'twilio'
    webhook_configs:
      - url: 'https://twilio-alertmanager-adapter.example.com/alert'
        http_config:
          basic_auth:
            username: '<account-sid>'
            password: '<auth-token>'
        send_resolved: true
        max_alerts: 5
        text: |-
          {{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }}
          {{ .GroupLabels.alertname }}:
          {{ range .Alerts }}
          - {{ .Annotations.summary }}
          {{ end }}
6. ServiceNow Template
receivers:
  - name: 'servicenow'
    webhook_configs:
      - url: 'https://servicenow-adapter.example.com/alert'
        http_config:
          basic_auth:
            username: '<username>'
            password: '<password>'
        send_resolved: true
        custom_data:
          incident:
            impact: '{{ if eq .GroupLabels.severity "critical" }}1{{ else if eq .GroupLabels.severity "warning" }}2{{ else }}3{{ end }}'
            urgency: '{{ if eq .GroupLabels.severity "critical" }}1{{ else if eq .GroupLabels.severity "warning" }}2{{ else }}3{{ end }}'
            short_description: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
            description: |-
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              Severity: {{ .Labels.severity }}
              Start: {{ .StartsAt }}
              {{ end }}
7. WeChat Template
receivers:
  - name: 'wechat'
    wechat_configs:
      - corp_id: '<your-corp-id>'
        api_secret: '<your-api-secret>'
        agent_id: '<your-agent-id>'
        to_party: '<party-id>'
        message: |-
          {{ if eq .Status "firing" }}[FIRING]{{ else }}[RESOLVED]{{ end }}
          Alert name: {{ .GroupLabels.alertname }}
          {{ range .Alerts }}
          Summary: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Start: {{ .StartsAt }}
          Labels:
          {{ range .Labels.SortedPairs }}
          - {{ .Name }}: {{ .Value }}
          {{ end }}
          ---
          {{ end }}
8. Matrix Template
receivers:
  - name: 'matrix'
    webhook_configs:
      - url: 'https://matrix-alertmanager-adapter.example.com/alert'
        http_config:
          bearer_token: '<access-token>'
        send_resolved: true
        text: |-
          {{ if eq .Status "firing" }}🔥{{ else }}✅{{ end }} {{ .Status | toUpper }}
          **Alert:** {{ .GroupLabels.alertname }}
          {{ range .Alerts }}
          **Summary:** {{ .Annotations.summary }}
          **Description:** {{ .Annotations.description }}
          **Severity:** {{ .Labels.severity }}
          **Start:** {{ .StartsAt }}
          **Labels:**
          {{ range .Labels.SortedPairs }}
          • **{{ .Name }}:** `{{ .Value }}`
          {{ end }}
          ---
          {{ end }}
9. Pushover Template
receivers:
  - name: 'pushover'
    pushover_configs:
      - user_key: '<user-key>'
        token: '<api-token>'
        priority: '{{ if eq .GroupLabels.severity "critical" }}2{{ else if eq .GroupLabels.severity "warning" }}1{{ else }}0{{ end }}'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        message: |-
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          Severity: {{ .Labels.severity }}
          Start: {{ .StartsAt }}
          {{ end }}
        # retry and expire are durations
        retry: 30s
        expire: 1h
10. Custom Webhook Template
receivers:
  - name: 'custom-webhook'
    webhook_configs:
      - url: 'https://custom-webhook.example.com/alert'
        send_resolved: true
        http_config:
          bearer_token: '<your-token>'
        max_alerts: 10
        custom_data:
          version: '1.0'
          generator: 'Prometheus AlertManager'
          status: '{{ .Status }}'
          groupKey: '{{ .GroupKey }}'
          groupLabels: {{ .GroupLabels | toJson }}
          commonLabels: {{ .CommonLabels | toJson }}
          commonAnnotations: {{ .CommonAnnotations | toJson }}
          alerts:
            {{ range .Alerts }}
            - status: '{{ .Status }}'
              labels: {{ .Labels | toJson }}
              annotations: {{ .Annotations | toJson }}
              startsAt: '{{ .StartsAt }}'
              {{ if .EndsAt }}endsAt: '{{ .EndsAt }}'{{ end }}
              generatorURL: '{{ .GeneratorURL }}'
            {{ end }}
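Keep in mind that Alertmanager's built-in webhook_config does not template the request body: keys such as text and custom_data in the adapter-style examples above are not part of the stock webhook_config schema and are shown as the shape such an adapter might expect. A stock Alertmanager simply POSTs a fixed JSON document to the webhook URL and leaves formatting to the receiving service. For reference, that payload looks roughly like this (illustrative values, version 4 of the documented format):
{
  "version": "4",
  "groupKey": "{}:{alertname=\"HighCPUUsage\"}",
  "truncatedAlerts": 0,
  "status": "firing",
  "receiver": "custom-webhook",
  "groupLabels": { "alertname": "HighCPUUsage" },
  "commonLabels": { "alertname": "HighCPUUsage", "severity": "warning" },
  "commonAnnotations": { "summary": "High CPU usage on node-1" },
  "externalURL": "http://alertmanager.example.com",
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "HighCPUUsage", "severity": "warning", "instance": "node-1" },
      "annotations": { "summary": "High CPU usage on node-1" },
      "startsAt": "2024-01-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus.example.com/graph?g0.expr=...",
      "fingerprint": "a1b2c3d4e5f60708"
    }
  ]
}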
Route Configuration Examples
1. Multi-Tenant Routing
route:
  receiver: 'default'
  group_by: ['tenant', 'alertname']
  routes:
    - match:
        tenant: 'team-a'
      receiver: 'team-a-slack'
      continue: true
    - match:
        tenant: 'team-b'
      receiver: 'team-b-pagerduty'
      continue: true
    - match:
        tenant: 'team-c'
      receiver: 'team-c-email'
2. Time-Based Routing
route:
  receiver: 'default'
  group_by: ['alertname']
  routes:
    - match:
        severity: 'critical'
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: 'warning'
      receiver: 'slack'
      continue: true
      mute_time_intervals:
        - non-working-hours
    - match:
        severity: 'info'
      receiver: 'email'
      mute_time_intervals:
        - non-working-hours

# Interval definitions (a time range cannot cross midnight, so the evening
# and early-morning hours are listed as separate ranges)
time_intervals:
  - name: non-working-hours
    time_intervals:
      - weekdays: ['saturday', 'sunday']
      - times:
          - start_time: '17:00'
            end_time: '24:00'
      - times:
          - start_time: '00:00'
            end_time: '09:00'
3. Region-Based Routing
route:
  receiver: 'default'
  group_by: ['region', 'alertname']
  routes:
    - match:
        region: 'us-east'
      receiver: 'us-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'us-pagerduty'
        - match:
            severity: 'warning'
          receiver: 'us-slack'
    - match:
        region: 'eu-west'
      receiver: 'eu-team'
      routes:
        - match:
            severity: 'critical'
          receiver: 'eu-pagerduty'
        - match:
            severity: 'warning'
          receiver: 'eu-slack'
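Routing trees like these are easy to validate offline with amtool before shipping them: point it at the rendered configuration (the alertmanager.yaml extracted from your Secret) and feed it a label set to see which receiver would be chosen.
# Print the routing tree
amtool config routes show --config.file=alertmanager.yaml

# Which receiver does a given alert land on?
amtool config routes test --config.file=alertmanager.yaml severity=critical region=us-east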
High Availability Setup
# alertmanager-ha.yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: ha-alertmanager
  namespace: monitoring
spec:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - alertmanager
          topologyKey: kubernetes.io/hostname
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 50Gi
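With three replicas, the pods gossip with each other and each instance reports the cluster state on its status API. A quick port-forward is enough to confirm that all peers have joined; the pod name follows the operator's alertmanager-<name>-N convention, and jq is used only for readability.
kubectl -n monitoring port-forward alertmanager-ha-alertmanager-0 9093:9093
curl -s http://localhost:9093/api/v2/status | jq '.cluster.status, .cluster.peers'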
Best Practices
1. Alert Grouping
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
2. Inhibition Rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
3. Resource Management
resources:
  requests:
    memory: "200Mi"
    cpu: "100m"
  limits:
    memory: "500Mi"
    cpu: "300m"
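With the operator, this block belongs under the Alertmanager resource's spec rather than on a pod you manage directly, roughly like this:
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3
  resources:
    requests:
      memory: "200Mi"
      cpu: "100m"
    limits:
      memory: "500Mi"
      cpu: "300m"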
Troubleshooting
Common Issues and Solutions
- Alert Not Firing
# Check AlertManager configuration
kubectl get secret alertmanager-config -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
# Check AlertManager logs
kubectl logs -n monitoring alertmanager-main-0
# Check Prometheus rules
kubectl get prometheusrules -n monitoring
- Notification Issues
# Check AlertManager status
kubectl port-forward -n monitoring alertmanager-main-0 9093:9093
# Visit http://localhost:9093/#/status
# Test alert delivery
curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"TestAlert","severity":"warning"}}]' http://localhost:9093/api/v2/alerts
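amtool can also talk to the same port-forwarded API, which is often faster than the UI for checking what Alertmanager currently sees (assumes amtool is installed locally):
# List active alerts and silences
amtool --alertmanager.url=http://localhost:9093 alert query
amtool --alertmanager.url=http://localhost:9093 silence query

# Fire a throwaway test alert without hand-writing JSON
amtool --alertmanager.url=http://localhost:9093 alert add alertname=TestAlert severity=warning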