Prometheus Recording Rules in Kubernetes
Create and manage Prometheus recording rules for better performance and scalability
Recording rules in Prometheus allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.
Why Use Recording Rules?
- Performance Optimization
  - Reduce query-time computation
  - Decrease load on Prometheus server
  - Improve dashboard responsiveness
- Complex Query Simplification
  - Precompute complex PromQL expressions
  - Make queries more readable
  - Reuse common calculations
Recording Rules Configuration
Basic Structure
groups:
  - name: example
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
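Beyond `record` and `expr`, a group can set its own `interval` and a rule can attach static `labels`. The following is a minimal sketch of both options; the group name, the 1m interval, and the `team` label are illustrative assumptions, not values from a real setup.

```yaml
groups:
  - name: example_with_options     # hypothetical group name
    interval: 1m                   # optional; falls back to the global evaluation_interval
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
        labels:
          team: platform           # hypothetical static label added to every output series
```

Rules inside one group are evaluated in order at the same interval, so related rules are usually kept in the same group.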
Common Patterns
- Rate Recording
  groups:
    - name: node_recording_rules
      rules:
        - record: instance:node_cpu:avg_rate5m
          expr: avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
- Error Rate Calculation
  groups:
    - name: errors_recording_rules
      rules:
        - record: job:http_errors:ratio_rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            /
            sum(rate(http_requests_total[5m])) by (job)
- Memory Usage
  groups:
    - name: memory_recording_rules
      rules:
        - record: instance:memory:usage_ratio
          expr: |
            (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
            / node_memory_MemTotal_bytes
Implementation in Kubernetes
1. ConfigMap Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-recording-rules
data:
  recording-rules.yaml: |
    groups:
      - name: kubernetes_resources
        rules:
          - record: node:container_cpu_usage:avg
            expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (node)
          - record: node:container_memory_usage:avg
            expr: avg(container_memory_working_set_bytes) by (node)
2. Prometheus Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    rule_files:
      - /etc/prometheus/recording-rules.yaml
    # ... rest of prometheus config
3. Kubernetes Deployment
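apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus            # assumed label; must match the pod template labels
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus # pin a specific version tag in practice
          volumeMounts:
            # subPath mounts do not receive later ConfigMap updates;
            # restart the pod after changing the rules
            - name: recording-rules
              mountPath: /etc/prometheus/recording-rules.yaml
              subPath: recording-rules.yaml
      volumes:
        - name: recording-rules
          configMap:
            name: prometheus-recording-rules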
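Before rolling out a rule file, it can help to unit-test it with `promtool test rules`. The sketch below assumes the rule groups from the `recording-rules.yaml` key of the ConfigMap in step 1 have been saved as a standalone file named `recording-rules.yaml` in the same directory. The test file name, the `node-a` node, the `pod` labels, and the sample values are all made up for illustration.

```yaml
# recording-rules-test.yaml (hypothetical file name); run with:
#   promtool test rules recording-rules-test.yaml
rule_files:
  - recording-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Two containers on the same node with constant working sets (six samples each)
      - series: 'container_memory_working_set_bytes{node="node-a", pod="p1"}'
        values: '100+0x5'
      - series: 'container_memory_working_set_bytes{node="node-a", pod="p2"}'
        values: '300+0x5'
    promql_expr_test:
      - expr: node:container_memory_usage:avg
        eval_time: 5m
        exp_samples:
          # avg(100, 300) over the node
          - labels: 'node:container_memory_usage:avg{node="node-a"}'
            value: 200
```

Running `promtool check rules recording-rules.yaml` first catches syntax errors before the unit test runs.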
Best Practices
1. Naming Conventions
# level:metric:operations
instance:request_duration_seconds:avg_rate5m
job:errors:rate5m
cluster:memory_usage:ratio
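To show how the three parts of a name relate to the expression, here is the CPU rule from Common Patterns again with each part annotated. The group name is a hypothetical placeholder, and the comments reflect the general convention rather than anything enforced by Prometheus.

```yaml
groups:
  - name: naming_example           # hypothetical group name
    rules:
      # level      = instance      -> labels kept by the aggregation: by (instance)
      # metric     = node_cpu      -> source metric, shortened from node_cpu_seconds_total
      # operations = avg_rate5m    -> rate over 5m, then avg (most recent operation first)
      - record: instance:node_cpu:avg_rate5m
        expr: avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
```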
2. Performance Optimization
groups:
  # Group related rules together
  - name: http_metrics
    interval: 5m # Evaluation interval
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      - record: job:http_errors:rate5m
        expr: sum(rate(http_errors_total[5m])) by (job)
3. Resource Usage Rules
groups:
  - name: resource_usage
    rules:
      - record: instance:cpu:usage_ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      - record: instance:memory:usage_bytes
        expr: node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
      - record: instance:disk:usage_ratio
        expr: 1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
Monitoring Recording Rules
1. Rule Health Metrics
# Check rule evaluation duration
rate(prometheus_rule_evaluation_duration_seconds_sum[5m])
# Check rule evaluation failures
rate(prometheus_rule_evaluation_failures_total[5m])
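These health metrics can also drive alerts. The following is a rough sketch of two alerting rules in the same rule-file format; the group name, alert names, `for` durations, and `severity` label are assumptions chosen for illustration and would need tuning for a real environment.

```yaml
groups:
  - name: rule_health_alerts       # hypothetical group name
    rules:
      # Fires when any rule group has had failing evaluations recently
      - alert: RuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} has failing evaluations"
      # Fires when a group's last evaluation took longer than its interval
      - alert: RuleGroupSlow
        expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} evaluates slower than its interval"
```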
2. Dashboard Example
# Grafana Dashboard JSON
{
  "panels": [
    {
      "title": "Rule Group Evaluation Duration",
      "targets": [
        {
          "expr": "prometheus_rule_group_last_duration_seconds",
          "legendFormat": "{{rule_group}}"
        }
      ]
    }
  ]
}
Troubleshooting
Common Issues
- High Cardinality
  # Bad - Too many time series
  - record: instance:http_requests:status_code_path
    expr: sum(http_requests_total) by (instance, status_code, path)
  # Good - Reduced cardinality
  - record: instance:http_requests:status_code
    expr: sum(http_requests_total) by (instance, status_code)
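- Evaluation Timeouts
  # A rule has no timeout field. If a group takes longer than its interval,
  # the following iterations are missed.
  # Expensive - the subquery recomputes rate() across the whole 1h window on every evaluation
  - record: job:slow_query:rate5m
    expr: max_over_time(rate(slow_query_seconds_total[5m])[1h:5m])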
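One way to make a subquery like the one above cheaper is to split it into two layered rules: record the inner `rate()` first, then aggregate the recorded series over time. This is a sketch only, and the group and record names are hypothetical. It is not strictly equivalent to the subquery, because the recorded series is sampled at the group's evaluation interval rather than at the subquery's 5m step.

```yaml
groups:
  - name: slow_query_rules         # hypothetical group name
    rules:
      # Step 1: record the inner rate once per evaluation
      - record: instance:slow_query_seconds:rate5m
        expr: rate(slow_query_seconds_total[5m])
      # Step 2: take the 1h maximum over the already recorded series
      - record: instance:slow_query_seconds:max_rate5m_1h
        expr: max_over_time(instance:slow_query_seconds:rate5m[1h])
```

Because rules in a group run in order, the second rule sees the value the first rule has just recorded. The 1h maximum only covers the full window once an hour of recorded data exists.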
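- Memory Usage
  # Resident memory of the Prometheus process
  process_resident_memory_bytes{job="prometheus"}
  # Heap allocation rate in bytes per second
  rate(go_memstats_alloc_bytes_total{job="prometheus"}[5m])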