Prometheus Recording Rules in Kubernetes

Create and manage Prometheus recording rules for better performance and scalability

Recording rules in Prometheus allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.

Why Use Recording Rules?

  1. Performance Optimization

    • Reduce query-time computation
    • Decrease load on Prometheus server
    • Improve dashboard responsiveness
  2. Complex Query Simplification

    • Precompute complex PromQL expressions
    • Make queries more readable
    • Reuse common calculations

Recording Rules Configuration

Basic Structure

groups:
  - name: example
    rules:
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)
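
Beyond `record` and `expr`, a rule may carry static `labels`, and a group may set its own `interval`. The following is a minimal sketch of those optional fields; the group name and the `team` label are illustrative assumptions, not part of the setup described here.

```yaml
groups:
  - name: example_with_options      # hypothetical group name
    interval: 1m                    # optional; falls back to the global evaluation_interval
    rules:
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)
      labels:
        team: platform              # hypothetical static label added to every recorded series
```

Labels set this way overwrite any label of the same name produced by the expression, so they are best reserved for values that are constant for the whole rule.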

Common Patterns

  1. Rate Recording
groups:
  - name: node_recording_rules
    rules:
    - record: instance:node_cpu:avg_rate5m
      expr: avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
  2. Error Rate Calculation
groups:
  - name: errors_recording_rules
    rules:
    - record: job:http_errors:ratio_rate5m
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
        /
        sum(rate(http_requests_total[5m])) by (job)
  3. Memory Usage
groups:
  - name: memory_recording_rules
    rules:
    - record: instance:memory:usage_ratio
      expr: |
        (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
        /
        node_memory_MemTotal_bytes
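
The same approach extends to latency percentiles, which are among the more expensive expressions to run at query time. The sketch below assumes the application exposes a histogram named `http_request_duration_seconds` (a hypothetical metric here); the group and record names are likewise illustrative.

```yaml
groups:
  - name: latency_recording_rules   # hypothetical group name
    rules:
    # Keep the `le` label so the buckets can still be passed to histogram_quantile().
    - record: job_le:http_request_duration_seconds_bucket:rate5m
      expr: sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
    # Build the percentile from the recorded buckets instead of the raw series.
    - record: job:http_request_duration_seconds:p95_rate5m
      expr: histogram_quantile(0.95, job_le:http_request_duration_seconds_bucket:rate5m)
```

Rules within one group are evaluated in order, so the second rule can read the series written by the first.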

Implementation in Kubernetes

1. ConfigMap Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-recording-rules
data:
  recording-rules.yaml: |
    groups:
      - name: kubernetes_resources
        rules:
        # Assumes the scrape config attaches a `node` label to the cAdvisor series.
        - record: node:container_cpu_usage:avg
          expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (node)
        - record: node:container_memory_usage:avg
          expr: avg(container_memory_working_set_bytes) by (node)

2. Prometheus Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    rule_files:
      - /etc/prometheus/recording-rules.yaml
    # ... rest of prometheus config

3. Kubernetes Deployment

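apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus  # assumed label; must match the pod template
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus  # pin a specific version tag in practice
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus/prometheus.yml
          subPath: prometheus.yml
        # subPath mounts are not refreshed when the ConfigMap changes
        - name: recording-rules
          mountPath: /etc/prometheus/recording-rules.yaml
          subPath: recording-rules.yaml
      volumes:
      - name: config
        configMap:
          name: prometheus-config
      - name: recording-rules
        configMap:
          name: prometheus-recording-rules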
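
Files mounted through `subPath` do not pick up later changes to the ConfigMap, so edited rules only appear after the pod restarts. One possible alternative, sketched below, mounts the whole ConfigMap as a directory and points `rule_files` at a glob. The directory `/etc/prometheus/rules` is an assumption chosen for illustration.

```yaml
# Fragment of the Deployment's pod spec: mount the ConfigMap as a directory.
volumeMounts:
- name: recording-rules
  mountPath: /etc/prometheus/rules   # assumed path; each ConfigMap key becomes a file here
volumes:
- name: recording-rules
  configMap:
    name: prometheus-recording-rules
---
# Matching fragment of prometheus.yml
rule_files:
  - /etc/prometheus/rules/*.yaml
```

Prometheus still has to reload its configuration to see the new files, for example via SIGHUP or the `/-/reload` endpoint when `--web.enable-lifecycle` is set.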

Best Practices

1. Naming Conventions

# level:metric:operations
instance:request_duration_seconds:avg_rate5m
job:errors:rate5m  # drop the _total suffix once rate() is applied
cluster:memory_usage:ratio
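
To illustrate how the level prefix tracks the labels that remain, here is a short sketch along the lines of the Prometheus naming guidance. The metric `requests_total` and the job name `myjob` are placeholders.

```yaml
groups:
  - name: naming_example             # hypothetical group name
    rules:
    # Level lists the labels still present: instance and path.
    - record: instance_path:requests:rate5m
      expr: rate(requests_total{job="myjob"}[5m])
    # Aggregating away `instance` removes it from the level prefix.
    - record: path:requests:rate5m
      expr: sum without (instance) (instance_path:requests:rate5m{job="myjob"})
```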

2. Performance Optimization

groups:
  # Group related rules together
  - name: http_metrics
    interval: 5m  # Per-group evaluation interval; overrides the global evaluation_interval
    rules:
    - record: job:http_requests:rate5m
      expr: sum(rate(http_requests_total[5m])) by (job)
    - record: job:http_errors:rate5m
      expr: sum(rate(http_errors_total[5m])) by (job)

3. Resource Usage Rules

groups:
  - name: resource_usage
    rules:
    - record: instance:cpu:usage_ratio
      expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
    
    - record: instance:memory:usage_bytes
      expr: node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
    
    - record: instance:disk:usage_ratio
      expr: 1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

Monitoring Recording Rules

1. Rule Health Metrics

# Average rule evaluation duration over the last 5 minutes (seconds)
rate(prometheus_rule_evaluation_duration_seconds_sum[5m])
  / rate(prometheus_rule_evaluation_duration_seconds_count[5m])

# Check rule evaluation failures
rate(prometheus_rule_evaluation_failures_total[5m])
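
These health metrics can themselves drive alerts, so that a failing recording rule does not go unnoticed. The sketch below is one way to do that; the group name, alert names, `for` durations, and `severity` values are assumptions to adapt to your own alerting conventions.

```yaml
groups:
  - name: rule_health_alerts                    # hypothetical group name
    rules:
    - alert: RecordingRuleEvaluationFailing     # hypothetical alert name
      expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Rule group {{ $labels.rule_group }} is failing to evaluate"
    - alert: RuleGroupIterationsMissed          # hypothetical alert name
      # Fires when a group takes longer than its interval and skips evaluations.
      expr: increase(prometheus_rule_group_iterations_missed_total[5m]) > 0
      for: 10m
      labels:
        severity: warning
```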

2. Dashboard Example

# Grafana Dashboard JSON
{
  "panels": [
    {
      "title": "Rule Evaluation Duration",
      "targets": [
        {
          "expr": "prometheus_rule_group_last_duration_seconds",
          "legendFormat": "{{rule_group}}"
        }
      ]
    }
  ]
}

Troubleshooting

Common Issues

  1. High Cardinality
# Bad - Too many time series
- record: instance:http_requests:status_code_path
  expr: sum(http_requests_total) by (instance, status_code, path)

# Good - Reduced cardinality
- record: instance:http_requests:status_code
  expr: sum(http_requests_total) by (instance, status_code)
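  2. Evaluation Timeouts
# A rule has no timeout setting of its own. Split an expensive subquery into
# two cheaper rules instead; rules in a group are evaluated in order.
- record: job:slow_query_seconds:rate5m
  expr: sum(rate(slow_query_seconds_total[5m])) by (job)
- record: job:slow_query_seconds:max_rate5m_1h
  expr: max_over_time(job:slow_query_seconds:rate5m[1h])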
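  3. Memory Usage
# Resident memory of the Prometheus server
process_resident_memory_bytes{job="prometheus"}

# Allocation rate (bytes allocated per second)
rate(go_memstats_alloc_bytes_total{job="prometheus"}[5m])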
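
Many of these issues can be caught before deployment by unit-testing the rules with `promtool test rules`. The test file below is a sketch: the file name `rules_test.yaml`, the `api` job, and the assumption that `recording-rules.yaml` contains the `job:http_requests:rate5m` rule are all illustrative. The expected value assumes the perfectly linear counter defined in `input_series`.

```yaml
# rules_test.yaml (hypothetical file name)
# Run with: promtool test rules rules_test.yaml
rule_files:
  - recording-rules.yaml            # assumed to contain the job:http_requests:rate5m rule

evaluation_interval: 1m

tests:
  - interval: 1m                    # spacing of the synthetic samples below
    input_series:
      # Counter growing by 60 per minute, i.e. 1 request per second.
      - series: 'http_requests_total{job="api", instance="a"}'
        values: '0+60x10'
    promql_expr_test:
      - expr: job:http_requests:rate5m
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_requests:rate5m{job="api"}'
            value: 1
```

Running `promtool check rules recording-rules.yaml` beforehand validates the syntax of the rule file itself.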
