Cortex Integration with Prometheus in Kubernetes

Scale Prometheus with Cortex for multi-tenancy and long-term storage

Cortex provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. This guide covers setting up Cortex on Kubernetes and configuring Prometheus to send metrics to it through remote write.

Architecture Components

  1. Distributor: Receives and validates remote-write requests from Prometheus, then forwards samples to the ingesters
  2. Ingester: Holds recent samples in memory and flushes them to long-term storage
  3. Query Frontend: Optimizes and schedules queries
  4. Store Gateway: Handles queries for historical data
  5. Ruler: Evaluates recording and alerting rules
  6. Alertmanager: Handles alert notifications

Installation

1. Using Helm

# Add Cortex helm repo
helm repo add cortex-helm https://cortexproject.github.io/cortex-helm-chart
helm repo update

# Install Cortex
helm install cortex cortex-helm/cortex \
  --namespace monitoring \
  --create-namespace \
  --values cortex-values.yaml

2. Basic Configuration

# cortex-values.yaml
ingester:
  replicas: 3
  persistence:
    enabled: true
    size: 100Gi

distributor:
  replicas: 3

querier:
  replicas: 2

ruler:
  enabled: true
  replicas: 2

store_gateway:
  enabled: true
  replicas: 2

compactor:
  enabled: true
  persistence:
    size: 100Gi

alertmanager:
  enabled: true
  replicas: 2

Prometheus Integration

1. Prometheus Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: http://cortex-distributor:9009/api/v1/push
      writeRelabelConfigs:
        - sourceLabels: [__name__]
          regex: 'up|prometheus_.*'
          action: keep
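
When Cortex runs with auth_enabled: true, each remote-write request has to name its tenant in the X-Scope-OrgID header. The sketch below extends the remoteWrite entry with that header and a queue configuration. The tenant ID tenant1 and the queue numbers are illustrative assumptions, not tuned values.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: http://cortex-distributor:9009/api/v1/push
      # Assumed tenant ID; Cortex reads the tenant from this header.
      headers:
        X-Scope-OrgID: tenant1
      # Illustrative queue settings; adjust them against observed remote-write lag.
      queueConfig:
        capacity: 10000
        maxShards: 50
        maxSamplesPerSend: 2000
```

The headers field needs a reasonably recent Prometheus and Prometheus Operator, so check the versions in your cluster before relying on it.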

2. Storage Configuration

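# The keys below are AWS documentation placeholders. Supply real credentials
# from a Kubernetes Secret or an IAM role rather than committing them.
blocks_storage:
  backend: s3
  s3:
    bucket_name: cortex-storage
    endpoint: s3.amazonaws.com
    access_key_id: AKIAIOSFODNN7EXAMPLE
    secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    region: us-east-1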
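
Keeping the S3 credentials out of the configuration file is usually safer than writing them inline. One possible arrangement, sketched below, stores them in a Secret and exposes them to the Cortex pods as environment variables. The Secret name, the container name, and the variable names are assumptions. Referencing the variables from the configuration as ${AWS_ACCESS_KEY_ID} and ${AWS_SECRET_ACCESS_KEY} additionally assumes Cortex is started with -config.expand-env=true, which should be verified for your version.

```yaml
# Hypothetical Secret holding the S3 credentials.
apiVersion: v1
kind: Secret
metadata:
  name: cortex-s3-credentials
  namespace: monitoring
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: REPLACE_ME
  AWS_SECRET_ACCESS_KEY: REPLACE_ME
---
# Fragment of a Cortex pod spec (not a complete manifest): every key of the
# Secret becomes an environment variable in the container.
containers:
  - name: cortex
    envFrom:
      - secretRef:
          name: cortex-s3-credentials
```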

Component Setup

1. Distributor Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-distributor-config
data:
  cortex.yaml: |
    distributor:
      ring:
        kvstore:
          store: consul
          consul:
            host: consul:8500
      shard_by_all_labels: true
      pool:
        health_check_ingesters: true

2. Ingester Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-ingester-config
data:
  cortex.yaml: |
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: consul
            # Same Consul instance as the distributor ring
            consul:
              host: consul:8500
          replication_factor: 3
      chunk_idle_period: 30m
      max_chunk_age: 2h
      chunk_target_size: 1536000
      max_transfer_retries: 10

3. Query Frontend Setup

apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-query-frontend-config
data:
  cortex.yaml: |
    query_frontend:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_bytes: 1073741824
      split_queries_by_interval: 24h

Multi-tenancy Configuration

1. Authentication Setup

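# Require every request to carry an X-Scope-OrgID header naming the tenant.
auth_enabled: true

# Cortex does not authenticate requests itself. The block below assumes an
# external or enterprise authentication layer; confirm that your distribution
# supports it, and load client_secret from a Secret instead of a literal value.
auth:
  type: enterprise
  enterprise:
    url: http://auth:9090/api/users
    client_id: cortex
    client_secret: secret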

2. Tenant Configuration

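# cortex.yaml: point Cortex at the per-tenant overrides file
limits:
  per_tenant_override_config: /etc/cortex/overrides.yaml
  per_tenant_override_period: 10s

# /etc/cortex/overrides.yaml: limits applied to individual tenants
overrides:
  tenant1:
    ingestion_rate: 10000
    ingestion_burst_size: 20000
    max_series_per_metric: 100000
    max_series_per_query: 100000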
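
On Kubernetes the overrides file is typically delivered through a ConfigMap mounted at /etc/cortex. The sketch below shows one way to do that. The ConfigMap name and the second tenant are hypothetical, and the volume mount on the Cortex pods is assumed rather than shown.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-overrides   # hypothetical name
  namespace: monitoring
data:
  overrides.yaml: |
    overrides:
      tenant1:
        ingestion_rate: 10000
        ingestion_burst_size: 20000
      # Hypothetical second tenant with a tighter ingestion budget
      tenant2:
        ingestion_rate: 5000
        ingestion_burst_size: 10000
```

Because Cortex re-reads the file on the configured period, editing the ConfigMap changes tenant limits without restarting pods, once the kubelet has synced the mounted file.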

High Availability Setup

1. Replication Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-ha-config
data:
  cortex.yaml: |
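    # The replication factor belongs to the ingester ring. The distributor
    # reads it from the ring and has no replication_factor setting of its own.
    ingester:
      lifecycler:
        ring:
          replication_factor: 3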
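
With a replication factor of 3, writes succeed as long as two of the three replicas for a series accept them, so losing more than one ingester at a time puts ingestion at risk. A PodDisruptionBudget can stop voluntary disruptions such as node drains from removing several ingesters together. This is a sketch; the resource name and the name: cortex-ingester pod label are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cortex-ingester-pdb   # hypothetical name
  namespace: monitoring
spec:
  # Allow at most one ingester to be evicted at a time
  maxUnavailable: 1
  selector:
    matchLabels:
      name: cortex-ingester   # assumed pod label
```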

2. Zone-Aware Setup

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cortex-ingester
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: name
                operator: In
                values:
                - cortex-ingester
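            # kubernetes.io/hostname keeps ingesters on separate nodes. Use
            # topology.kubernetes.io/zone to keep them in separate zones instead;
            # with a required rule, that caps replicas at the number of zones.
            topologyKey: kubernetes.io/hostname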
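
Pod scheduling alone does not tell Cortex which zone an ingester is in. For replicas of a series to land in different zones, the ring also needs zone awareness enabled and each ingester needs to advertise its zone. A minimal sketch, assuming a zone named zone-a and that the value is set separately for each zone (for example, one StatefulSet per zone):

```yaml
ingester:
  lifecycler:
    # Zone this ingester runs in; assumed to differ per StatefulSet
    availability_zone: zone-a
    ring:
      replication_factor: 3
      zone_awareness_enabled: true
```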

Monitoring and Alerting

1. Component Monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cortex-components
spec:
  selector:
    matchLabels:
      app: cortex
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 15s

2. Alert Rules

groups:
- name: cortex.rules
  rules:
  - alert: CortexIngesterUnhealthy
    expr: |
      # 3 is the ingester replication factor configured earlier in this guide
      min without(instance) (cortex_ring_members{state="ACTIVE", name="ingester"})
      < 3
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: Cortex ingester unhealthy
  
  - alert: CortexRequestErrors
    expr: |
      100 * sum(rate(cortex_request_duration_seconds_count{status_code=~"5.."}[1m]))
      /
      sum(rate(cortex_request_duration_seconds_count[1m]))
      > 1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Cortex request errors
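
When rules are managed by the Prometheus Operator, rule groups are wrapped in a PrometheusRule resource. The sketch below does that for one additional alert on discarded samples. The resource name, the release label, and the threshold are assumptions to adapt to your setup.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cortex-rules   # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus   # assumed label matched by the Prometheus ruleSelector
spec:
  groups:
  - name: cortex.discards
    rules:
    - alert: CortexSamplesDiscarded
      # Fires when any discard reason shows a sustained non-zero rate
      expr: sum by (reason) (rate(cortex_discarded_samples_total[5m])) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Cortex is discarding samples
```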

Query and Visualization

1. Grafana Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
    - name: Cortex
      type: prometheus
      url: http://cortex-query-frontend:9009/prometheus
      access: proxy
      isDefault: true
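
If multi-tenancy is enabled, Grafana has to send the tenant ID with every query. One way is a custom HTTP header in the datasource provisioning file, sketched here for an assumed tenant named tenant1:

```yaml
apiVersion: 1
datasources:
- name: Cortex (tenant1)
  type: prometheus
  url: http://cortex-query-frontend:9009/prometheus
  access: proxy
  jsonData:
    httpHeaderName1: X-Scope-OrgID
  secureJsonData:
    httpHeaderValue1: tenant1   # assumed tenant ID
```

A separate datasource per tenant keeps dashboards from mixing data across tenants by accident.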

2. Example PromQL Queries

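# Query within one tenant. Cortex selects the tenant from the X-Scope-OrgID
# request header, so the query itself needs no tenant label.
sum by(job) (rate(http_requests_total[5m]))

# Cross-tenant query (if authorized). Assumes tenant federation is enabled and
# the request lists several tenants in X-Scope-OrgID; Cortex then adds a
# __tenant_id__ label to each series.
sum by(__tenant_id__, job) (rate(http_requests_total[5m]))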
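
Cross-tenant queries depend on tenant federation, which is off by default. A minimal sketch of the Cortex setting, to be checked against the configuration reference for your version:

```yaml
tenant_federation:
  # Lets one query span the tenants listed in X-Scope-OrgID,
  # separated by "|" (for example tenant1|tenant2)
  enabled: true
```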

Best Practices

  1. Storage Management

    • Configure appropriate retention
    • Monitor storage usage
    • Implement bucket lifecycle
  2. Query Optimization

    • Use query caching
    • Configure query limits
    • Monitor query performance
  3. Resource Management

    • Set appropriate limits
    • Monitor component health
    • Scale based on metrics
  4. Security

    • Enable authentication
    • Configure RBAC
    • Implement network policies
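
As an illustration of the network policy recommendation, the sketch below admits traffic to the distributor only from Prometheus pods. The policy name, both pod labels, and the port are assumptions to match to your deployment. Other clients of the distributor, such as anything scraping its metrics from outside those pods, would need their own rules.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cortex-distributor-ingress   # hypothetical name
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      name: cortex-distributor   # assumed distributor pod label
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus   # assumed Prometheus pod label
    ports:
    - protocol: TCP
      port: 9009
```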

Troubleshooting

Common Issues

  1. Ingestion Issues
# Check ingestion rate
rate(cortex_distributor_received_samples_total[5m])

# Check ingestion errors
rate(cortex_discarded_samples_total[5m])
  2. Query Issues
# Check query latency
histogram_quantile(0.99, sum(rate(cortex_query_frontend_query_duration_seconds_bucket[5m])) by (le))

# Check query errors
rate(cortex_query_frontend_queries_failed_total[5m])
  3. Storage Issues
# Check chunk operations
rate(cortex_chunk_store_index_entries_per_chunk_count[5m])

# Check store errors
rate(cortex_storage_request_duration_seconds_count{status_code=~"5.."}[5m])

Additional Resources