Cortex Integration with Prometheus in Kubernetes
Scale Prometheus with Cortex for multi-tenancy and long-term storage
Cortex is a horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus. This guide covers Cortex setup and integration with Prometheus in Kubernetes.
Architecture Components
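- Distributor: Receives remote-write samples from Prometheus, validates them, and forwards them to the ingesters
- Ingester: Holds recent samples in memory and flushes them to long-term storage
- Querier: Executes PromQL queries against the ingesters and long-term storage
- Query Frontend: Splits, caches, and schedules queries ahead of the queriers
- Store Gateway: Serves queries for historical data from long-term storage
- Compactor: Compacts stored blocks in long-term storage
- Ruler: Evaluates recording and alerting rules
- Alertmanager: Handles alert notifications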
Installation
1. Using Helm
# Add Cortex helm repo
helm repo add cortex-helm https://cortexproject.github.io/cortex-helm-chart
helm repo update
# Install Cortex
helm install cortex cortex-helm/cortex \
  --namespace monitoring \
  --create-namespace \
  --values cortex-values.yaml
2. Basic Configuration
# cortex-values.yaml
ingester:
  replicas: 3
  persistence:
    enabled: true
    size: 100Gi
distributor:
  replicas: 3
querier:
  replicas: 2
ruler:
  enabled: true
  replicas: 2
store_gateway:
  enabled: true
  replicas: 2
compactor:
  enabled: true
  persistence:
    size: 100Gi
alertmanager:
  enabled: true
  replicas: 2
Prometheus Integration
1. Prometheus Configuration
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: http://cortex-distributor:9009/api/v1/push
      writeRelabelConfigs:
        # Forward only the "up" series and Prometheus self-metrics to Cortex
        - sourceLabels: [__name__]
          regex: 'up|prometheus_.*'
          action: keep
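Remote-write throughput is governed by the queue settings on each remoteWrite entry. The following is a minimal sketch, assuming the Prometheus Operator's queueConfig fields; the numbers are illustrative starting points rather than values from this guide, and should be tuned against the observed ingestion rate.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: http://cortex-distributor:9009/api/v1/push
      queueConfig:
        capacity: 10000          # samples buffered per shard (illustrative)
        maxShards: 50            # upper bound on parallel senders (illustrative)
        maxSamplesPerSend: 2000  # batch size per request (illustrative)
        batchSendDeadline: 5s    # flush a partial batch after this long
```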
2. Storage Configuration
storage:
  backend: s3
  s3:
    bucket_name: cortex-storage
    endpoint: s3.amazonaws.com
    # Placeholder credentials (AWS documentation examples); never commit real keys
    access_key_id: AKIAIOSFODNN7EXAMPLE
    secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    region: us-east-1
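Rather than writing keys into the configuration, the credentials can be kept in a Kubernetes Secret and injected as environment variables. The sketch below assumes a hypothetical Secret named cortex-s3-credentials, that the Cortex pods load it through envFrom, and that Cortex is started with -config.expand-env=true so that ${VAR} references in the config file are expanded.

```yaml
# Hypothetical Secret; the name, keys, and values are placeholders
apiVersion: v1
kind: Secret
metadata:
  name: cortex-s3-credentials
  namespace: monitoring
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<your-access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<your-secret-access-key>"
---
# Storage block referencing environment variables instead of literal keys.
# Assumes -config.expand-env=true and the Secret exposed via envFrom.
storage:
  backend: s3
  s3:
    bucket_name: cortex-storage
    endpoint: s3.amazonaws.com
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
    region: us-east-1
```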
Component Setup
1. Distributor Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-distributor-config
data:
  cortex.yaml: |
    distributor:
      ring:
        kvstore:
          store: consul
          consul:
            host: consul:8500
      shard_by_all_labels: true
      pool:
        health_check_ingesters: true
2. Ingester Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-ingester-config
data:
  cortex.yaml: |
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: consul
          replication_factor: 3
      chunk_idle_period: 30m
      max_chunk_age: 2h
      chunk_target_size: 1536000
      max_transfer_retries: 10
3. Query Frontend Setup
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-query-frontend-config
data:
  cortex.yaml: |
    # Splitting, step alignment, and results caching for the query frontend
    # are configured in the query_range block
    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_bytes: 1073741824
      split_queries_by_interval: 24h
Multi-tenancy Configuration
1. Authentication Setup
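# With auth_enabled: true, Cortex requires a tenant ID in the
# X-Scope-OrgID header on every request
auth_enabled: true
auth:
  type: enterprise
  enterprise:
    url: http://auth:9090/api/users
    client_id: cortex
    client_secret: secret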
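Once authentication is enabled, Cortex identifies the tenant from the X-Scope-OrgID HTTP header, so every Prometheus that writes to it has to send that header (or sit behind a proxy that injects it). A minimal sketch, assuming the Prometheus Operator's remoteWrite headers field and reusing the tenant1 ID from this guide:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: http://cortex-distributor:9009/api/v1/push
      headers:
        X-Scope-OrgID: tenant1   # tenant ID that Cortex stores these samples under
```

Sending the header directly from Prometheus is the simplest option for trusted clusters; an authenticating reverse proxy that sets the header is the safer choice when tenants are not trusted to pick their own ID.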
2. Tenant Configuration
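# Cortex configuration: where to find per-tenant overrides and how often to reload them
limits:
  per_user_override_config: /etc/cortex/overrides.yaml
  per_user_override_period: 10s

# /etc/cortex/overrides.yaml
overrides:
  tenant1:
    ingestion_rate: 10000
    ingestion_burst_size: 20000
    max_series_per_metric: 100000
    max_series_per_query: 100000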
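In Kubernetes the overrides file is typically delivered as a ConfigMap mounted at the path named in per_user_override_config. The following sketch assumes a hypothetical ConfigMap called cortex-overrides mounted at /etc/cortex, and adds a second, hypothetical tenant2 to show how limits can differ per tenant:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-overrides      # hypothetical name; mount at /etc/cortex
  namespace: monitoring
data:
  overrides.yaml: |
    overrides:
      tenant1:
        ingestion_rate: 10000
        ingestion_burst_size: 20000
      tenant2:                # hypothetical second tenant with tighter limits
        ingestion_rate: 5000
        ingestion_burst_size: 10000
```

Because the file is re-read every per_user_override_period, updating the ConfigMap changes limits without restarting Cortex, subject to the delay before the kubelet refreshes the mounted volume.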
High Availability Setup
1. Replication Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-ha-config
data:
  cortex.yaml: |
    distributor:
      replication_factor: 3
    ingester:
      lifecycler:
        ring:
          replication_factor: 3
2. Zone-Aware Setup
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cortex-ingester
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: name
                    operator: In
                    values:
                      - cortex-ingester
              # kubernetes.io/hostname keeps ingesters on separate nodes;
              # spreading across zones requires topology.kubernetes.io/zone
              topologyKey: kubernetes.io/hostname
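To distribute ingesters across availability zones rather than only across nodes, a topology spread constraint on the zone label is one option. This is a sketch, assuming the nodes carry the standard topology.kubernetes.io/zone label and the ingester pods are labelled name: cortex-ingester as in the anti-affinity rule:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cortex-ingester
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              name: cortex-ingester
```

Unlike a required anti-affinity rule on the zone key, this allows more than one ingester per zone while still keeping the zones balanced. Cortex also has its own ring-level zone awareness setting, which would need to be enabled separately for replicas of a series to land in different zones.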
Monitoring and Alerting
1. Component Monitoring
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cortex-components
spec:
  selector:
    matchLabels:
      app: cortex
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 15s
2. Alert Rules
groups:
  - name: cortex.rules
    rules:
      - alert: CortexIngesterUnhealthy
        # 3 is the replication_factor configured for the ingester ring;
        # PromQL needs a literal here, not a config key name
        expr: |
          min(cortex_ring_members{state="ACTIVE", name="ingester"}) without(instance)
            < 3
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Cortex ingester unhealthy
      - alert: CortexRequestErrors
        # Fires when more than 1% of requests return a 5xx status
        expr: |
          100 * sum(rate(cortex_request_duration_seconds_count{status_code=~"5.."}[1m]))
            /
          sum(rate(cortex_request_duration_seconds_count[1m]))
            > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Cortex request errors
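When Prometheus is managed by the Prometheus Operator, as the Prometheus and ServiceMonitor resources in this guide suggest, rule groups are loaded from PrometheusRule objects rather than plain rule files. A minimal sketch wrapping one of the alerts above; the resource name and the release label are hypothetical and must match the ruleSelector of your Prometheus resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cortex-rules            # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus         # hypothetical; must match spec.ruleSelector
spec:
  groups:
    - name: cortex.rules
      rules:
        - alert: CortexRequestErrors
          expr: |
            100 * sum(rate(cortex_request_duration_seconds_count{status_code=~"5.."}[1m]))
              /
            sum(rate(cortex_request_duration_seconds_count[1m]))
              > 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Cortex request errors
```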
Query and Visualization
1. Grafana Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Cortex
        type: prometheus
        url: http://cortex-query-frontend:9009/prometheus
        access: proxy
        isDefault: true
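With multi-tenancy enabled, Grafana also has to identify the tenant on every query. One way to do that, sketched here under the assumption that one data source is provisioned per tenant, is to attach the X-Scope-OrgID header through Grafana's custom HTTP header settings:

```yaml
apiVersion: 1
datasources:
  - name: Cortex (tenant1)      # hypothetical per-tenant data source name
    type: prometheus
    url: http://cortex-query-frontend:9009/prometheus
    access: proxy
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: tenant1  # tenant ID sent with every query
```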
2. Example PromQL Queries
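# Filter on a tenant label (assumes samples carry a "tenant" label, for example
# from Prometheus external labels; Cortex itself selects the tenant from the
# X-Scope-OrgID request header, not from a label)
sum by(job) (rate(http_requests_total{tenant="tenant1"}[5m]))
# Aggregate across tenant label values (if authorized)
sum by(tenant, job) (rate(http_requests_total[5m]))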
Best Practices
- Storage Management
  - Configure appropriate retention
  - Monitor storage usage
  - Implement bucket lifecycle
- Query Optimization
  - Use query caching
  - Configure query limits
  - Monitor query performance
- Resource Management
  - Set appropriate limits
  - Monitor component health
  - Scale based on metrics
- Security
  - Enable authentication
  - Configure RBAC
  - Implement network policies
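As an illustration of the network policy recommendation, the sketch below restricts ingress to the distributor so that only Prometheus pods can push samples. The pod labels are hypothetical and need to match the labels actually set by your charts; port 9009 is taken from the remote-write URL used earlier in this guide.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: cortex-distributor-ingress   # hypothetical name
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      name: cortex-distributor       # hypothetical distributor pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus   # hypothetical Prometheus pod label
      ports:
        - protocol: TCP
          port: 9009
```

The policy only takes effect if the cluster's network plugin enforces NetworkPolicy resources.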
Troubleshooting
Common Issues
- Ingestion Issues
# Check ingestion rate
rate(cortex_distributor_received_samples_total[5m])
# Check ingestion errors
rate(cortex_discarded_samples_total[5m])
- Query Issues
# Check query latency
histogram_quantile(0.99, sum(rate(cortex_query_frontend_query_duration_seconds_bucket[5m])) by (le))
# Check query errors
rate(cortex_query_frontend_queries_failed_total[5m])
- Storage Issues
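# Check average index entries per chunk (the metric is a histogram, so divide _sum by _count)
rate(cortex_chunk_store_index_entries_per_chunk_sum[5m])
  / rate(cortex_chunk_store_index_entries_per_chunk_count[5m])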
# Check store errors
rate(cortex_storage_request_duration_seconds_count{status_code=~"5.."}[5m])
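The ingestion check above can be turned into an alert so that discarded samples are noticed without manual querying. A minimal sketch of a rule group, assuming cortex_discarded_samples_total carries user and reason labels; the alert name, threshold, and duration are illustrative:

```yaml
groups:
  - name: cortex.troubleshooting     # hypothetical group name
    rules:
      - alert: CortexSamplesDiscarded
        # Any sustained discard rate, broken down by tenant and reason
        expr: sum by (user, reason) (rate(cortex_discarded_samples_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cortex is discarding samples for tenant {{ $labels.user }} ({{ $labels.reason }})"
```

The reason label usually points at the cause, such as a rate limit or per-tenant series limit being hit, which indicates whether to raise a limit in the overrides file or fix the sender.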