Prometheus Federation in Kubernetes
Setting Up Prometheus Federation in Kubernetes
This guide provides detailed instructions for implementing Prometheus Federation across multiple Kubernetes clusters, enabling centralized monitoring and alerting.
What is Prometheus Federation?
Prometheus federation lets a central Prometheus server scrape selected time series from other Prometheus servers through their /federate endpoint. This is particularly useful for:
- Multi-cluster monitoring
- Cross-datacenter monitoring
- Hierarchical monitoring setups
- Global view of metrics
Architecture Overview
Hierarchical Federation
            Global Prometheus
                    ↑
                    |
      +-------------+-------------+
      ↑             ↑             ↑
Regional Prom Regional Prom Regional Prom
      ↑             ↑             ↑
      |             |             |
  Cluster 1     Cluster 2     Cluster 3
Cross-Cluster Federation
Cluster A              Cluster B
+--------+             +--------+
|Prom A  |←-----------→|Prom B  |
+--------+             +--------+
    ↑                      ↑
    |                      |
Services A             Services B
Prerequisites
- Multiple Kubernetes clusters
- Prometheus installed on each cluster (the manifests below use Prometheus Operator resources such as Prometheus and ServiceMonitor)
- Network connectivity between clusters
- kubectl and helm configured for all clusters
- TLS certificates for secure communication
Implementation Steps
1. Global Prometheus Configuration
# global-prometheus-config.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: global-prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      federation: "true"
  resources:
    requests:
      memory: 4Gi
      cpu: 2
    limits:
      memory: 8Gi
      cpu: 4
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100Gi
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
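The additionalScrapeConfigs field above points at a key inside a Secret, and the operator appends that key's content to the scrape configuration it generates. The following is a minimal sketch of such a Secret, assuming it lives in the same monitoring namespace. The federation job inside it and the target host regional-prometheus.example.com are placeholders for illustration, not values taken from this guide.

```yaml
# additional-scrape-configs-secret.yaml (illustrative sketch)
apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs   # must match spec.additionalScrapeConfigs.name
  namespace: monitoring
type: Opaque
stringData:
  # The key must match spec.additionalScrapeConfigs.key.
  # The value is a YAML list of scrape configs, without a top-level "scrape_configs:" key.
  prometheus-additional.yaml: |
    - job_name: 'federate'
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{job=~"kubernetes-.*"}'
      static_configs:
        - targets:
            - 'regional-prometheus.example.com:9090'  # hypothetical target
```

Using stringData avoids base64-encoding the file by hand; the API server stores it encoded under data.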
2. Federation Service Monitors
# federation-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-federation
  namespace: monitoring
  labels:
    federation: "true"
spec:
  endpoints:
    - port: web               # Service port name to scrape
      interval: 30s
      scrapeTimeout: 25s
      path: /federate
      honorLabels: true       # Keep the job/instance labels of the federated series
      params:
        'match[]':
          - '{job=~".+"}'     # Adjust based on your needs
      scheme: https
      tlsConfig:
        caFile: /etc/prometheus/secrets/prometheus-ca/tls.crt
        certFile: /etc/prometheus/secrets/prometheus-client/tls.crt
        keyFile: /etc/prometheus/secrets/prometheus-client/tls.key
        serverName: prometheus.monitoring.svc
  selector:
    matchLabels:
      prometheus: federated
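The file paths in tlsConfig only resolve if the corresponding Secrets are mounted into the Prometheus pods. With the Prometheus Operator, Secrets listed under spec.secrets are mounted at /etc/prometheus/secrets/<secret-name>/. The fragment below is a sketch of that setting, assuming Secrets named prometheus-ca and prometheus-client (names inferred from the paths above) exist in the monitoring namespace. Merge it into the existing Prometheus resource rather than applying it on its own.

```yaml
# Fragment of the global Prometheus resource (sketch, not a complete manifest)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: global-prometheus
  namespace: monitoring
spec:
  secrets:
    - prometheus-ca       # mounted at /etc/prometheus/secrets/prometheus-ca/
    - prometheus-client   # mounted at /etc/prometheus/secrets/prometheus-client/
```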
3. Cross-Cluster Communication
# cross-cluster-endpoint.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: federated-prometheus    # Must match the Service name below
  namespace: monitoring
subsets:
  - addresses:
      - ip: "10.0.0.1"          # Remote Prometheus IP
    ports:
      - name: web
        port: 9090
        protocol: TCP
---
# cross-cluster-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: federated-prometheus
  namespace: monitoring
  labels:
    prometheus: federated       # Matched by the ServiceMonitor selector in step 2
spec:
  ports:
    - name: web
      port: 9090
      protocol: TCP
      targetPort: web
  type: ClusterIP
4. Federation Rules
# federation-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: federation-rules
  namespace: monitoring
spec:
  groups:
    - name: federation
      rules:
        # Non-idle CPU time as a fraction of total CPU time, per cluster
        - record: cluster:node_cpu:ratio_rate5m
          expr: |
            sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (cluster)
            /
            sum(rate(node_cpu_seconds_total[5m])) by (cluster)
        # Used memory as a fraction of total memory, per cluster
        - record: cluster:node_memory:ratio
          expr: |
            sum(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) by (cluster)
            /
            sum(node_memory_MemTotal_bytes) by (cluster)
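Recording rules like these can be checked offline with promtool's rule unit tests before they are deployed. The sketch below tests the memory ratio rule. It assumes the groups list from the PrometheusRule has been saved on its own as a plain Prometheus rule file named federation.rules.yaml (a hypothetical filename), since promtool reads rule files rather than Kubernetes resources. The cluster and instance label values are made up for the test.

```yaml
# federation-rules-test.yaml (illustrative sketch)
rule_files:
  - federation.rules.yaml   # hypothetical file containing only the "groups:" section
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One node with 1000 bytes total and 250 bytes available, constant over time
      - series: 'node_memory_MemTotal_bytes{cluster="prod-1",instance="node-a"}'
        values: '1000+0x5'
      - series: 'node_memory_MemAvailable_bytes{cluster="prod-1",instance="node-a"}'
        values: '250+0x5'
    promql_expr_test:
      - expr: cluster:node_memory:ratio
        eval_time: 5m
        exp_samples:
          # (1000 - 250) / 1000 = 0.75
          - labels: 'cluster:node_memory:ratio{cluster="prod-1"}'
            value: 0.75
```

Such a file is run with promtool test rules federation-rules-test.yaml.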
5. Secure Communication Setup
# prometheus-tls-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-tls
  namespace: monitoring
type: kubernetes.io/tls
data:
  tls.crt: base64_encoded_cert
  tls.key: base64_encoded_key
  ca.crt: base64_encoded_ca
6. Federation Job Configuration
# federation-job-config.yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    scheme: https               # Required for tls_config to take effect
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets:
          - 'prometheus-k8s-0.prometheus-operated:9090'
          - 'prometheus-k8s-1.prometheus-operated:9090'
    tls_config:
      ca_file: /etc/prometheus/secrets/prometheus-ca/tls.crt
      cert_file: /etc/prometheus/secrets/prometheus-client/tls.crt
      key_file: /etc/prometheus/secrets/prometheus-client/tls.key
      server_name: prometheus.monitoring.svc
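Matching {job=~".+"} federates every series, which can make /federate responses large and slow to scrape. A narrower variant, sketched below, pulls only pre-aggregated series. It assumes the source servers evaluate the cluster: recording rules from step 4; the job name federate-aggregates and the 30s interval are hypothetical choices.

```yaml
# federation-aggregates-config.yaml (illustrative sketch)
scrape_configs:
  - job_name: 'federate-aggregates'   # hypothetical job name
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    scheme: https
    params:
      'match[]':
        # Only series whose metric name starts with "cluster:",
        # i.e. the output of the recording rules
        - '{__name__=~"cluster:.*"}'
    static_configs:
      - targets:
          - 'prometheus-k8s-0.prometheus-operated:9090'
          - 'prometheus-k8s-1.prometheus-operated:9090'
    tls_config:
      ca_file: /etc/prometheus/secrets/prometheus-ca/tls.crt
      cert_file: /etc/prometheus/secrets/prometheus-client/tls.crt
      key_file: /etc/prometheus/secrets/prometheus-client/tls.key
      server_name: prometheus.monitoring.svc
```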
Advanced Configurations
1. Multi-Region Setup
# multi-region-config.yaml
global:
  external_labels:
    region: us-east-1
    cluster: prod-1
scrape_configs:
  - job_name: 'federate-regions'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{region=~"us-.*"}'
        - '{region=~"eu-.*"}'
    static_configs:
      - targets:
          - 'prometheus-us-east.example.com'
          - 'prometheus-us-west.example.com'
          - 'prometheus-eu-central.example.com'
2. High Availability Configuration
# ha-federation-config.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: ha-prometheus
spec:
  replicas: 3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: prometheus
                operator: In
                values:
                  - ha-prometheus
          topologyKey: kubernetes.io/hostname
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast
        resources:
          requests:
            storage: 100Gi
3. Custom Retention and Compaction
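# retention-config.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: federated-prometheus
spec:
  retention: 90d
  retentionSize: 500GB
  tsdb:
    outOfOrderTimeWindow: 12h
    # Check `kubectl explain prometheus.spec.tsdb` before relying on the two
    # block-duration fields below; not every operator version accepts them.
    minBlockDuration: 2h
    maxBlockDuration: 24h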
Federation Patterns
1. Hub and Spoke Pattern
        Hub Prometheus
               ↑
    +----------+----------+
    ↑          ↑          ↑
Spoke Prom Spoke Prom Spoke Prom
    ↑          ↑          ↑
Cluster 1  Cluster 2  Cluster 3
Configuration Example:
# hub-prometheus-config.yaml
scrape_configs:
  - job_name: 'federate-spokes'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
        - '{job="node-exporter"}'
    static_configs:
      - targets:
          - 'spoke1-prometheus:9090'
          - 'spoke2-prometheus:9090'
          - 'spoke3-prometheus:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: cluster
        regex: '(.*)-prometheus:9090'
        replacement: '$1'
2. Mesh Federation Pattern
Prom A ←→ Prom B
  ↕         ↕
Prom C ←→ Prom D
Configuration Example:
# mesh-federation-config.yaml
scrape_configs:
  - job_name: 'federate-mesh'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"critical-.*"}'
    static_configs:
      - targets:
          - 'prometheus-a:9090'
          - 'prometheus-b:9090'
          - 'prometheus-c:9090'
          - 'prometheus-d:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: peer
        regex: '(.+):9090'
        replacement: '$1'
3. Hierarchical Sharding Pattern
         Global Prometheus
                 ↑
    +------------+------------+
    ↑            ↑            ↑
 Shard 1      Shard 2      Shard 3
(Region A)   (Region B)   (Region C)
    ↑            ↑            ↑
 Clusters     Clusters     Clusters
Configuration Example:
# shard-federation-config.yaml
scrape_configs:
  - job_name: 'federate-shards'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{shard="1"}'
        - '{shard="2"}'
        - '{shard="3"}'
    static_configs:
      - targets:
          - 'shard1-prometheus:9090'
          - 'shard2-prometheus:9090'
          - 'shard3-prometheus:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: shard
        regex: 'shard(\d+)-prometheus:9090'
        replacement: '$1'
Enhanced Multi-Region Setup
1. Global Load Balancing
# global-lb-config.yaml
apiVersion: v1
kind: Service
metadata:
  name: global-prometheus
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app: prometheus
    component: global
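A LoadBalancer Service of this kind can make the Prometheus API reachable from outside the cluster. One way to narrow access is the standard loadBalancerSourceRanges field of the Service spec. The sketch below repeats the Service with that field added; the CIDR 203.0.113.0/24 is a documentation placeholder, not a value from this guide.

```yaml
# global-lb-restricted.yaml (illustrative sketch)
apiVersion: v1
kind: Service
metadata:
  name: global-prometheus
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
spec:
  type: LoadBalancer
  # Only clients in these ranges may connect through the load balancer
  loadBalancerSourceRanges:
    - 203.0.113.0/24   # placeholder: replace with the networks of the scraping servers
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    app: prometheus
    component: global
```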
2. Regional Configuration with AWS
# aws-regions-config.yaml
global:
  external_labels:
    region: ${AWS_REGION}
    environment: production
    cloud_provider: aws
scrape_configs:
  - job_name: 'federate-aws-regions'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*",region=~"us-.*"}'
        - '{job=~"kubernetes-.*",region=~"eu-.*"}'
        - '{job=~"kubernetes-.*",region=~"ap-.*"}'
    ec2_sd_configs:
      - region: us-east-1
        port: 9090
        filters:
          - name: tag:Component
            values: [prometheus]
      - region: eu-west-1
        port: 9090
        filters:
          - name: tag:Component
            values: [prometheus]
      - region: ap-southeast-1
        port: 9090
        filters:
          - name: tag:Component
            values: [prometheus]
    relabel_configs:
      - source_labels: [__meta_ec2_availability_zone]
        target_label: zone
      - source_labels: [__meta_ec2_region]
        target_label: region
3. Regional Configuration with GCP
# gcp-regions-config.yaml
global:
  external_labels:
    region: ${GCP_REGION}
    environment: production
    cloud_provider: gcp
scrape_configs:
  - job_name: 'federate-gcp-regions'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*",region=~"us-.*"}'
        - '{job=~"kubernetes-.*",region=~"europe-.*"}'
        - '{job=~"kubernetes-.*",region=~"asia-.*"}'
    gce_sd_configs:
      - project: your-project
        zone: us-central1-a
        port: 9090              # gce_sd_configs defaults to port 80 when unset
        filter: 'labels.component="prometheus"'
      - project: your-project
        zone: europe-west1-b
        port: 9090
        filter: 'labels.component="prometheus"'
      - project: your-project
        zone: asia-east1-a
        port: 9090
        filter: 'labels.component="prometheus"'
    relabel_configs:
      - source_labels: [__meta_gce_zone]
        target_label: zone
      - source_labels: [__meta_gce_project]
        target_label: project
4. Regional Configuration with Azure
# azure-regions-config.yaml
global:
  external_labels:
    region: ${AZURE_REGION}
    environment: production
    cloud_provider: azure
scrape_configs:
  - job_name: 'federate-azure-regions'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*",region=~"eastus.*"}'
        - '{job=~"kubernetes-.*",region=~"westeurope.*"}'
        - '{job=~"kubernetes-.*",region=~"southeastasia.*"}'
    azure_sd_configs:
      - subscription_id: your-subscription-id
        tenant_id: your-tenant-id
        client_id: your-client-id
        client_secret: your-client-secret
        port: 9090
        resource_group: prometheus-rg
    relabel_configs:
      - source_labels: [__meta_azure_machine_location]
        target_label: region
      - source_labels: [__meta_azure_machine_resource_group]
        target_label: resource_group
5. Cross-Cloud Federation
# cross-cloud-federation.yaml
scrape_configs:
  - job_name: 'federate-cross-cloud'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
    static_configs:
      - targets:
          - 'aws-prometheus.example.com:9090'
          - 'gcp-prometheus.example.com:9090'
          - 'azure-prometheus.example.com:9090'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(aws|gcp|azure)-prometheus.*'
        target_label: cloud_provider
        replacement: '$1'
6. Regional Data Retention Policies
# regional-retention-config.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: regional-prometheus
spec:
  # Regional instances keep less data
  retention: 15d
  retentionSize: 100GB
  tsdb:
    outOfOrderTimeWindow: 6h
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: global-prometheus
spec:
  # Global instance keeps more historical data
  retention: 90d
  retentionSize: 500GB
  tsdb:
    outOfOrderTimeWindow: 12h
  resources:
    requests:
      cpu: 4
      memory: 8Gi
    limits:
      cpu: 8
      memory: 16Gi
Best Practices
1. Resource Management
resources:
  requests:
    cpu: 4
    memory: 8Gi
  limits:
    cpu: 8
    memory: 16Gi
2. Network Policies
# federation-network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-federation
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on every namespace (Kubernetes 1.21+)
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9090
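A namespaceSelector only matches traffic that originates from pods in the same cluster. When the scraping Prometheus runs in another cluster, its requests may arrive from an address outside the pod network, in which case an ipBlock rule is one option. The sketch below assumes the remote cluster's traffic comes from 10.0.0.0/24; that CIDR and the policy name are placeholders.

```yaml
# federation-network-policy-external.yaml (illustrative sketch)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-federation-external   # hypothetical name
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/24   # placeholder: source range of the remote Prometheus
      ports:
        - protocol: TCP
          port: 9090
```

Network policies are additive, so this policy can sit alongside a namespace-based one.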
3. Data Retention
# retention-config.yaml
spec:
  retention: 30d
  retentionSize: 300GB
  tsdb:
    outOfOrderTimeWindow: 6h
Troubleshooting
1. Federation Connectivity Issues
# Check federation endpoints
kubectl get endpoints -n monitoring
# Test federation scrape (quote the URL and let curl encode the matcher)
curl -k -G 'https://prometheus-federate:9090/federate' \
  --data-urlencode 'match[]={job="kubernetes-nodes"}'
# Check TLS certificates
kubectl get secrets -n monitoring prometheus-tls -o yaml
2. Performance Issues
# Check TSDB stats
curl -s http://localhost:9090/api/v1/status/tsdb
# Monitor scrape performance (PromQL; run in the Prometheus expression browser)
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m])
# Check storage usage
kubectl exec -it prometheus-k8s-0 -n monitoring -- df -h /prometheus
3. Configuration Validation
# Validate configuration
promtool check config prometheus.yml
# Check ServiceMonitor status
kubectl get servicemonitor -n monitoring
# Verify scrape targets
kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring
# Visit http://localhost:9090/targets
Monitoring Federation
1. Federation Health Metrics
# Scrape duration of the federation job
scrape_duration_seconds{job="federate"}
# Failed scrapes (federation targets currently down)
up{job="federate"} == 0
# Sample ingestion rate
rate(prometheus_tsdb_head_samples_appended_total[5m])
2. Storage Metrics
# TSDB size
prometheus_tsdb_storage_blocks_bytes
# Average compaction duration
rate(prometheus_tsdb_compaction_duration_seconds_sum[5m])
/
rate(prometheus_tsdb_compaction_duration_seconds_count[5m])
# WAL corruptions
prometheus_tsdb_wal_corruptions_total
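These queries can also drive alerts rather than being checked by hand. The PrometheusRule below is a minimal sketch, assuming the federation job is named federate as in step 6 and that the Prometheus resource's rule selector picks up this object. The resource name, alert names, durations and severities are illustrative choices.

```yaml
# federation-alerts.yaml (illustrative sketch)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: federation-alerts   # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: federation-health
      rules:
        # Fires when a federation target has failed its scrapes for 5 minutes
        - alert: FederationTargetDown
          expr: up{job="federate"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Federation scrape of {{ $labels.instance }} is failing"
        # Fires when the WAL corruption counter has increased in the last hour
        - alert: PrometheusWALCorruption
          expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
          labels:
            severity: critical
          annotations:
            summary: "WAL corruption detected on {{ $labels.instance }}"
```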