Thanos Integration with Prometheus in Kubernetes

Scale Prometheus with Thanos for a global view and long-term storage

Thanos extends Prometheus with long-term retention in object storage, high availability, and a global query view across multiple Prometheus instances. This guide covers setting up and configuring Thanos in Kubernetes.

Architecture Components

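  1. Thanos Sidecar: Runs next to each Prometheus, uploads its TSDB blocks to object storage, and serves recent data to Thanos Query
  2. Thanos Store: Serves historical metrics from object storage (also called the Store Gateway)
  3. Thanos Query: Provides a global, deduplicated query view across Prometheus instances and stores
  4. Thanos Compactor: Compacts and downsamples blocks in object storage and applies retention
  5. Thanos Ruler: Evaluates recording and alerting rules against Thanos Query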

Installation

1. Using Helm

# Add bitnami repo
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Install Thanos
helm install thanos bitnami/thanos \
  --namespace monitoring \
  --create-namespace \
  --values thanos-values.yaml

2. Basic Configuration

# thanos-values.yaml
objstoreConfig:
  type: s3
  config:
    bucket: thanos-metrics
    endpoint: s3.amazonaws.com
    access_key: AKIAIOSFODNN7EXAMPLE
    secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    region: us-east-1

query:
  enabled: true
  replicaCount: 2
  
# The Bitnami chart names the Store Gateway section "storegateway"
storegateway:
  enabled: true
  replicaCount: 2

compactor:
  enabled: true
  retentionResolutionRaw: 30d
  retentionResolution5m: 90d
  retentionResolution1h: 1y
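
Access keys written inline in a values file tend to end up in version control. As a hedged alternative, the Bitnami chart can read the object storage configuration from an existing Secret. The sketch below assumes a Secret named thanos-objstore-config whose key is thanos.yaml (the names used in the Object Storage Configuration step of this guide) and remaps that key to the file name the chart expects. Verify both value names against the values reference of the chart version you install.

```yaml
# thanos-values.yaml (sketch: use instead of the inline objstoreConfig block)
existingObjstoreSecret: thanos-objstore-config   # Secret must exist in the release namespace
existingObjstoreSecretItems:
  - key: thanos.yaml      # key inside the Secret (assumed from this guide)
    path: objstore.yml    # file name the chart mounts for its components
```

With this in place the objstoreConfig block and its credentials can be dropped from the values file.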

Prometheus Integration

1. Prometheus Configuration with Thanos Sidecar

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  thanos:
    # baseImage is deprecated in the Prometheus Operator; image carries the full reference
    image: quay.io/thanos/thanos:v0.31.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config
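
Thanos tells Prometheus instances apart by their external labels, and deduplication relies on a label that differs only between replicas. A minimal sketch of the relevant fields on the Prometheus resource follows; the cluster value is a made-up placeholder, and renaming the replica label to replica is an assumption chosen to match the label name used in the queries later in this guide (the operator's default name is prometheus_replica).

```yaml
# Fragment of the Prometheus resource spec (values are illustrative)
spec:
  replicas: 2
  externalLabels:
    cluster: primary                  # hypothetical cluster name, unique per Prometheus deployment
  replicaExternalLabelName: replica   # operator default is prometheus_replica
```

Whatever name is chosen here has to be the same one Thanos Query is told to treat as the replica label.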

2. Object Storage Configuration

apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring  # must be the namespace of the Prometheus resource that references it
type: Opaque
stringData:
  thanos.yaml: |
    type: s3
    config:
      bucket: thanos-metrics
      endpoint: s3.amazonaws.com
      access_key: AKIAIOSFODNN7EXAMPLE
      secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      region: us-east-1

Component Setup

1. Thanos Query

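apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: quay.io/thanos/thanos:v0.31.0
        args:
        - query
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:9090
        # Merge series that differ only in this external label (HA replicas)
        - --query.replica-label=replica
        # StoreAPI endpoints discovered through DNS SRV records
        - --store=dnssrv+_grpc._tcp.thanos-store-gateway:10901
        - --store=dnssrv+_grpc._tcp.thanos-sidecar:10901
        ports:
        - name: http
          containerPort: 9090
        - name: grpc
          containerPort: 10901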

2. Thanos Store

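apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
spec:
  replicas: 2
  # Headless Service that governs the StatefulSet and backs the SRV lookup
  serviceName: thanos-store-gateway
  selector:
    matchLabels:
      app: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
    spec:
      containers:
      - name: thanos-store
        image: quay.io/thanos/thanos:v0.31.0
        args:
        - store
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --data-dir=/var/thanos/store
        - --objstore.config-file=/etc/thanos/objstore.yml
        volumeMounts:
        - name: thanos-store-data
          mountPath: /var/thanos/store
        - name: thanos-objstore
          mountPath: /etc/thanos
          readOnly: true
      volumes:
      # Local cache only; an emptyDir is rebuilt from the bucket after a restart
      - name: thanos-store-data
        emptyDir: {}
      # Maps the Secret key thanos.yaml to the file name the flag expects
      - name: thanos-objstore
        secret:
          secretName: thanos-objstore-config
          items:
          - key: thanos.yaml
            path: objstore.yml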
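
The dnssrv+_grpc._tcp.thanos-store-gateway address that Thanos Query uses resolves only if a Service publishes SRV records under that name, which requires a port named grpc. A minimal sketch of a headless Service that would provide them is shown here; the app: thanos-store-gateway pod label is an assumption and must match whatever labels the Store pods actually carry.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
spec:
  clusterIP: None                 # headless: DNS returns one record per ready pod
  selector:
    app: thanos-store-gateway     # assumed pod label
  ports:
  - name: grpc                    # port name produces the _grpc._tcp SRV record
    port: 10901
    targetPort: 10901
    protocol: TCP
```

A headless Service is used so that Thanos Query sees every Store replica individually instead of a single load-balanced address.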

3. Thanos Compactor

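apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-compactor
spec:
  # Thanos expects a single compactor per bucket
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
      - name: thanos-compactor
        image: quay.io/thanos/thanos:v0.31.0
        args:
        - compact
        # Keep running and compact continuously instead of exiting after one pass
        - --wait
        - --data-dir=/var/thanos/compact
        - --objstore.config-file=/etc/thanos/objstore.yml
        - --retention.resolution-raw=30d
        - --retention.resolution-5m=90d
        - --retention.resolution-1h=1y
        volumeMounts:
        - name: data
          mountPath: /var/thanos/compact
        - name: thanos-objstore
          mountPath: /etc/thanos
          readOnly: true
      volumes:
      - name: data
        emptyDir: {}
      - name: thanos-objstore
        secret:
          secretName: thanos-objstore-config
          items:
          - key: thanos.yaml
            path: objstore.yml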
Query and Visualization

1. Grafana Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
    - name: Thanos
      type: prometheus
      url: http://thanos-query:9090
      access: proxy
      isDefault: true

2. Example PromQL Queries

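# Deduplicated query: Thanos Query merges HA replica series when it runs with
# --query.replica-label=replica, so do not sum across replicas in PromQL (that would double-count)
sum(rate(http_requests_total[5m]))

# Long-range query: with --query.auto-downsampling set on Thanos Query,
# it is served from downsampled blocks where they exist
sum(rate(http_requests_total[1d])) by (job)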

High Availability Setup

1. Multi-Cluster Configuration

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-eu
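spec:
  # The sidecar has no --cluster flag; Thanos identifies each Prometheus
  # by its external labels, which must be unique per cluster
  externalLabels:
    cluster: eu-west
  thanos:
    image: quay.io/thanos/thanos:v0.31.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config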

2. Cross-Region Query

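apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: quay.io/thanos/thanos:v0.31.0
        args:
        - query
        # One StoreAPI endpoint per region; each name must resolve from this cluster
        - --store=dnssrv+_grpc._tcp.thanos-store-eu-west:10901
        - --store=dnssrv+_grpc._tcp.thanos-store-us-east:10901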

Monitoring and Alerting

1. Thanos Alerts

groups:
- name: thanos-component-alerts
  rules:
  - alert: ThanosCompactHalted
    expr: thanos_compact_halted == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Compact has halted
  
  - alert: ThanosQueryHighDNSFailures
    expr: rate(thanos_query_store_apis_dns_failures_total[5m]) > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Query is having DNS resolution issues
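
With the Prometheus Operator, rule groups like the one above are usually loaded by wrapping them in a PrometheusRule resource. The sketch below shows that wrapper with one additional, hypothetical alert on object storage failures. The resource name, the alert name, the threshold, and the role: alert-rules label are assumptions; the label has to match the ruleSelector of your Prometheus resource.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: thanos-objstore-alerts        # hypothetical name
  namespace: monitoring
  labels:
    role: alert-rules                 # assumed; must match the Prometheus ruleSelector
spec:
  groups:
  - name: thanos-objstore-alerts
    rules:
    - alert: ThanosObjstoreOperationFailures   # hypothetical alert name
      # Fires when any component keeps failing bucket operations
      expr: rate(thanos_objstore_bucket_operation_failures_total[5m]) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: A Thanos component is failing object storage operations
```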

2. Performance Monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: thanos-components
spec:
  selector:
    matchLabels:
      app: thanos
  endpoints:
  - port: http
    interval: 15s
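
A ServiceMonitor selects Services (not pods) by label and scrapes the port with the given name, so the selector above only finds targets if a matching Service exists. A minimal sketch for Thanos Query follows; the app: thanos-query pod label is an assumption, and the Service is expected to live in the same namespace as the ServiceMonitor unless a namespaceSelector is added.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-query
  labels:
    app: thanos            # matched by the ServiceMonitor selector
spec:
  selector:
    app: thanos-query      # assumed pod label
  ports:
  - name: http             # matched by the ServiceMonitor endpoint port
    port: 9090
    targetPort: 9090
```

The same Service name and port also back the http://thanos-query:9090 URL used in the Grafana data source.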

Best Practices

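  1. Storage Configuration

    • Set retention per resolution to match how far back each level of detail is needed (for example 30d raw, 90d at 5m, 1y at 1h)
    • Configure bucket lifecycle policies so they do not expire blocks before the Compactor's retention does
    • Monitor bucket size and object storage error rates
  2. Query Optimization

    • Keep query time ranges as short as the question allows; long ranges read more blocks from object storage
    • Enable deduplication in Thanos Query so HA replicas are not double-counted
    • Monitor query latency and error rates
  3. High Availability

    • Deploy at least two replicas of Prometheus, Thanos Query, and Thanos Store
    • Use a cross-region setup with a shared Thanos Query layer
    • Back up or replicate the object storage bucket
  4. Resource Management

    • Set CPU and memory requests and limits on every component
    • Monitor component health
    • Scale based on metrics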
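
As an illustration of the resource management points above, requests and limits are set per container in each component's pod template. The numbers in this sketch are placeholders, not sizing advice; Store and Compactor memory needs depend heavily on the number and size of blocks in the bucket.

```yaml
# Container-level fragment for a Thanos component (placeholder values)
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    memory: 2Gi   # start generous and tune from observed usage
```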

Troubleshooting

Common Issues

  1. Store Issues
# Check object storage operation failures
rate(thanos_objstore_bucket_operation_failures_total[5m])

# Check series fetch duration (99th percentile)
histogram_quantile(0.99, sum by (le) (rate(thanos_bucket_store_series_fetch_duration_seconds_bucket[5m])))
  2. Query Issues
# Check query errors (HTTP 5xx responses from the query API)
rate(http_requests_total{handler="query", code=~"5.."}[5m])

# Check query duration (99th percentile)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{handler="query"}[5m])))
  3. Compaction Issues
# Check compaction failures
rate(thanos_compact_group_compactions_failures_total[5m])

# Check compaction throughput (this metric counts compactions; it is not a duration)
rate(thanos_compact_group_compactions_total[5m])

Additional Resources