Prometheus Remote Storage in Kubernetes

Configure long-term storage solutions for Prometheus metrics in Kubernetes

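Prometheus keeps metrics on local disk, usually for a limited retention window. Remote storage lets it ship those samples to an external system for durable, long-term retention and analysis. This guide covers three remote storage options (Thanos, Cortex, and VictoriaMetrics) and how to deploy them in Kubernetes.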

Remote Storage Options

1. Thanos

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  thanos:
    # baseImage is deprecated in the Prometheus Operator; set the full image reference instead
    image: quay.io/thanos/thanos:v0.31.0
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config

2. Cortex

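# prometheus.yml: remote_write is a top-level key, not part of the global block
remote_write:
  - url: "http://cortex:9009/api/v1/push"
    remote_timeout: 30s
    write_relabel_configs:
      # Drop Go runtime metrics before they are sent to Cortex
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop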
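
When Prometheus is managed by the Prometheus Operator, as in the other examples in this section, remote write settings would normally live in the Prometheus custom resource rather than in prometheus.yml. The following is a minimal sketch of the equivalent configuration, assuming the same Cortex URL and the resource name `prometheus` used elsewhere in this guide:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: "http://cortex:9009/api/v1/push"   # assumed in-cluster Cortex address
      remoteTimeout: 30s
      writeRelabelConfigs:
        # Drop Go runtime metrics before they are sent to Cortex
        - sourceLabels: [__name__]
          regex: 'go_.*'
          action: drop
```

The custom resource uses camelCase field names (`remoteTimeout`, `writeRelabelConfigs`, `sourceLabels`) where prometheus.yml uses snake_case.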

3. VictoriaMetrics

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: "http://victoria-metrics:8428/api/v1/write"
      queueConfig:
        capacity: 10000
        maxShards: 30
        minShards: 1

Implementation Methods

1. Thanos Setup

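# thanos-objstore-config.yaml
# The access_key and secret_key values below are placeholders; replace them with your own credentials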
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
type: Opaque
stringData:
  thanos.yaml: |
    type: s3
    config:
      bucket: thanos-metrics
      endpoint: s3.amazonaws.com
      access_key: AKIAIOSFODNN7EXAMPLE
      secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      region: us-east-1

2. Cortex Configuration

# cortex-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cortex-config
data:
  cortex.yaml: |
    distributor:
      shard_by_all_labels: true
      pool:
        health_check_ingesters: true
    
    ingester:
      lifecycler:
        ring:
          replication_factor: 3
      chunk_idle_period: 30m
      max_chunk_age: 2h

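3. VictoriaMetrics Setup

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: victoria-metrics
spec:
  serviceName: victoria-metrics
  replicas: 1
  selector:
    matchLabels:
      app: victoria-metrics
  template:
    metadata:
      labels:
        app: victoria-metrics
    spec:
      containers:
      - name: victoria-metrics
        # Pin an explicit version tag in production instead of relying on the default tag
        image: victoriametrics/victoria-metrics
        args:
          - --storageDataPath=/storage
          - --retentionPeriod=1y
        ports:
        - name: http
          containerPort: 8428
        volumeMounts:
        # Persist the data path; without a volume, data is lost when the pod restarts
        - name: storage
          mountPath: /storage
  volumeClaimTemplates:
  - metadata:
      name: storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 50Gi  # example size; adjust to your retention period and ingest rate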
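
The StatefulSet's `serviceName` and the remote write URL `http://victoria-metrics:8428/api/v1/write` both presume that a Service named `victoria-metrics` exists. A minimal sketch of such a Service, assuming a headless Service is acceptable and that VictoriaMetrics listens on port 8428 as in the URL above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: victoria-metrics
spec:
  clusterIP: None          # headless; serves as the governing Service for the StatefulSet
  selector:
    app: victoria-metrics  # matches the pod labels in the StatefulSet
  ports:
    - name: http
      port: 8428
      targetPort: 8428
```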

Storage Configuration

1. S3 Storage

apiVersion: v1
kind: Secret
metadata:
  name: remote-storage-credentials
type: Opaque
stringData:
  config.yaml: |
    type: S3
    config:
      bucket: metrics-storage
      endpoint: s3.amazonaws.com
      region: us-east-1
      access_key: AKIAXXXXXXXXXXXXXXXX
      secret_key: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

2. GCS Storage

apiVersion: v1
kind: Secret
metadata:
  name: gcs-credentials
type: Opaque
stringData:
  gcs.json: |
    {
      "type": "service_account",
      "project_id": "your-project",
      "private_key_id": "key-id",
      "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
      "client_email": "service-account@project.iam.gserviceaccount.com",
      "client_id": "client-id"
    }
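
One way to use a service account like this with Thanos is to embed it in a GCS object storage configuration. The following is a sketch only: the Secret name `thanos-objstore-config-gcs` and the bucket name are assumptions, and the JSON repeats the placeholder values from the Secret above.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config-gcs   # hypothetical name
type: Opaque
stringData:
  thanos.yaml: |
    type: GCS
    config:
      bucket: metrics-storage        # assumed bucket name
      # Service account JSON passed inline; values are placeholders
      service_account: |-
        {
          "type": "service_account",
          "project_id": "your-project",
          "private_key_id": "key-id",
          "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
          "client_email": "service-account@project.iam.gserviceaccount.com",
          "client_id": "client-id"
        }
```

A Prometheus resource could then reference this Secret through `objectStorageConfig`, in the same way the S3-backed Thanos example references `thanos-objstore-config`.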

Query Configuration

1. Query Frontend

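apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query-frontend
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: thanos-query-frontend
  template:
    metadata:
      labels:
        app: thanos-query-frontend
    spec:
      containers:
      - name: thanos-query-frontend
        image: quay.io/thanos/thanos:v0.31.0
        args:
        - query-frontend
        # Forward queries to the Thanos Query service
        - --query-frontend.downstream-url=http://thanos-query:9090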

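2. Querier

apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos-query
        image: quay.io/thanos/thanos:v0.31.0
        args:
        - query
        # Discover Store Gateway gRPC endpoints through DNS SRV records.
        # --store is deprecated in favor of --endpoint in recent Thanos releases.
        - --store=dnssrv+_grpc._tcp.thanos-store-gateway:10901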

Retention and Compaction

1. Retention Configuration

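apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-compactor
spec:
  serviceName: thanos-compactor
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
      - name: thanos-compactor
        image: quay.io/thanos/thanos:v0.31.0
        args:
        - compact
        - --wait  # keep running instead of exiting after one compaction pass
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --retention.resolution-raw=30d
        - --retention.resolution-5m=90d
        - --retention.resolution-1h=1y
        env:
        - name: OBJSTORE_CONFIG
          valueFrom:
            secretKeyRef:
              name: thanos-objstore-config
              key: thanos.yaml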

2. Compaction Settings

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
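spec:
  # Keep local retention short; long-term data lives in remote storage
  retention: 6h
  retentionSize: 10GB
  tsdb:
    # Accept samples that arrive up to 30 minutes out of order
    outOfOrderTimeWindow: 30m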

Monitoring and Alerting

1. Remote Storage Alerts

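groups:
- name: RemoteStorageAlerts
  rules:
  - alert: RemoteWriteErrors
    # Older Prometheus releases expose this counter as prometheus_remote_storage_failed_samples_total
    expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Remote write errors detected

  - alert: RemoteStorageQueueFull
    # Fires when the newest sample sent lags more than 120s behind the newest sample ingested
    expr: |
      (
        max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
        - ignoring(remote_name, url) group_right
        max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
      ) > 120
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: Remote write is falling behind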
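
With the Prometheus Operator, rule groups like these are usually loaded through a PrometheusRule resource instead of a rule file. The following is a minimal sketch for the write-error alert. The resource name and the `role: alert-rules` label are assumptions, and the label must match whatever `ruleSelector` your Prometheus resource uses. The metric name shown is the one exposed by recent Prometheus releases; older releases call it `prometheus_remote_storage_failed_samples_total`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: remote-storage-alerts   # hypothetical name
  labels:
    role: alert-rules           # assumed; must match spec.ruleSelector on the Prometheus resource
spec:
  groups:
  - name: RemoteStorageAlerts
    rules:
    - alert: RemoteWriteErrors
      expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Remote write errors detected
```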

2. Performance Monitoring

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: remote-storage-monitor
spec:
  selector:
    matchLabels:
      app: prometheus
  endpoints:
  - port: web
    interval: 30s
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: prometheus_remote_storage_.*
      action: keep

Best Practices

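  1. Performance Optimization

    • Size the remote write queue (capacity, minShards, maxShards) to match your sample ingestion rate
    • Set retention per resolution so that raw data expires before downsampled data
    • Monitor write latency and failed samples
  2. High Availability

    • Deploy multiple replicas of Prometheus and the query components
    • Spread replicas across availability zones
    • Back up object storage buckets and persistent volumes
  3. Cost Optimization

    • Keep local retention short and rely on remote storage for long-term data
    • Use compression where possible
    • Monitor storage usage and series growth
  4. Security

    • Encrypt data at rest in object storage and on persistent volumes
    • Store credentials in Secrets and restrict access to them with RBAC
    • Audit credentials and access policies regularly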
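
The high-availability items above can be sketched for an Operator-managed Prometheus as follows. This is an illustration under stated assumptions, not a complete configuration: the label `app.kubernetes.io/name: prometheus` is assumed to be present on the Prometheus pods, so check the labels on your own pods before relying on it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  replicas: 2
  topologySpreadConstraints:
    # Spread the replicas evenly across availability zones
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway   # prefer spreading, but still schedule if a zone is unavailable
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus   # assumed pod label
```

`ScheduleAnyway` favors availability over strict spreading. `DoNotSchedule` enforces the spread but can leave a replica pending when a zone has no capacity.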

Troubleshooting

Common Issues

  1. Write Failures

# Monitor failed writes
rate(prometheus_remote_storage_samples_failed_total[5m])

# Check the number of samples waiting in the queue
prometheus_remote_storage_samples_pending

  2. Performance Issues

# Monitor average write latency per batch
rate(prometheus_remote_storage_sent_batch_duration_seconds_sum[5m])
/
rate(prometheus_remote_storage_sent_batch_duration_seconds_count[5m])

  3. Storage Issues

# Monitor series churn, a common driver of storage growth
rate(prometheus_tsdb_head_series_created_total[1h])
