Kubernetes Cost Optimization - A Practical Guide

Managing costs in Kubernetes environments is a challenge for organizations of all sizes. This guide presents strategies and working configurations for reducing Kubernetes costs without compromising performance or reliability.

Understanding Cost Components

Before implementing optimization strategies, it’s crucial to understand your Kubernetes cost structure:

Core Cost Drivers

  1. Compute Resources: CPU and memory usage across nodes
  2. Storage: Persistent volumes and their associated costs
  3. Network: Data transfer between zones/regions
  4. Management Overhead: Control plane and monitoring costs

Cost Analysis Example

Let’s analyze a typical scenario of a microservices application:

Production Environment Example:
├── Frontend Service: 10 pods × 0.5 CPU, 1Gi RAM
├── Backend APIs: 15 pods × 1 CPU, 2Gi RAM
├── Cache Layer: 3 pods × 2 CPU, 4Gi RAM
└── Database: 2 pods × 4 CPU, 8Gi RAM

Monthly Cost Breakdown:
- Compute: $2,500
- Storage: $800
- Network: $400
- Management: $200
Total: $3,900/month
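
To make the example easier to reason about, the pod counts and the cost breakdown can be totalled programmatically. The sketch below only restates the figures listed above; the names WORKLOADS, MONTHLY_COSTS, total_requests, and cost_shares are illustrative.

```python
# Workload figures copied from the production example above:
# (name, replicas, CPU cores per pod, memory GiB per pod)
WORKLOADS = [
    ("frontend", 10, 0.5, 1),
    ("backend-apis", 15, 1.0, 2),
    ("cache", 3, 2.0, 4),
    ("database", 2, 4.0, 8),
]

# Monthly cost figures copied from the breakdown above (USD).
MONTHLY_COSTS = {"compute": 2500, "storage": 800, "network": 400, "management": 200}


def total_requests(workloads):
    """Return (total CPU cores, total memory GiB) across all pods."""
    cpu = sum(replicas * cpu_per_pod for _, replicas, cpu_per_pod, _ in workloads)
    mem = sum(replicas * mem_per_pod for _, replicas, _, mem_per_pod in workloads)
    return cpu, mem


def cost_shares(costs):
    """Return each cost category as a percentage of the total, rounded to 0.1."""
    total = sum(costs.values())
    return {name: round(100 * value / total, 1) for name, value in costs.items()}


if __name__ == "__main__":
    cpu, mem = total_requests(WORKLOADS)
    print(f"Total requested: {cpu} CPU cores, {mem} GiB memory")
    print(f"Total monthly cost: ${sum(MONTHLY_COSTS.values()):,}")
    for name, share in cost_shares(MONTHLY_COSTS).items():
        print(f"  {name}: {share}%")
```

Compute is the largest share of the bill in this example, which is why the strategies that follow start with resource requests and scaling.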

Implementation Requirements

Technical Prerequisites

  • Access to a Kubernetes cluster (v1.24+)
  • kubectl CLI tool installed (v1.24+)
  • Helm package manager (v3.0+)
  • A cost monitoring solution (e.g., Kubecost, OpenCost)
  • Cloud provider access (if using cloud-managed Kubernetes)
  • Basic understanding of Kubernetes resource management

Core Optimization Strategies

1. Resource Management

Resource Quotas

Prevent resource hogging and overprovisioning:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    # Compute Resources
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    
    # Storage
    requests.storage: 500Gi
    persistentvolumeclaims: "20"
    
    # Object Counts
    pods: "100"
    services: "50"
    configmaps: "50"
    secrets: "50"

Container Limits

Set default resource constraints:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "200m"
    max:
      memory: "2Gi"
      cpu: "2"
    min:
      memory: "128Mi"
      cpu: "100m"
    type: Container
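
The effect of this LimitRange on a container can be modelled in a few lines. This is a simplified sketch of the admission behaviour, not the Kubernetes implementation: it fills in missing values from the defaults above, uses an explicit limit as the request when only a limit is given, and rejects values outside min and max. The function and constant names are illustrative.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity):
    """Convert a CPU quantity such as "500m" or "2" to cores."""
    return int(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)


def parse_memory(quantity):
    """Convert a Ki/Mi/Gi or plain-byte memory quantity to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


# Values copied from the LimitRange manifest above.
LIMIT_RANGE = {
    "default": {"cpu": "500m", "memory": "512Mi"},
    "defaultRequest": {"cpu": "200m", "memory": "256Mi"},
    "max": {"cpu": "2", "memory": "2Gi"},
    "min": {"cpu": "100m", "memory": "128Mi"},
}
PARSERS = {"cpu": parse_cpu, "memory": parse_memory}


def apply_limit_range(resources, limit_range=LIMIT_RANGE):
    """Return the effective requests and limits, or raise ValueError."""
    explicit_requests = resources.get("requests", {})
    explicit_limits = resources.get("limits", {})
    requests, limits = {}, {}
    for name, parse in PARSERS.items():
        limits[name] = explicit_limits.get(name, limit_range["default"][name])
        if name in explicit_requests:
            requests[name] = explicit_requests[name]
        elif name in explicit_limits:
            # A limit without a request: the request follows the limit.
            requests[name] = explicit_limits[name]
        else:
            requests[name] = limit_range["defaultRequest"][name]
        if parse(requests[name]) < parse(limit_range["min"][name]):
            raise ValueError(f"{name} request {requests[name]} is below min")
        if parse(limits[name]) > parse(limit_range["max"][name]):
            raise ValueError(f"{name} limit {limits[name]} is above max")
        if parse(requests[name]) > parse(limits[name]):
            raise ValueError(f"{name} request exceeds its limit")
    return {"requests": requests, "limits": limits}
```

A container that declares nothing ends up with the 200m / 256Mi requests and 500m / 512Mi limits, while a container asking for a 4Gi memory limit is rejected.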

2. Dynamic Scaling

Horizontal Pod Autoscaling

Scale replica counts on CPU and memory utilization, with a stabilization window to damp scale-down:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
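
The replica count the HPA aims for follows the documented rule desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within a tolerance (0.1 by default) and the result clamped to minReplicas and maxReplicas. The sketch below applies that rule to the targets in the manifest above; it does not model the behavior policies, and the function names are illustrative.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=15, tolerance=0.1):
    """Replica count one metric asks for, clamped to the HPA bounds."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def desired_for_all_metrics(current_replicas, cpu_utilization, memory_utilization):
    """With several metrics, the largest proposed replica count wins."""
    return max(
        desired_replicas(current_replicas, cpu_utilization, 70),     # CPU target
        desired_replicas(current_replicas, memory_utilization, 80),  # memory target
    )


if __name__ == "__main__":
    print(desired_replicas(5, 90, 70))
    print(desired_for_all_metrics(5, 50, 100))
```

Because the largest proposal wins, a memory target that is set too low can keep replicas high even when CPU is idle, which is worth checking when costs do not drop after enabling autoscaling.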

Vertical Pod Autoscaling

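Optimize resource allocation automatically. The Vertical Pod Autoscaler is a separate component that must be installed in the cluster. Avoid running a VPA in Auto mode alongside an HPA that scales the same workload on CPU or memory, because the two controllers work against each other; either drive the HPA from custom or external metrics, or set updateMode to "Off" to collect recommendations only: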

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-service
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 1
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
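
The minAllowed and maxAllowed bounds act as a clamp on whatever the recommender proposes. A minimal sketch of that clamping, using the bounds from the manifest above expressed in millicores and MiB (the dictionary keys and function name are illustrative):

```python
# Bounds copied from the VPA resourcePolicy above (cpu: 1 is 1000 millicores).
MIN_ALLOWED = {"cpu_millicores": 100, "memory_mib": 128}
MAX_ALLOWED = {"cpu_millicores": 1000, "memory_mib": 2048}


def clamp_recommendation(recommendation):
    """Keep each recommended value inside [minAllowed, maxAllowed]."""
    return {
        name: max(MIN_ALLOWED[name], min(MAX_ALLOWED[name], value))
        for name, value in recommendation.items()
    }


if __name__ == "__main__":
    # A recommendation that is too small on CPU and too large on memory.
    print(clamp_recommendation({"cpu_millicores": 50, "memory_mib": 4096}))
```

Setting maxAllowed is what stops an automatic recommendation from turning into an unexpectedly large bill.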

3. Infrastructure Optimization

Node Pool Strategies

Implement cost-effective instance management:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-west-2
managedNodeGroups:
  - name: mixed-instances-1
    instanceTypes:
      - t3.large
      - t3a.large
      - m5.large
      - m5a.large
    desiredCapacity: 3
    minSize: 1
    maxSize: 10
    # Managed node groups take a list of instance types plus spot: true;
    # EKS chooses the Spot allocation strategy for them
    spot: true

Workload Placement

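Optimize pod scheduling for cost efficiency. The instance-type node label and the spot taint used here are examples; they must match the labels and taints applied to your own Spot node pool: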

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cost-optimized-app
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: instance-type
                operator: In
                values:
                - spot
      tolerations:
      - key: spot
        operator: Equal
        value: "true"
        effect: NoSchedule

4. Storage Optimization

StorageClass Configuration

Implement cost-effective storage tiers:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-retain
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
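provisioner: ebs.csi.aws.com  # AWS EBS CSI driver; the in-tree kubernetes.io/aws-ebs provisioner is deprecated
parameters:
  type: gp3
  iops: "3000"        # absolute IOPS (gp3 baseline); iopsPerGB is multiplied by volume size
  throughput: "125"   # MiB/s (gp3 baseline)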
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

Cost Monitoring and Analysis

Implementing a robust cost monitoring solution is crucial for maintaining visibility into your Kubernetes spending and identifying optimization opportunities.

Kubecost Implementation

Prerequisites

  • Kubernetes cluster (v1.16+)
  • Helm 3
  • Metrics Server installed
  • At least 2 CPU cores and 4GB RAM available
  • Storage class for persistent volumes

Installation Methods

Quick Installation

For a quick setup with default configuration:

# Create namespace
kubectl create namespace kubecost

# Add Helm repository
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update

# Quick install with basic configuration
helm install kubecost kubecost/cost-analyzer \
    --namespace kubecost \
    --set kubecostToken="" \
    --set prometheus.nodeExporter.enabled=true \
    --set prometheus.serviceMonitor.enabled=true \
    --set networkCosts.enabled=true \
    --set grafana.enabled=true

Advanced Installation

For production environments or custom configurations:

  1. Create Configuration File
# kubecost-values.yaml
global:
  prometheus:
    enabled: true
    fqdn: http://kubecost-prometheus-server.kubecost.svc
    
kubecostProductConfigs:
  clusters: ["cluster-one"]

networkCosts:
  enabled: true
  config:
    services: true
    pods: true
    nodes: true

prometheus:
  nodeExporter:
    enabled: true
  serviceMonitor:
    enabled: true
  server:
    retention: 15d
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 1000m
        memory: 4Gi
  
grafana:
  enabled: true
  sidecar:
    dashboards:
      enabled: true
  resources:
    requests:
      cpu: 100m
      memory: 512Mi
    limits:
      cpu: 200m
      memory: 1Gi

serviceMonitor:
  enabled: true

persistentVolume:
  size: "0.2Gi"
  dbSize: "32.0Gi"

notifications:
  slack:
    enabled: true
    webhook: "https://hooks.slack.com/services/your/webhook/url"
  
thanos:
  enabled: false  # Enable for long-term storage

etl:
  enabled: true
  resources:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 400m
      memory: 2Gi

  2. Install with Custom Configuration
helm install kubecost kubecost/cost-analyzer \
    --namespace kubecost \
    -f kubecost-values.yaml

OpenCost Implementation

Prerequisites

  • Kubernetes cluster (v1.16+)
  • Helm 3
  • Prometheus installed
  • Metrics Server

Installation Methods

Quick Installation

For a quick setup with basic configuration:

# Create namespace
kubectl create namespace opencost

# Add Helm repository
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

# Create a values file for annotations
cat <<EOF > opencost-values.yaml
opencost:
  prometheus:
    external:
      url: http://prometheus-server.monitoring.svc
  metrics:
    window: "1d"  # Default window for queries
    resolution: "1h"  # Default resolution for queries
  ui:
    enabled: true
    port: 9003
  exporter:
    enable: true
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi

service:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9003"
  type: ClusterIP
EOF

# Install OpenCost using values file
helm install opencost opencost/opencost \
    --namespace opencost \
    -f opencost-values.yaml

Advanced Installation

For production environments or custom configurations:

  1. Create Configuration File
# opencost-values.yaml
opencost:
  prometheus:
    external:
      url: http://prometheus-server.monitoring.svc
  
  exporter:
    enable: true
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi

  metrics:
    serviceMonitor:
      enabled: true
      interval: 30s
    window: 1h
    resolution: 1m

  cloudProvider:
    enabled: true
    aws:
      enabled: true
      secretName: aws-secret
      secretKey: credentials

  customPricing:
    enabled: true
    configPath: /models/pricing/configs/custom.json
    data:
      CPU: 0.031611
      RAM: 0.004237
      storage: 0.00005
      GPU: 0.95

service:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9003"

ui:
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: nginx
    hosts:
      - host: opencost.example.com
        paths:
          - path: /
            pathType: Prefix

persistentVolume:
  enabled: true
  size: 10Gi

  2. Install with Custom Configuration
helm install opencost opencost/opencost \
    --namespace opencost \
    -f opencost-values.yaml
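
To see what the customPricing values in the configuration file imply, the rates can be applied to a set of resource requests. The sketch below assumes the CPU value is a price per core-hour and the RAM value a price per GiB-hour, and uses 730 hours per month; both are assumptions, and monthly_cost is an illustrative name. The inputs are the 34 CPU cores and 68 GiB requested by the pods in the cost analysis example earlier in this guide.

```python
# Rates copied from customPricing.data above.
# Assumption: USD per core-hour and USD per GiB-hour.
CPU_RATE = 0.031611
RAM_RATE = 0.004237

# Assumption: average hours in a month (8760 / 12).
HOURS_PER_MONTH = 730


def monthly_cost(cpu_cores, memory_gib, hours=HOURS_PER_MONTH):
    """Estimated monthly cost in USD for the given requests."""
    hourly = cpu_cores * CPU_RATE + memory_gib * RAM_RATE
    return round(hourly * hours, 2)


if __name__ == "__main__":
    # 10 x 0.5 + 15 x 1 + 3 x 2 + 2 x 4 = 34 cores; 10 + 30 + 12 + 16 = 68 GiB
    print(monthly_cost(34, 68))
```

An estimate like this covers requested CPU and memory only; node overhead, storage, and network charges are billed on top.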

Using OpenCost

  1. Access Dashboard
# Port forward the OpenCost API (9003) and UI (9090) ports
kubectl port-forward -n opencost svc/opencost 9003 9090

  2. Access the UI
# With the default chart settings the UI is available at
open http://localhost:9090

  3. API Endpoints
# Get cost allocation for the last 24 hours
curl "http://localhost:9003/allocation/compute?window=24h"

# Get cost allocation for a specific time window
curl "http://localhost:9003/allocation/compute?window=168h"  # Last 7 days

# Get asset information with window
curl "http://localhost:9003/assets?window=24h"

# Get efficiency metrics
curl "http://localhost:9003/efficiency?window=24h"

# Get all available metrics
curl "http://localhost:9003/metrics"

# Get cost allocation by namespace
curl "http://localhost:9003/allocation/compute?window=24h&aggregate=namespace"

# Get cost allocation by label
curl "http://localhost:9003/allocation/compute?window=24h&aggregate=label:app"

Common window parameters:

  • 1h: Last hour
  • 24h: Last 24 hours
  • 7d: Last 7 days
  • 30d: Last 30 days
  • Custom range: window=2023-01-01T00:00:00Z,2023-01-31T23:59:59Z
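
When these queries are scripted rather than typed, the URL is easier to build than to concatenate by hand. The sketch below assembles the same /allocation/compute URLs shown above with the standard library; allocation_url and BASE_URL are illustrative names, and the base URL assumes the port-forward from the previous step.

```python
from urllib.parse import urlencode

# Assumes the OpenCost API is port-forwarded to localhost:9003.
BASE_URL = "http://localhost:9003"


def allocation_url(window, aggregate=None, base_url=BASE_URL):
    """Build an allocation query URL for a window and optional aggregation."""
    params = {"window": window}
    if aggregate:
        params["aggregate"] = aggregate
    # Keep ':' and ',' readable so label:app and date ranges look as documented.
    return f"{base_url}/allocation/compute?{urlencode(params, safe=':,')}"


if __name__ == "__main__":
    print(allocation_url("24h", aggregate="namespace"))
    print(allocation_url("7d", aggregate="label:app"))
    print(allocation_url("2023-01-01T00:00:00Z,2023-01-31T23:59:59Z"))
```

The resulting strings can be passed to curl or to any HTTP client.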

  4. Verify Installation
# Check if pods are running
kubectl get pods -n opencost

# Check service
kubectl get svc -n opencost

# View logs
kubectl logs -n opencost deployment/opencost

# Check if metrics are being collected
curl "http://localhost:9003/metrics" | grep "opencost_"

Key Differences:

  1. Port Numbers:

    • Kubecost uses port 9090 by default
    • OpenCost serves its API on port 9003 and its UI on port 9090 by default
  2. UI Access:

    • Kubecost provides a full-featured web UI at the root path
    • OpenCost provides a lighter UI, served on a separate port from its API
  3. API Access:

    • Kubecost API is primarily accessed through the UI
    • OpenCost provides direct REST API endpoints for programmatic access
  4. Dashboard Features:

    • Kubecost offers more built-in visualizations and reports
    • OpenCost focuses on API-first approach with basic UI visualizations

Real-World Results

Cost Reduction Case Studies

  1. E-Commerce Platform

    • Initial Monthly Cost: $25,000
    • Optimized Cost: $15,000
    • Key Improvements:
      • Resource right-sizing: -25%
      • Spot instance usage: -15%
      • Storage optimization: -10%
      • Improved autoscaling: -10%
  2. SaaS Application

    • Initial Monthly Cost: $40,000
    • Optimized Cost: $28,000
    • Improvements:
      • Cluster consolidation: -20%
      • Network optimization: -15%
      • Resource quotas: -10%

Implementation Success Metrics

  1. Resource Efficiency

    • CPU utilization improved from 35% to 70%
    • Memory utilization improved from 45% to 75%
    • Storage utilization improved from 50% to 85%
  2. Cost Efficiency

    • Cost per transaction reduced by 40%
    • Cost per user reduced by 35%
    • Infrastructure cost per revenue dollar reduced by 25%

Maintenance Guidelines

Daily Operations

  1. Resource Monitoring
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: daily-resource-monitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: resource-metrics
  endpoints:
  - port: metrics
    interval: 5m
    path: /metrics

  2. Cost Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: daily-cost-alerts
spec:
  groups:
  - name: daily.costs
    rules:
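    - alert: DailyCostSpike
      # CPU usage stands in for cost here; the 1.5x day-over-day factor is an example threshold
      expr: |
        sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[1h]))
          > 1.5 * sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[1h] offset 1d))
      for: 1h
      labels:
        severity: warning
      annotations:
        description: Daily cost spike detected in namespace {{ $labels.namespace }}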

Weekly Tasks

  1. Resource Optimization
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-optimization
spec:
  schedule: "0 0 * * 0"
  jobTemplate:
    spec:
      template:
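        spec:
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never
          containers:
          - name: optimizer
            image: resource-optimizer:v1  # placeholder; substitute your own image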
            args:
            - --analyze-usage
            - --suggest-optimizations
            - --generate-report

  2. Compliance Checks
apiVersion: batch/v1
kind: CronJob
metadata:
  name: compliance-check
spec:
  schedule: "0 0 * * 1"
  jobTemplate:
    spec:
      template:
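        spec:
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never
          containers:
          - name: compliance-checker
            image: compliance-check:v1  # placeholder; substitute your own image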
            env:
            - name: CHECK_RESOURCE_QUOTAS
              value: "true"
            - name: CHECK_COST_LABELS
              value: "true"

Monthly Reviews

  1. Cost Analysis

    • Review monthly trends
    • Compare against budgets
    • Identify optimization opportunities
    • Update cost allocation models
  2. Performance Impact

    • Review service SLAs
    • Analyze resource efficiency
    • Update scaling policies
    • Optimize resource requests

Quarterly Planning

  1. Strategy Review

    • Evaluate cost optimization goals
    • Update resource allocation strategies
    • Review cloud provider pricing
    • Plan major optimizations
  2. Capacity Planning

    • Forecast resource needs
    • Plan cluster scaling
    • Review storage requirements
    • Update budget allocations

Automation Tools

  1. Resource Cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: resource-cleanup
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
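        spec:
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never
          containers:
          - name: cleanup
            image: resource-janitor:v1  # placeholder; substitute your own image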
            env:
            - name: CLEANUP_UNUSED_PVS
              value: "true"
            - name: CLEANUP_UNBOUND_PVS
              value: "true"
            - name: MAX_PV_AGE_DAYS
              value: "7"

  2. Cost Forecasting
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-forecasting
spec:
  schedule: "0 0 1 * *"
  jobTemplate:
    spec:
      template:
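        spec:
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never
          containers:
          - name: forecaster
            image: cost-forecaster:v1  # placeholder; substitute your own image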
            env:
            - name: FORECAST_MONTHS
              value: "3"
            - name: INCLUDE_GROWTH_PATTERNS
              value: "true"
            - name: ALERT_ON_THRESHOLD
              value: "true"

Best Practices for Cost Optimization

1. Resource Right-Sizing

Memory and CPU Optimization

apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-app
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "200m"    # Based on actual usage patterns
            memory: "256Mi"
          limits:
            cpu: "500m"    # Prevent resource hogging
            memory: "512Mi"
        # Size the JVM heap relative to the container memory limit
        env:
        - name: JAVA_OPTS
          value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"

Best Practice Tips:

  • Start with metrics-based resource requests
  • Set limits 2-3x higher than requests
  • Use container-aware JVM settings
  • Monitor and adjust based on actual usage

2. Cost-Effective Storage

Storage Class Configuration

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: optimized-storage
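  annotations:
    # Only one StorageClass in a cluster should be marked as the default
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com  # AWS EBS CSI driver; the in-tree kubernetes.io/aws-ebs provisioner is deprecated
parameters:
  type: gp3
  iops: "3000"  # absolute IOPS (gp3 baseline); iopsPerGB is multiplied by volume size
  encrypted: "true"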
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

Storage Optimization Tips:

  • Use appropriate storage tiers
  • Enable volume expansion
  • Implement automatic backup cleanup
  • Configure volume snapshots

3. Network Cost Management

Network Policy Implementation

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: optimize-egress
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          purpose: production
    ports:
    - port: 443
      protocol: TCP

Network Optimization Tips:

  • Use regional clusters when possible
  • Implement cross-zone traffic policies
  • Enable VPC endpoints for cloud services
  • Monitor and optimize egress traffic

4. Workload Scheduling

Node Affinity and Anti-Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cost-optimized-deployment
spec:
  template:
    spec:
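      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          # Capacity-type labels are provider-specific; node.kubernetes.io/instance-type
          # holds the machine type (for example m5.large), not spot or on-demand
          - weight: 100
            preference:
              matchExpressions:
              - key: eks.amazonaws.com/capacityType  # EKS managed node groups
                operator: In
                values:
                - SPOT
          - weight: 100
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-spot  # GKE Spot VMs
                operator: In
                values:
                - "true"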
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: high-availability

Scheduling Best Practices:

  • Use spot/preemptible instances for suitable workloads
  • Implement proper pod disruption budgets
  • Balance between cost and availability
  • Consider time-based scaling

Implementation Roadmap

Phase 1: Assessment and Planning (Week 1-2)

  1. Baseline Measurement
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-baseline
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
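        spec:
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never
          containers:
          - name: metrics-collector
            image: metrics-collector:v1  # placeholder; substitute your own image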
            env:
            - name: METRICS_ENDPOINT
              value: "http://prometheus:9090"
            - name: EXPORT_BUCKET
              value: "s3://cost-metrics/baseline"

  2. Resource Audit
#!/bin/bash
# audit-resources.sh
# Node capacity versus allocatable resources
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity, allocatable: .status.allocatable}'
# Requests and limits per container, grouped under each pod
kubectl get pods --all-namespaces -o json | jq '.items[] | {name: .metadata.name, namespace: .metadata.namespace, containers: [.spec.containers[] | {name: .name, requests: .resources.requests, limits: .resources.limits}]}'
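
The pod listing from the audit is easier to act on once requests are summed per namespace. The following sketch does that for the JSON structure returned by kubectl get pods -o json; the sample data is made up, the helper names are illustrative, and only the Ki, Mi, and Gi memory suffixes are handled.

```python
from collections import defaultdict

MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(quantity):
    """Convert "250m" or "2" style CPU quantities to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert Ki/Mi/Gi or plain-byte memory quantities to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def requests_by_namespace(pod_list):
    """Sum container requests per namespace (memory in whole MiB per container)."""
    totals = defaultdict(lambda: {"cpu_millicores": 0, "memory_mib": 0})
    for pod in pod_list.get("items", []):
        namespace = pod["metadata"]["namespace"]
        for container in pod["spec"]["containers"]:
            requests = container.get("resources", {}).get("requests", {})
            totals[namespace]["cpu_millicores"] += cpu_millicores(requests.get("cpu", "0"))
            totals[namespace]["memory_mib"] += memory_bytes(requests.get("memory", "0")) // 2**20
    return dict(totals)


# Made-up sample in the shape of `kubectl get pods --all-namespaces -o json`.
SAMPLE_PODS = {"items": [
    {"metadata": {"name": "web-1", "namespace": "production"},
     "spec": {"containers": [
         {"name": "app", "resources": {"requests": {"cpu": "500m", "memory": "1Gi"}}},
         {"name": "sidecar", "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}},
     ]}},
    {"metadata": {"name": "job-1", "namespace": "batch"},
     "spec": {"containers": [{"name": "worker", "resources": {}}]}},
]}

if __name__ == "__main__":
    for namespace, totals in requests_by_namespace(SAMPLE_PODS).items():
        print(namespace, totals)
```

To run it against a live cluster, replace SAMPLE_PODS with json.load(sys.stdin) and pipe the kubectl output into the script. Namespaces whose totals are zero, like batch in the sample, contain pods with no requests and are the first candidates for the quotas and limit ranges described earlier.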

Phase 2: Initial Optimization (Week 3-4)

  1. Resource Quotas Implementation
apiVersion: v1
kind: ResourceQuota
metadata:
  name: phase-2-quota
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi

  2. Monitoring Setup
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cost-monitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: cost-metrics
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - monitoring

Phase 3: Advanced Optimization (Week 5-8)

  1. Automated Scaling Implementation
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: phase-3-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: target-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300

  2. Cost Allocation Implementation
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    cost-center: "1001"
    department: "engineering"
    environment: "production"
  annotations:
    billing.kubecost.com/alert: "true"
    billing.kubecost.com/alert-threshold: "1000"

Phase 4: Continuous Optimization (Ongoing)

  1. Automated Cost Reports
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-report
spec:
  schedule: "0 0 * * 1"  # Weekly
  jobTemplate:
    spec:
      template:
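        spec:
          restartPolicy: OnFailure  # Job pods must use OnFailure or Never
          containers:
          - name: reporter
            image: cost-reporter:v1  # placeholder; substitute your own image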
            env:
            - name: REPORT_RECIPIENTS
              value: "finance@company.com,engineering@company.com"
            - name: COST_THRESHOLD
              value: "10000"
            - name: ALERT_ON_INCREASE
              value: "20"  # Alert on 20% increase

  2. Continuous Monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-alerts
spec:
  groups:
  - name: costs
    rules:
    - alert: CostSpike
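      # Compare the 6-hour CPU rate with the 7-day average rate, per namespace
      expr: |
        sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[6h]))
          > 2 * sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[7d]))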
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Cost spike detected
        description: Resource usage significantly higher than weekly average

Implementation Checklist

  1. Pre-Implementation

    • Baseline cost metrics collected
    • Resource utilization patterns analyzed
    • Team responsibilities assigned
    • Success metrics defined
  2. Phase 1 Checklist

    • Resource quotas implemented
    • Monitoring tools installed
    • Initial cost baseline established
    • Team training completed
  3. Phase 2 Checklist

    • Autoscaling configured
    • Storage optimizations implemented
    • Network policies defined
    • Initial cost reductions verified
  4. Phase 3 Checklist

    • Advanced monitoring in place
    • Cost allocation implemented
    • Automated reporting configured
    • Team dashboards created
  5. Continuous Optimization

    • Weekly cost reviews scheduled
    • Monthly optimization targets set
    • Quarterly strategy reviews planned
    • Annual cost analysis framework established

Monitoring and Control

Cost Monitoring

Prometheus Integration

Track resource utilization and costs:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cost-metrics
spec:
  selector:
    matchLabels:
      app: cost-exporter
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - monitoring

Alert Configuration

Implement proactive cost controls:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-alerts
spec:
  groups:
  - name: cost.rules
    rules:
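    - alert: HighCostSpike
      # cluster_hourly_rate is not a built-in metric; it is assumed to be a
      # label-free recording rule holding the cost per CPU core-hour
      expr: |
        sum(
          rate(container_cpu_usage_seconds_total{container!=""}[1h])
        ) by (namespace) * on() group_left() cluster_hourly_rate > 100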
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: High cost detected in namespace {{ $labels.namespace }}
        description: Hourly cost has exceeded $100 threshold

Real-World Results

Cost Reduction Case Study

Before Optimization:
- Monthly Cost: $3,900
- Resource Utilization: 35%
- Idle Resources: 40%

After Optimization:
- Monthly Cost: $2,100 (46% reduction)
- Resource Utilization: 75%
- Idle Resources: 10%

Key Improvements:
1. Resource right-sizing: -25% cost
2. Spot instance usage: -15% cost
3. Autoscaling implementation: -20% cost
4. Storage optimization: -10% cost
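
The headline reduction can be checked directly from the before and after figures. A small sketch, with percent_reduction and annual_savings as illustrative names:

```python
def percent_reduction(before, after):
    """Percentage decrease from before to after, rounded to 0.1."""
    return round(100 * (before - after) / before, 1)


def annual_savings(before_monthly, after_monthly):
    """Savings over twelve months at the given monthly costs."""
    return (before_monthly - after_monthly) * 12


if __name__ == "__main__":
    # Monthly costs from the case study above (USD).
    print(f"Reduction: {percent_reduction(3900, 2100)}%")
    print(f"Annual savings: ${annual_savings(3900, 2100):,}")
```

Tracking the same two numbers month over month is a simple way to confirm that optimizations hold once workloads grow.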
