Kubernetes Monitoring Best Practices
Implement effective monitoring and observability in your Kubernetes clusters
Effective monitoring is crucial for maintaining healthy Kubernetes clusters. This guide covers essential monitoring and observability practices.
Prerequisites
- Basic understanding of Kubernetes
- Access to a Kubernetes cluster
- kubectl CLI tool installed
- Familiarity with monitoring concepts
Project Structure
.
├── monitoring/
│   ├── prometheus/        # Prometheus configurations
│   ├── grafana/           # Grafana dashboards
│   ├── alerts/            # Alert configurations
│   └── logging/           # Logging configurations
└── metrics/
    ├── custom-metrics/    # Custom metric definitions
    └── dashboards/        # Dashboard templates
Prometheus Setup
1. Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false
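The Prometheus resource above references a `prometheus` service account, which needs read access to the objects Prometheus discovers and scrapes. A minimal sketch of that RBAC follows, assuming a `monitoring` namespace (the namespace is an assumption, not something this guide specifies); tighten the resource list to what your cluster actually needs.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring        # assumed namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  # Read-only access to the objects used for service discovery and scraping
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring      # must match the ServiceAccount's namespace
```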
2. Service Monitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
  labels:
    team: frontend   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: web-app
  endpoints:
    - port: metrics
      interval: 15s
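A ServiceMonitor selects Services by label, not Pods, and its `port` field refers to a named Service port. The Service below is a sketch of what `app-monitor` would match; the Service name, pod label, and port number 8080 are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
  labels:
    app: web-app        # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: web-app        # assumed pod label
  ports:
    - name: metrics     # referenced by `port: metrics` in the ServiceMonitor
      port: 8080        # assumed port number
      targetPort: 8080
```

If the port is unnamed or named differently, the ServiceMonitor finds the Service but produces no scrape targets.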
Alert Configuration
1. Alert Manager
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
spec:
  replicas: 3
  configSecret: alertmanager-config
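The `configSecret` field points at a Secret in the same namespace that holds the Alertmanager configuration under the key `alertmanager.yaml`. The sketch below routes alerts by the `severity` label used in this guide; the receiver names and the webhook URL are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
stringData:
  alertmanager.yaml: |
    route:
      receiver: default
      group_by: ['alertname', 'namespace']
      routes:
        # Send critical alerts to a dedicated receiver
        - matchers:
            - severity="critical"
          receiver: oncall
    receivers:
      - name: default
      - name: oncall
        webhook_configs:
          - url: http://alert-webhook.example.internal/notify   # hypothetical endpoint
```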
2. Alert Rules
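apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
spec:
  groups:
    - name: kubernetes
      rules:
        - alert: HighCPUUsage
          # container_cpu_usage_seconds_total is a counter, so alert on its
          # per-second rate (CPU cores in use), not on the raw value.
          # 0.9 means 90% of one core; adjust to your containers' limits.
          expr: sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            description: Container {{ $labels.container }} has high CPU usage
        - alert: PodCrashLooping
          # The restart counter only grows, so alert on recent restarts.
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 15m
          labels:
            severity: critical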
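Rules and Alertmanager instances only take effect once the Prometheus resource selects them. As a sketch, the fields below could be added to the Prometheus spec; the `role: alert-rules` label and the `monitoring` namespace are assumptions, and `alertmanager-operated` is the Service the Prometheus Operator creates for Alertmanager instances.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  # Load PrometheusRule objects carrying this label
  ruleSelector:
    matchLabels:
      role: alert-rules          # hypothetical label; add it to the PrometheusRule metadata
  # Send firing alerts to the operator-managed Alertmanager Service
  alerting:
    alertmanagers:
      - namespace: monitoring    # assumed namespace
        name: alertmanager-operated
        port: web
```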
Logging Setup
1. Fluentd Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
      </parse>
    </source>
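The configuration above only tails container logs; the records still need an output. The sketch below adds a `<match>` block that ships them to Elasticsearch. It assumes the Fluentd image includes the `fluent-plugin-elasticsearch` plugin, that the ECK cluster named `logging` runs in the same namespace (ECK exposes it as `logging-es-http`), and that the password and CA certificate are provided through a hypothetical `ELASTIC_PASSWORD` environment variable and a hypothetical mount path.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    # (the <source> block shown earlier goes here)
    <match kubernetes.**>
      @type elasticsearch
      host logging-es-http
      port 9200
      scheme https
      user elastic
      # Read the password from the environment instead of hard-coding it
      password "#{ENV['ELASTIC_PASSWORD']}"
      # Assumed mount path for the cluster's CA certificate
      ca_file /fluentd/certs/ca.crt
      logstash_format true
      logstash_prefix kubernetes
    </match>
```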
2. Elasticsearch Configuration
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: logging
spec:
  version: 7.15.0
  nodeSets:
    - name: default
      count: 3
      config:
        node.store.allow_mmap: false
Metrics Collection
1. Custom Metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-metrics
data:
  metrics.yaml: |
    metrics:
      - name: http_requests_total
        type: counter
        help: Total number of HTTP requests
      - name: http_request_duration_seconds
        type: histogram
        help: HTTP request duration in seconds
2. Node Exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: prom/node-exporter
          ports:
            - containerPort: 9100
              name: metrics
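The DaemonSet exposes metrics on a named container port, but nothing scrapes it yet. One option, sketched here, is a PodMonitor that targets the pods directly. The Prometheus resource in this guide sets only `serviceMonitorSelector`, so this assumes you also add a `podMonitorSelector` to it; the 30s interval is an arbitrary choice.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: node-exporter
  labels:
    team: frontend   # only needed if the Prometheus podMonitorSelector filters on this label
spec:
  selector:
    matchLabels:
      app: node-exporter   # pod label from the DaemonSet template
  podMetricsEndpoints:
    - port: metrics        # named container port from the DaemonSet
      interval: 30s
```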
Dashboard Configuration
1. Grafana Dashboard
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: kubernetes-monitoring
spec:
  json: |
    {
      "title": "Kubernetes Monitoring",
      "panels": [
        {
          "title": "CPU Usage",
          "type": "graph",
          "datasource": "Prometheus"
        },
        {
          "title": "Memory Usage",
          "type": "graph",
          "datasource": "Prometheus"
        }
      ]
    }
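The panels above refer to a data source named `Prometheus`, which has to exist in Grafana. With the same operator API group, a data source might be declared as sketched below. The resource name is hypothetical, and the URL assumes Grafana runs in the same namespace as the operator-created `prometheus-operated` Service.

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus-datasource     # hypothetical name
spec:
  name: prometheus.yaml
  datasources:
    - name: Prometheus            # must match the "datasource" value in the dashboard panels
      type: prometheus
      access: proxy
      url: http://prometheus-operated:9090   # assumed in-cluster address
      isDefault: true
```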
Best Practices Checklist
- ✅ Set up Prometheus monitoring
- ✅ Configure AlertManager
- ✅ Implement logging solution
- ✅ Create custom metrics
- ✅ Set up dashboards
- ✅ Monitor cluster health
- ✅ Configure alerts
- ✅ Implement log aggregation
- ✅ Set up visualization
- ✅ Regular monitoring review
Monitoring Components
Core Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
- Pod status
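Most of the core metrics listed above can be derived from node-exporter and kube-state-metrics series. The recording rules below are a sketch, assuming both exporters are scraped; the `record` names are illustrative rather than an established convention in this guide.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: core-metrics
spec:
  groups:
    - name: core-metrics
      rules:
        # Fraction of CPU time spent non-idle, per node
        - record: instance:node_cpu_utilisation:ratio
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
        # Fraction of memory in use, per node
        - record: instance:node_memory_utilisation:ratio
          expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
        # Seconds per second that disks were busy, summed per node
        - record: instance:node_disk_io_time:rate5m
          expr: sum by (instance) (rate(node_disk_io_time_seconds_total[5m]))
        # Inbound network throughput in bytes per second, per node
        - record: instance:node_network_receive_bytes:rate5m
          expr: sum by (instance) (rate(node_network_receive_bytes_total[5m]))
        # Number of pods per namespace that are not running normally
        - record: namespace:kube_pod_not_running:count
          expr: count by (namespace) (kube_pod_status_phase{phase=~"Pending|Failed|Unknown"} == 1)
```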
Application Metrics
- Request latency
- Error rates
- Throughput
- Custom business metrics
- Dependencies health
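Throughput, error rate, and latency can be computed from the `http_requests_total` counter and `http_request_duration_seconds` histogram defined earlier. The sketch below assumes the counter carries a `code` label with the HTTP status, which this guide does not specify; the `record` names are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-metrics
spec:
  groups:
    - name: application-metrics
      rules:
        # Throughput: requests per second, per job
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        # Error rate: share of requests answered with a 5xx status (assumes a `code` label)
        - record: job:http_request_errors:ratio_rate5m
          expr: |
            sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
            sum by (job) (rate(http_requests_total[5m]))
        # Latency: 95th percentile request duration estimated from histogram buckets
        - record: job:http_request_duration_seconds:p95
          expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
```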
Infrastructure Metrics
- Node health
- Cluster capacity
- Resource utilization
- Network performance
- Storage metrics
Alert Priorities
Critical Alerts
- Node failures
- Pod crashes
- Service outages
- Resource exhaustion
- Security incidents
Warning Alerts
- High resource usage
- Slow response times
- Backup failures
- Certificate expiration
- Storage warnings
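These priorities map onto the `severity` label in alert rules. As a sketch, one critical and one warning rule might look like the following, assuming kube-state-metrics and node-exporter are scraped; the thresholds and durations are illustrative starting points.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: priority-alerts
spec:
  groups:
    - name: priority-alerts
      rules:
        # Critical: a node has not reported Ready for 5 minutes
        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            description: Node {{ $labels.node }} is not Ready
        # Warning: less than 10% free space on a node filesystem
        - alert: NodeFilesystemAlmostFull
          expr: |
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            /
            node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            description: Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is almost full
```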
Best Practices by Component
Prometheus
- Use service discovery
- Configure retention
- Set up federation
- Implement recording rules
- Regular backups
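Retention and persistent storage are set on the Prometheus resource itself. The values below are placeholders to show the fields, not sizing recommendations; without a `storage` section, data generally does not survive pod restarts.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  retention: 15d            # keep samples for 15 days (placeholder value)
  retentionSize: 40GB       # or until the TSDB reaches this size, whichever comes first
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 50Gi   # placeholder; leave headroom above retentionSize
```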
Grafana
- Use templating
- Organize dashboards
- Set up authentication
- Configure alerting
- Regular updates
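Templating lets one dashboard serve many namespaces or clusters through variables. The sketch below defines a `namespace` variable and uses it in a panel query; it assumes kube-state-metrics provides `kube_pod_info`, and the dashboard name and title are hypothetical.

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: namespace-overview
spec:
  json: |
    {
      "title": "Namespace Overview",
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "datasource": "Prometheus",
            "query": "label_values(kube_pod_info, namespace)",
            "refresh": 1
          }
        ]
      },
      "panels": [
        {
          "title": "CPU Usage by Pod",
          "type": "graph",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum by (pod) (rate(container_cpu_usage_seconds_total{namespace=\"$namespace\"}[5m]))"
            }
          ]
        }
      ]
    }
```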
Logging
- Structured logging
- Log rotation
- Index management
- Search optimization
- Retention policies
Conclusion
These practices give you visibility into cluster health and shorten the time it takes to diagnose problems. Review and update your monitoring configurations regularly so that they keep pace with changes to your workloads.