ArgoCD Monitoring and Alerting

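This guide covers monitoring and alerting for ArgoCD: scraping metrics with Prometheus, visualizing them in Grafana, defining alert rules, sending notifications, and collecting logs. The ServiceMonitor and PrometheusRule examples assume the Prometheus Operator is installed in the cluster.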

Prometheus Integration

1. ServiceMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: argocd
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-metrics
  endpoints:
  - port: metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - argocd

2. API Server Metrics

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-server-metrics
  namespace: argocd
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      # The metrics port lives on the argocd-server-metrics Service
      app.kubernetes.io/name: argocd-server-metrics
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
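
The two ServiceMonitors above cover the application controller and the API server. The repo server also exposes metrics, so a third ServiceMonitor can scrape it. The sketch below assumes the default install manifests, where the repo server Service carries the label app.kubernetes.io/name: argocd-repo-server and a port named metrics. The release: prometheus label is an assumption about how your Prometheus instance selects ServiceMonitors.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-repo-server-metrics
  namespace: argocd
  labels:
    release: prometheus   # assumption: must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-repo-server
  endpoints:
  - port: metrics
    interval: 30s
```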

Grafana Dashboards

1. ArgoCD Overview Dashboard

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "true"
data:
  argocd-overview.json: |
    {
      "title": "ArgoCD Overview",
      "panels": [
        {
          "title": "Sync Status",
          "type": "gauge",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "sum(argocd_app_info{sync_status=\"Synced\"})"
            }
          ]
        },
        {
          "title": "Health Status",
          "type": "gauge",
          "targets": [
            {
              "expr": "sum(argocd_app_info{health_status=\"Healthy\"})"
            }
          ]
        }
      ]
    }

2. Application Performance Dashboard

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-performance-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "true"
data:
  app-performance.json: |
    {
      "title": "Application Performance",
      "panels": [
        {
          "title": "Sync Duration",
          "type": "graph",
          "targets": [
            {
              "expr": "rate(argocd_app_sync_duration_seconds_sum[5m])"
            }
          ]
        },
        {
          "title": "Resource Operations",
          "type": "graph",
          "targets": [
            {
              "expr": "sum(rate(argocd_app_k8s_request_total[5m])) by (verb)"
            }
          ]
        }
      ]
    }

Alert Rules

1. Sync Status Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
  namespace: monitoring
spec:
  groups:
  - name: argocd
    rules:
    - alert: ApplicationOutOfSync
      expr: |
        sum(argocd_app_info{sync_status!="Synced"}) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Application out of sync for more than 15 minutes
        description: "{{ $value }} applications are out of sync"

    - alert: ApplicationUnhealthy
      expr: |
        sum(argocd_app_info{health_status!="Healthy"}) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Application health check failed
        description: "{{ $value }} applications are unhealthy"
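
The rules above aggregate with sum(), so a firing alert reports how many applications are affected but not which ones. A per-application variant is sketched below. It assumes the argocd_app_info metric with its name, project, and sync_status labels, as exposed by recent ArgoCD releases; the rule, group, and alert names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-app-alerts        # hypothetical name
  namespace: monitoring
spec:
  groups:
  - name: argocd-per-app         # hypothetical group
    rules:
    - alert: ApplicationOutOfSyncPerApp
      # argocd_app_info has value 1 per application, so one alert fires per match
      expr: |
        argocd_app_info{sync_status="OutOfSync"} == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Application {{ $labels.name }} is out of sync"
        description: "{{ $labels.name }} (project {{ $labels.project }}) has been OutOfSync for more than 15 minutes"
```

Keeping the per-series labels lets Alertmanager group or route by application name or project.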

2. Performance Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-performance-alerts
spec:
  groups:
  - name: argocd-performance
    rules:
    - alert: HighSyncFailureRate
      expr: |
        rate(argocd_app_sync_total{phase=~"Error|Failed"}[5m]) > 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: High sync failure rate detected
        
    - alert: SlowSync
      expr: |
        argocd_app_sync_duration_seconds > 300
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Sync operation taking too long

Notification Templates

1. Slack Notifications

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
data:
  service.slack: |
    token: $slack-token
  template.app-sync-status: |
    message: |
      Application {{.app.metadata.name}} sync status is {{.app.status.sync.status}}
      Application details: {{.context.argocdUrl}}/applications/{{.app.metadata.name}}
  template.app-health-status: |
    message: |
      Application {{.app.metadata.name}} health status is {{.app.status.health.status}}
      Application details: {{.context.argocdUrl}}/applications/{{.app.metadata.name}}
  trigger.on-sync-status-changed: |
    - when: app.status.sync.status == 'OutOfSync'
      send: [app-sync-status]
  trigger.on-health-status-changed: |
    - when: app.status.health.status == 'Degraded'
      send: [app-health-status]
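
The ConfigMap above references $slack-token but does not define it, and a trigger only sends messages for applications that subscribe to it. The sketch below shows both missing pieces, assuming the standard argocd-notifications-secret Secret and the subscribe annotation format. The application name myapp and the channel deployments are hypothetical, and the token is a placeholder.

```yaml
# Secret that supplies the value referenced as $slack-token
apiVersion: v1
kind: Secret
metadata:
  name: argocd-notifications-secret
  namespace: argocd
stringData:
  slack-token: "<your-slack-bot-token>"   # placeholder; never commit a real token
---
# Subscribe one application to the trigger (spec omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp                              # hypothetical application
  namespace: argocd
  annotations:
    # Format: subscribe.<trigger>.<service>: <recipient>
    notifications.argoproj.io/subscribe.on-sync-status-changed.slack: deployments
```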

2. Email Notifications

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
data:
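  service.email: |
    host: smtp.gmail.com
    port: 587
    from: argocd@yourcompany.com
    # Credentials are read from argocd-notifications-secret
    username: $email-username
    password: $email-password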
  template.app-sync-failed: |
    email:
      subject: Application {{.app.metadata.name}} sync failed
    # The email body is taken from the template's message field
    message: |
      Application {{.app.metadata.name}} sync operation failed.
      Time: {{.app.status.operationState.finishedAt}}
      Error: {{.app.status.operationState.message}}
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
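
The Slack and email examples both define a ConfigMap named argocd-notifications-cm, so their data keys belong in a single ConfigMap; applied as separate manifests, the later one would replace the earlier. The sketch below covers the rest of an email setup: SMTP credentials stored under the secret keys email-username and email-password (referenced from service.email as $email-username and $email-password when the server requires authentication) and an application subscription. The application name and recipient address are hypothetical.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: argocd-notifications-secret
  namespace: argocd
stringData:
  email-username: "argocd@yourcompany.com"
  email-password: "<smtp-password>"        # placeholder
---
# Application metadata only; spec omitted for brevity
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp                              # hypothetical application
  namespace: argocd
  annotations:
    # Multiple recipients can be separated with semicolons
    notifications.argoproj.io/subscribe.on-sync-failed.email: oncall@yourcompany.com
```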

Custom Metrics

1. Application-specific Metrics

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: myapp
  endpoints:
  - port: metrics
    path: /metrics
    interval: 30s
    metricRelabelings:
    - sourceLabels: [__name__]
      regex: 'app_.*'
      action: keep

2. Resource Usage Metrics

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-metrics
spec:
  groups:
  - name: resources
    rules:
    - record: argocd:app_resource_usage:memory
      expr: |
        sum(
          container_memory_usage_bytes{container!=""}
        ) by (namespace, pod)
    - record: argocd:app_resource_usage:cpu
      expr: |
        sum(
          rate(container_cpu_usage_seconds_total{container!=""}[5m])
        ) by (namespace, pod)
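
Recorded series can be used in alert expressions like any other metric. As an illustration, the rule below alerts when a pod in the argocd namespace exceeds a memory threshold, using the recording rule defined above. The alert name and the 1 GiB threshold are assumptions to adjust for your environment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-resource-alerts   # hypothetical name
spec:
  groups:
  - name: argocd-resources
    rules:
    - alert: ArgoCDPodHighMemory
      # 1073741824 bytes = 1 GiB; threshold is an assumption
      expr: |
        argocd:app_resource_usage:memory{namespace="argocd"} > 1073741824
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is using more than 1 GiB of memory"
```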

Logging Configuration

1. Logging Setup

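apiVersion: v1
kind: ConfigMap
metadata:
  # Log settings are read from argocd-cmd-params-cm, not argocd-cm
  name: argocd-cmd-params-cm
data:
  # Debug output is verbose; return to info after troubleshooting.
  # Restart the affected workloads after changing these values.
  controller.log.level: debug
  controller.log.format: json
  reposerver.log.level: debug
  reposerver.log.format: json
  server.log.level: debug
  server.log.format: json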

2. Log Aggregation

apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: argocd-logs
spec:
  filters:
    - tag_normaliser: {}
    - parser:
        remove_key_name_field: true
        reserve_data: true
        parse:
          type: json
  match:
    - select:
        labels:
          app.kubernetes.io/name: argocd-server
  localOutputRefs:
    - elasticsearch
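
The Flow above sends logs to an output named elasticsearch through localOutputRefs, which must exist as an Output resource in the same namespace as the Flow. A minimal sketch of that resource follows, assuming the Logging operator's Elasticsearch output plugin. The host, port, and buffer timings are hypothetical values.

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: elasticsearch            # must match the entry in localOutputRefs
  namespace: argocd              # assumption: same namespace as the Flow
spec:
  elasticsearch:
    host: elasticsearch.logging.svc.cluster.local   # hypothetical host
    port: 9200
    scheme: http
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
```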

Best Practices Checklist

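  1. Scrape metrics from every ArgoCD component
  2. Build dashboards for sync and health status
  3. Define alert rules for sync and health failures
  4. Route notifications to Slack or email
  5. Monitor resource usage
  6. Track sync performance
  7. Aggregate logs centrally
  8. Add custom metrics where the built-in ones fall short
  9. Review alerts and dashboards regularly
  10. Document the monitoring setup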

Performance Optimization

1. Resource Limits

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
spec:
  template:
    spec:
      containers:
      - name: argocd-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1024Mi

2. Scaling Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
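            podAffinityTerm:
              # Without a labelSelector the term matches no pods
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: argocd-repo-server
              topologyKey: kubernetes.io/hostname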

Conclusion

Monitoring and alerting keep an ArgoCD installation healthy only while they stay current. Review dashboards, alert thresholds, and notification routes whenever applications or cluster topology change.
