How to Set Up Production EKS Monitoring with Helm, ALB Grouping, and Custom SMTP Alerts

Monitoring a production Kubernetes cluster can quickly become complex. Determining how to manage storage, expose dashboards securely, and ensure you receive clean, actionable alerts requires careful planning.

In this guide, we will walk through how to deploy a production-grade monitoring setup on an Amazon EKS cluster using the standard kube-prometheus-stack Helm chart.

We will also cover how to:

  1. Reduce AWS Costs by routing Grafana, Prometheus, and Alertmanager through a single Application Load Balancer (ALB) using ALB Ingress Grouping.
  2. Configure Secure SMTP Alerts using Gmail SMTP.
  3. Solve Common Silent Pitfalls in Helm array-merging and SMTP handshakes.
  4. Beautify Alert Emails with a clean, custom HTML template.

🏗️ Architecture Overview

The core of our monitoring setup revolves around the Prometheus Operator. We will deploy:

  • Prometheus Server: To scrape, query, and store metric data.
  • Grafana: To visualize metrics using dashboards.
  • Alertmanager: To handle and route alerts triggered by Prometheus.
  • Node Exporter & Kube State Metrics: To gather infrastructure and cluster-level states.
  • AWS ALB Ingress Grouping: To bundle routing rules so that grafana.yourdomain.com, prometheus.yourdomain.com, and alertmanager.yourdomain.com are hosted behind a single AWS Load Balancer instance.
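
To make the grouping idea concrete, here is a minimal sketch that models how Ingress resources map to load balancers. It is a simplified illustration, not the AWS Load Balancer Controller's code: the function name plan_load_balancers and the dictionary layout are hypothetical, and the only behavior modeled is that Ingresses sharing a group.name share one ALB while ungrouped Ingresses each get their own.

```python
from collections import defaultdict

GROUP_ANNOTATION = "alb.ingress.kubernetes.io/group.name"


def plan_load_balancers(ingresses):
    """Map each ALB (keyed by its ingress group) to the hosts it would serve."""
    albs = defaultdict(list)
    for ingress in ingresses:
        # An Ingress without group.name forms its own implicit group,
        # so it ends up with a dedicated load balancer.
        group = ingress["annotations"].get(
            GROUP_ANNOTATION, f"implicit/{ingress['name']}"
        )
        albs[group].append(ingress["host"])
    return dict(albs)


hosts = {
    "grafana": "grafana.yourdomain.com",
    "prometheus": "prometheus.yourdomain.com",
    "alertmanager": "alertmanager.yourdomain.com",
}

# Three Ingresses with no group annotation.
ungrouped = [{"name": n, "host": h, "annotations": {}} for n, h in hosts.items()]

# The same three Ingresses sharing one group name, as in this guide.
grouped = [
    {"name": n, "host": h, "annotations": {GROUP_ANNOTATION: "your-alb-ingress-group"}}
    for n, h in hosts.items()
]

print(len(plan_load_balancers(ungrouped)))  # 3
print(len(plan_load_balancers(grouped)))    # 1
```

With the shared group name, all three hostnames become host-based rules on one ALB, which is where the cost saving comes from.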

🛠️ Step 1: Base Configuration (values.yaml)

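We start by setting up values.yaml to define resource limits, gp3-backed persistent storage, and the ingress parameters. The configuration assumes that a StorageClass named gp3 already exists in the cluster and that the AWS Load Balancer Controller is installed to serve the alb ingress class.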

Save the following file as values.yaml:

# values.yaml
global:
  rbac:
    create: true

defaultRules:
  create: true

# Prometheus Operator
prometheusOperator:
  enabled: true
  resources:
    limits: { cpu: 200m, memory: 256Mi }
    requests: { cpu: 100m, memory: 128Mi }

kubeStateMetrics:
  enabled: true

nodeExporter:
  enabled: true

# Grafana Configuration
grafana:
  enabled: true
  adminUser: admin
  assertNoLeakedSecrets: false # Required to set the SMTP password directly under grafana.ini (see values-secrets.yaml)
  service:
    type: ClusterIP
    port: 80
  persistence:
    enabled: true
    storageClassName: gp3
    size: 20Gi
    accessModes: [ReadWriteOnce]
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-uuid>
      alb.ingress.kubernetes.io/group.name: your-alb-ingress-group
      alb.ingress.kubernetes.io/group.order: '-3'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/target-type: ip
    hosts: [grafana.yourdomain.com]
    path: /
    pathType: Prefix
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }

# Prometheus Server Configuration
prometheus:
  enabled: true
  service:
    type: ClusterIP
    port: 9090
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-uuid>
      alb.ingress.kubernetes.io/group.name: your-alb-ingress-group
      alb.ingress.kubernetes.io/group.order: '-3'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/target-type: ip
    hosts: [prometheus.yourdomain.com]
    paths: [/]
    pathType: Prefix
  prometheusSpec:
    retention: 30d
    scrapeInterval: 30s
    evaluationInterval: 30s
    resources:
      requests: { cpu: 500m, memory: 2Gi }
      limits: { cpu: "2", memory: 4Gi }
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: [ReadWriteOnce]
          resources:
            requests: { storage: 100Gi }

# Alertmanager Configuration
alertmanager:
  enabled: true
  service:
    type: ClusterIP
    port: 9093
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-uuid>
      alb.ingress.kubernetes.io/group.name: your-alb-ingress-group
      alb.ingress.kubernetes.io/group.order: '-3'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/target-type: ip
    hosts: [alertmanager.yourdomain.com]
    paths: [/]
    pathType: Prefix
  alertmanagerSpec:
    replicas: 2 # Configured in HA Mode (2 replicas)
    resources:
      requests: { cpu: 100m, memory: 256Mi }
      limits: { cpu: 500m, memory: 512Mi }
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: [ReadWriteOnce]
          resources:
            requests: { storage: 5Gi }

  config:
    global:
      resolve_timeout: 1m
    route:
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 2m
      receiver: 'gmail-notifications'
      routes: [] # CRITICAL: Overrides default routes that point to the undefined 'null' receiver
    receivers:
      - name: 'gmail-notifications'

🔒 Step 2: Configuring Credentials (values-secrets.yaml)

To ensure credentials are not committed to Git, keep all sensitive parameters in a separate file and exclude that file from version control (for example, by listing it in .gitignore).

Create values-secrets.yaml:

# values-secrets.yaml
grafana:
  adminPassword: "YourStrongGrafanaPassword"
  grafana.ini:
    smtp:
      enabled: true
      host: "smtp.gmail.com:587"
      user: "your-email@gmail.com"
      password: "your-gmail-app-password" # 16-character Google App Password
      from_address: "your-email@gmail.com"
      from_name: "Grafana EKS Alerts"
      skip_verify: false

alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'your-email@gmail.com'
      smtp_auth_username: 'your-email@gmail.com'
      smtp_auth_password: 'your-gmail-app-password'
      smtp_require_tls: true

    receivers:
      - name: 'gmail-notifications'
        email_configs:
          - to: 'recipient-team@yourdomain.com'
            send_resolved: true
            auth_username: 'your-email@gmail.com'
            auth_password: 'your-gmail-app-password'
            from: 'your-email@gmail.com'
            smarthost: 'smtp.gmail.com:587'
            headers:
              Subject: '🇨🇦 [{{ .Status | toUpper }}] EKS Alert: {{ if .CommonLabels.alertname }}{{ .CommonLabels.alertname }}{{ else }}{{ .Alerts | len }} alerts{{ end }}'
            html: |
              # ... (See Step 3 for the Custom HTML template code)

🚨 Troubleshooting Alertmanager SMTP: The Silent Pitfalls

Setting up SMTP alerts with Alertmanager often leads to debugging cycles. Here are three critical issues we solved:

Pitfall 1: Helm Array Overwrite Bug (undefined receiver "null" used in route)

By default, the kube-prometheus-stack chart defines default sub-routing rules that point to a receiver named null. When you customize your receivers, Helm does not append to the array; it completely overwrites it.

If you define a custom receivers array without redefining null, Alertmanager fails to reconcile, showing:
failed to initialize from secret: undefined receiver "null" used in route

  • The Fix: Explicitly declare routes: [] under route: in your values.yaml to clear the default sub-routing rules that refer to the missing null receiver.
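
The following sketch reproduces the pitfall in miniature. It is a simplified model of Helm's value merging (maps merge key by key, lists are replaced wholesale), not Helm's actual implementation, and chart_defaults is an abridged stand-in for the chart's default Alertmanager config. The helper names merge_values and undefined_receivers are hypothetical.

```python
def merge_values(base, override):
    """Simplified model of Helm value merging.

    Maps are merged key by key; any other value, lists included,
    replaces the base value entirely.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


def undefined_receivers(config):
    """Return receiver names used in the route tree but not declared."""
    defined = {receiver["name"] for receiver in config.get("receivers", [])}
    used = set()

    def walk(route):
        if "receiver" in route:
            used.add(route["receiver"])
        for child in route.get("routes", []):
            walk(child)

    walk(config.get("route", {}))
    return sorted(used - defined)


# Abridged stand-in for the chart defaults (assumption, not the full config).
chart_defaults = {
    "route": {
        "receiver": "null",
        "routes": [{"receiver": "null", "matchers": ['alertname = "Watchdog"']}],
    },
    "receivers": [{"name": "null"}],
}

# Custom values WITHOUT `routes: []`: the default sub-route survives the merge.
broken = {
    "route": {"receiver": "gmail-notifications"},
    "receivers": [{"name": "gmail-notifications"}],
}

# Custom values WITH `routes: []`: the default sub-route list is replaced.
fixed = {
    "route": {"receiver": "gmail-notifications", "routes": []},
    "receivers": [{"name": "gmail-notifications"}],
}

print(undefined_receivers(merge_values(chart_defaults, broken)))  # ['null']
print(undefined_receivers(merge_values(chart_defaults, fixed)))   # []
```

The route map is merged, so the inherited routes list keeps pointing at null, while the receivers list is replaced and no longer declares it. Setting routes to an empty list removes the dangling reference.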

Pitfall 2: The auth_identity Handshake Rejection

When using Go-based PlainAuth (used by Alertmanager) with Gmail, specifying auth_identity as your email address will cause the authentication handshake to fail.

  • The Fix: Remove the auth_identity field completely or leave it as an empty string. Go's PlainAuth automatically assumes the login username when auth_identity is blank.
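
To see what changes on the wire, here is a small sketch of the SASL PLAIN message layout (identity, NUL, username, NUL, password), which is the format Go's PlainAuth sends. The function plain_initial_response is a hypothetical helper written for illustration, and the credentials are the placeholders used in this guide. The sketch only shows the difference between the two payloads; it does not explain why a given server rejects one of them.

```python
import base64


def plain_initial_response(identity, username, password):
    """Build a base64-encoded SASL PLAIN message: identity NUL username NUL password."""
    raw = f"{identity}\0{username}\0{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Placeholder credentials from this guide, not real ones.
user = "your-email@gmail.com"
secret = "your-gmail-app-password"

# auth_identity left blank: the first field is empty.
blank = plain_initial_response("", user, secret)

# auth_identity set to the email address: the first field is filled in.
explicit = plain_initial_response(user, user, secret)

print(base64.b64decode(blank).split(b"\0"))
print(base64.b64decode(explicit).split(b"\0"))
```

With a blank identity the message starts with an empty field, and the server derives the identity from the username that follows.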

Pitfall 3: Microsoft 365 Group Senders Block

If you are routing alerts to an Outlook Group email (e.g. alerts@yourdomain.com) and not getting the emails despite successful SMTP logs:

  • The Fix: Open your Microsoft 365 Outlook Group settings, go to Edit Settings, and ensure the checkbox "Let people outside the organization email the groups" is checked. Otherwise, Exchange blocks incoming emails from your external Gmail sender.

🎨 Step 3: Designing a Custom HTML Email Template

Rather than receiving unformatted text containing hundreds of internal Prometheus labels, we can configure a responsive HTML format:

Add this block under the html: parameter in the email_configs section of values-secrets.yaml:

<html>
<head>
  <style>
    body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif; color: #333333; line-height: 1.6; margin: 0; padding: 20px; background-color: #f9f9f9; }
    .container { max-width: 600px; margin: 0 auto; background: #ffffff; padding: 25px; border-radius: 8px; border: 1px solid #e1e4e8; box-shadow: 0 4px 12px rgba(0,0,0,0.05); }
    .header { margin-bottom: 20px; border-bottom: 2px solid #eaecef; padding-bottom: 10px; }
    .header h2 { margin: 0; color: #24292e; font-size: 20px; }
    .alert-item { margin-bottom: 20px; padding: 15px; border-radius: 6px; border-left: 5px solid #d9534f; background-color: #fff9f9; }
    .alert-item.resolved { border-left-color: #28a745; background-color: #f6ffed; }
    .alert-item.warning { border-left-color: #ffc107; background-color: #fffdf5; }
    .alert-title { font-size: 16px; font-weight: bold; margin-bottom: 10px; color: #24292e; }
    .meta-table { width: 100%; border-collapse: collapse; margin-top: 10px; }
    .meta-table td { padding: 4px 0; vertical-align: top; font-size: 13px; }
    .meta-label { width: 100px; font-weight: bold; color: #586069; }
    .meta-val { color: #24292e; }
    .description-box { margin-top: 10px; padding: 10px; background: #f6f8fa; border-radius: 4px; font-size: 13px; border: 1px solid #e1e4e8; }
    .footer { margin-top: 30px; font-size: 12px; color: #6a737d; text-align: center; border-top: 1px solid #eaecef; padding-top: 15px; }
    .btn { display: inline-block; padding: 6px 12px; font-size: 13px; font-weight: bold; color: #ffffff !important; background-color: #0366d6; text-decoration: none; border-radius: 4px; }
    .btn-runbook { background-color: #28a745; margin-left: 10px; }
  </style>
</head>
<body>
  <div class="container">
    <div class="header">
      <h2>🇨🇦 Canada EKS Production Alerts</h2>
    </div>
    {{ range .Alerts }}
      <div class="alert-item {{ if eq .Status "resolved" }}resolved{{ else }}{{ .Labels.severity }}{{ end }}">
        <div class="title alert-title">
          [{{ .Status | toUpper }}] {{ .Labels.alertname }}
        </div>
        <table class="meta-table">
          <tr>
            <td class="meta-label">Severity:</td>
            <td class="meta-val"><span style="text-transform: capitalize; font-weight: bold; color: {{ if eq .Labels.severity "critical" }}#d9534f{{ else if eq .Labels.severity "warning" }}#ffc107{{ else }}#28a745{{ end }}">{{ .Labels.severity }}</span></td>
          </tr>
          {{ if .Labels.namespace }}
          <tr>
            <td class="meta-label">Namespace:</td>
            <td class="meta-val"><code>{{ .Labels.namespace }}</code></td>
          </tr>
          {{ end }}
          {{ if .Labels.pod }}
          <tr>
            <td class="meta-label">Pod:</td>
            <td class="meta-val"><code>{{ .Labels.pod }}</code></td>
          </tr>
          {{ end }}
          {{ if .Labels.container }}
          <tr>
            <td class="meta-label">Container:</td>
            <td class="meta-val"><code>{{ .Labels.container }}</code></td>
          </tr>
          {{ end }}
          {{ if .Annotations.summary }}
          <tr>
            <td class="meta-label">Summary:</td>
            <td class="meta-val">{{ .Annotations.summary }}</td>
          </tr>
          {{ end }}
        </table>
        {{ if .Annotations.description }}
        <div class="description-box">
          <strong>Description:</strong><br/>
          {{ .Annotations.description }}
        </div>
        {{ end }}
        <div style="margin-top: 15px;">
          <a class="btn" href="{{ .GeneratorURL }}" target="_blank">View in Prometheus</a>
          {{ if .Annotations.runbook_url }}
            <a class="btn btn-runbook" href="{{ .Annotations.runbook_url }}" target="_blank">Runbook</a>
          {{ end }}
        </div>
      </div>
    {{ end }}
    <div class="footer">
      <p>Alertmanager: <a href="{{ .ExternalURL }}">View Active Alerts Console</a></p>
      <p style="font-size: 10px;">Sent automatically by Canada EKS Monitoring System.</p>
    </div>
  </div>
</body>
</html>
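
The template picks the card style and the severity text color with two separate conditionals. The sketch below mirrors that logic in plain Python so the outcomes are easy to check; it is an illustration only, the function names are hypothetical, and the color values are copied from the stylesheet above.

```python
def alert_css_class(status, severity):
    """Mirror of the class chosen on the alert-item div."""
    return "resolved" if status == "resolved" else severity


def severity_color(severity):
    """Mirror of the inline color chosen for the Severity cell."""
    if severity == "critical":
        return "#d9534f"
    if severity == "warning":
        return "#ffc107"
    return "#28a745"


# Border colors defined in the stylesheet. A class without its own rule
# (for example "critical") falls back to the base .alert-item color.
BORDER_BY_CLASS = {"resolved": "#28a745", "warning": "#ffc107"}
DEFAULT_BORDER = "#d9534f"


def card_border(status, severity):
    """Border color the card ends up with for a given alert."""
    return BORDER_BY_CLASS.get(alert_css_class(status, severity), DEFAULT_BORDER)


for status, severity in [
    ("firing", "critical"),
    ("firing", "warning"),
    ("resolved", "critical"),
]:
    print(status, severity, card_border(status, severity), severity_color(severity))
```

Note that a resolved critical alert gets the green card border while its Severity text stays red, because the text color depends only on the severity label.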

🚀 Step 4: Automated Deployment Script (deploy.ps1)

Automate adding the chart repository, refreshing its index, and running the deployment with this PowerShell script:

# deploy.ps1
$namespace = "monitoring"
$releaseName = "kube-prometheus-stack"
$helmRepoName = "prometheus-community"
$helmRepoUrl = "https://prometheus-community.github.io/helm-charts"

Write-Host "=== Starting EKS Monitoring Deployment ===" -ForegroundColor Green

# 1. Verify kubectl / helm
if (!(Get-Command helm -ErrorAction SilentlyContinue)) { Write-Error "Helm is missing."; exit 1 }
if (!(Get-Command kubectl -ErrorAction SilentlyContinue)) { Write-Error "kubectl is missing."; exit 1 }

# 2. Namespace
kubectl create namespace $namespace --dry-run=client -o yaml | kubectl apply -f -

# 3. Helm Repository
helm repo add $helmRepoName $helmRepoUrl
helm repo update $helmRepoName # Updates ONLY the targeted repository to prevent legacy issues

# 4. Deploy
helm upgrade --install $releaseName "$helmRepoName/kube-prometheus-stack" `
    --namespace $namespace `
    --values values.yaml `
    --values values-secrets.yaml

if ($LASTEXITCODE -eq 0) {
    Write-Host "=== Deployment completed successfully! ===" -ForegroundColor Green
} else {
    Write-Error "=== Deployment failed! ==="
    exit 1
}

To run:

.\deploy.ps1

📈 Conclusion

By grouping our Ingress routes behind a single AWS Application Load Balancer, we pay for one load balancer instead of three. Keeping credentials in a separate values file keeps them out of the main configuration, and a custom HTML template ensures that your on-call engineers get alerts that are readable, color-coded, and actionable. Happy helming!
