Monitoring a production Kubernetes cluster can quickly become complex. Determining how to manage storage, expose dashboards securely, and ensure you receive clean, actionable alerts requires careful planning.
In this guide, we will walk through how to deploy a production-grade monitoring setup on an Amazon EKS cluster using the standard kube-prometheus-stack Helm chart.
We will also cover how to:
- Reduce AWS Costs by routing Grafana, Prometheus, and Alertmanager through a single Application Load Balancer (ALB) using ALB Ingress Grouping.
- Configure Secure SMTP Alerts using Gmail SMTP.
- Solve Common Silent Pitfalls in Helm array-merging and SMTP handshakes.
- Beautify Alert Emails with a clean, custom HTML template.
🏗️ Architecture Overview
The core of our monitoring setup revolves around the Prometheus Operator. We will deploy:
- Prometheus Server: To scrape, query, and store metric data.
- Grafana: To visualize metrics using dashboards.
- Alertmanager: To handle and route alerts triggered by Prometheus.
- Node Exporter & Kube State Metrics: To gather infrastructure and cluster-level states.
- AWS ALB Ingress Grouping: To bundle routing rules so that grafana.yourdomain.com, prometheus.yourdomain.com, and alertmanager.yourdomain.com are hosted behind a single AWS Load Balancer instance.
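To make the grouping idea concrete, here is a minimal Python sketch that models each Ingress as a dictionary and counts one load balancer per distinct group.name value. It only illustrates the counting, not how the AWS Load Balancer Controller is implemented; the Ingress names and the dictionary layout are hypothetical.

```python
from collections import defaultdict

GROUP_ANNOTATION = "alb.ingress.kubernetes.io/group.name"


def load_balancers_needed(ingresses):
    """Map each ALB (one per group, or one per ungrouped Ingress) to its hosts."""
    albs = defaultdict(list)
    for ing in ingresses:
        group = ing["annotations"].get(GROUP_ANNOTATION)
        # An Ingress without a group name ends up with a load balancer of its own.
        key = f"group:{group}" if group else f"ingress:{ing['name']}"
        albs[key].append(ing["host"])
    return dict(albs)


def make_ingresses(grouped):
    """Build the three example Ingresses, with or without the group annotation."""
    annotations = {GROUP_ANNOTATION: "your-alb-ingress-group"} if grouped else {}
    return [
        {"name": name, "host": f"{name}.yourdomain.com", "annotations": dict(annotations)}
        for name in ("grafana", "prometheus", "alertmanager")
    ]


if __name__ == "__main__":
    print(len(load_balancers_needed(make_ingresses(grouped=False))), "ALBs without grouping")
    print(len(load_balancers_needed(make_ingresses(grouped=True))), "ALB with grouping")
```

With the shared annotation, all three hostnames become host-based rules on the same load balancer, which is where the cost saving comes from.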
🛠️ Step 1: Base Configuration (values.yaml)
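We start with values.yaml, which defines resource limits, gp3-backed persistent storage, and the ingress parameters. The storageClassName: gp3 entries assume that a StorageClass named gp3 already exists in the cluster.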
Save the following file as values.yaml:
# values.yaml
global:
  rbac:
    create: true
defaultRules:
  create: true
# Prometheus Operator
prometheusOperator:
  enabled: true
  resources:
    limits: { cpu: 200m, memory: 256Mi }
    requests: { cpu: 100m, memory: 128Mi }
kubeStateMetrics:
  enabled: true
nodeExporter:
  enabled: true
# Grafana Configuration
grafana:
  enabled: true
  adminUser: admin
  assertNoLeakedSecrets: false # Required to pass SMTP password directly in secrets
  service:
    type: ClusterIP
    port: 80
  persistence:
    enabled: true
    storageClassName: gp3
    size: 20Gi
    accessModes: [ReadWriteOnce]
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-uuid>
      alb.ingress.kubernetes.io/group.name: your-alb-ingress-group
      alb.ingress.kubernetes.io/group.order: '-3'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/target-type: ip
    hosts: [grafana.yourdomain.com]
    path: /
    pathType: Prefix
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: 500m, memory: 512Mi }
# Prometheus Server Configuration
prometheus:
  enabled: true
  service:
    type: ClusterIP
    port: 9090
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-uuid>
      alb.ingress.kubernetes.io/group.name: your-alb-ingress-group
      alb.ingress.kubernetes.io/group.order: '-3'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/target-type: ip
    hosts: [prometheus.yourdomain.com]
    paths: [/]
    pathType: Prefix
  prometheusSpec:
    retention: 30d
    scrapeInterval: 30s
    evaluationInterval: 30s
    resources:
      requests: { cpu: 500m, memory: 2Gi }
      limits: { cpu: "2", memory: 4Gi }
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: [ReadWriteOnce]
          resources:
            requests: { storage: 100Gi }
# Alertmanager Configuration
alertmanager:
  enabled: true
  service:
    type: ClusterIP
    port: 9093
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:<your-region>:<your-account-id>:certificate/<your-certificate-uuid>
      alb.ingress.kubernetes.io/group.name: your-alb-ingress-group
      alb.ingress.kubernetes.io/group.order: '-3'
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/ssl-redirect: '443'
      alb.ingress.kubernetes.io/target-type: ip
    hosts: [alertmanager.yourdomain.com]
    paths: [/]
    pathType: Prefix
  alertmanagerSpec:
    replicas: 2 # Configured in HA Mode (2 replicas)
    resources:
      requests: { cpu: 100m, memory: 256Mi }
      limits: { cpu: 500m, memory: 512Mi }
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: [ReadWriteOnce]
          resources:
            requests: { storage: 5Gi }
  config:
    global:
      resolve_timeout: 1m
    route:
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 2m
      receiver: 'gmail-notifications'
      routes: [] # CRITICAL: Overrides default routes that point to the undefined 'null' receiver
    receivers:
      - name: 'gmail-notifications'
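As a rough sanity check on the 100Gi Prometheus volume, the Prometheus storage documentation offers the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The Python sketch below applies that formula to the retention and scrape interval configured above. The active series count is a made-up example rather than a measurement, and the estimate ignores other overhead, so treat the result as a ballpark only.

```python
def estimated_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rule-of-thumb estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # 300_000 active series is a hypothetical figure; measure your own cluster.
    needed = estimated_disk_bytes(active_series=300_000, scrape_interval_s=30, retention_days=30)
    provisioned = 100 * 1024**3  # the 100Gi requested in storageSpec
    print(f"estimated: {needed / 1024**3:.1f} GiB of {provisioned / 1024**3:.0f} GiB provisioned")
```

If the estimate approaches the provisioned size, either shorten retention or request a larger volume before deploying.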
Step 2: Configuring Credentials (values-secrets.yaml)
To ensure credentials are not committed to Git, keep all sensitive parameters in a separate file.
Create values-secrets.yaml:
# values-secrets.yaml
grafana:
  adminPassword: "YourStrongGrafanaPassword"
  grafana.ini:
    smtp:
      enabled: true
      host: "smtp.gmail.com:587"
      user: "your-email@gmail.com"
      password: "your-gmail-app-password" # 16-character Google App Password
      from_address: "your-email@gmail.com"
      from_name: "Grafana EKS Alerts"
      skip_verify: false
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'your-email@gmail.com'
      smtp_auth_username: 'your-email@gmail.com'
      smtp_auth_password: 'your-gmail-app-password'
      smtp_require_tls: true
    receivers:
      - name: 'gmail-notifications'
        email_configs:
          - to: 'recipient-team@yourdomain.com'
            send_resolved: true
            auth_username: 'your-email@gmail.com'
            auth_password: 'your-gmail-app-password'
            from: 'your-email@gmail.com'
            smarthost: 'smtp.gmail.com:587'
            headers:
              Subject: '🇨🇦 [{{ .Status | toUpper }}] EKS Alert: {{ if .CommonLabels.alertname }}{{ .CommonLabels.alertname }}{{ else }}{{ .Alerts | len }} alerts{{ end }}'
            html: |
              # ... (See Step 3 for the Custom HTML template code)
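The Subject header is a Go template, and its branch logic is easy to misread. The Python function below is only a stand-in that mirrors that logic (Alertmanager itself renders the real template), and it leaves out the flag prefix. CommonLabels holds only the labels shared by every alert in the notification, so the alert name appears in the subject only when all grouped alerts share it; otherwise the subject falls back to a count. The alert name in the usage lines is just a sample value.

```python
def render_subject(status, common_labels, alerts):
    """Mirror the Subject template: shared alertname if present, else an alert count."""
    name = common_labels.get("alertname")
    detail = name if name else f"{len(alerts)} alerts"
    return f"[{status.upper()}] EKS Alert: {detail}"


if __name__ == "__main__":
    # One alert, or several alerts that all share the same alertname label.
    print(render_subject("firing", {"alertname": "KubePodCrashLooping"}, [{}]))
    # A group of three alerts with different names: no common alertname label.
    print(render_subject("resolved", {}, [{}, {}, {}]))
```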
🚨 Troubleshooting Alertmanager SMTP: The Silent Pitfalls
Setting up SMTP alerts with Alertmanager often leads to debugging cycles. Here are three critical issues we solved:
Pitfall 1: Helm Array Overwrite Bug (undefined receiver "null" used in route)
By default, the kube-prometheus-stack chart defines default sub-routing rules that point to a receiver named null. When you customize your receivers, Helm does not append to the array; it overwrites it completely.
If you define a custom receivers array without redefining null, Alertmanager fails to reconcile, showing: failed to initialize from secret: undefined receiver "null" used in route
- The Fix: Explicitly declare routes: [] under route: in your values.yaml to clear the default sub-routing rules that refer to the missing null receiver.
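The merge behaviour behind this pitfall can be reproduced in a few lines of Python. This is a simplified model, not Helm's code: dictionaries are merged key by key, while lists are replaced wholesale. The chart_defaults dictionary is a cut-down, assumed stand-in for the chart's real default Alertmanager config, and the check only looks at the top-level route and its direct child routes.

```python
import copy


def merge_values(base, override):
    """Deep-merge dicts; any non-dict value (including lists) is replaced outright."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def undefined_receivers(config):
    """Return receiver names used by the route tree (one level deep) but never defined."""
    defined = {r["name"] for r in config.get("receivers", [])}
    route = config.get("route", {})
    used = [route.get("receiver")] + [r.get("receiver") for r in route.get("routes", [])]
    return sorted({name for name in used if name and name not in defined})


# Assumed, simplified version of the chart defaults.
chart_defaults = {
    "route": {
        "receiver": "null",
        "routes": [{"receiver": "null", "matchers": ['alertname = "Watchdog"']}],
    },
    "receivers": [{"name": "null"}],
}
user_values = {
    "route": {"receiver": "gmail-notifications"},
    "receivers": [{"name": "gmail-notifications"}],
}

# Without routes: [] the default sub-route survives while its receiver is gone.
broken = merge_values(chart_defaults, user_values)
# With routes: [] the default sub-route list is replaced by an empty one.
fixed = merge_values(
    chart_defaults,
    {**user_values, "route": {"receiver": "gmail-notifications", "routes": []}},
)

if __name__ == "__main__":
    print("without routes: []  ->", undefined_receivers(broken))
    print("with routes: []     ->", undefined_receivers(fixed))
```

The receivers list is replaced because it is a list, but route is a map, so its routes key is inherited from the defaults unless it is overridden explicitly.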
Pitfall 2: The auth_identity Handshake Rejection
When using Go-based PlainAuth (used by Alertmanager) with Gmail, specifying auth_identity as your email address will cause the authentication handshake to fail.
- The Fix: Remove the auth_identity field completely or leave it as an empty string. With a blank identity, the PLAIN exchange carries only the username and password, and the server authorizes the session as the login username.
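For context, the PLAIN mechanism (RFC 4616) sends a single message of the form authorization identity, NUL, username, NUL, password, base64-encoded on the wire. The Python sketch below builds that message so the difference between a blank and a populated identity is visible. It illustrates the wire format only; it does not reproduce Alertmanager's code and makes no claim about why a particular server rejects a populated identity. The credentials are placeholders.

```python
import base64


def sasl_plain_response(authzid, authcid, password):
    """Build the base64 SASL PLAIN message: authzid NUL authcid NUL password."""
    raw = "\x00".join([authzid, authcid, password]).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    user = "your-email@gmail.com"
    secret = "your-gmail-app-password"
    # Blank identity: the message starts with a NUL byte.
    print(base64.b64decode(sasl_plain_response("", user, secret)))
    # Populated identity: the address is sent twice, once as the identity.
    print(base64.b64decode(sasl_plain_response(user, user, secret)))
```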
Pitfall 3: Microsoft 365 Group Senders Block
If you are routing alerts to an Outlook Group email (e.g. alerts@yourdomain.com) and not getting the emails despite successful SMTP logs:
- The Fix: Open your Microsoft 365 Outlook Group settings, go to Edit Settings, and ensure the checkbox "Let people outside the organization email the groups" is checked. Otherwise, Exchange blocks incoming emails from your external Gmail sender.
🎨 Step 3: Designing a Custom HTML Email Template
Rather than receiving unformatted text containing hundreds of internal Prometheus labels, we can configure a responsive HTML format:
Add this block under the html: parameter in the email_configs section of values-secrets.yaml, indented so that every line sits inside the html: | block scalar:
<html>
<head>
<style>
body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif; color: #333333; line-height: 1.6; margin: 0; padding: 20px; background-color: #f9f9f9; }
.container { max-width: 600px; margin: 0 auto; background: #ffffff; padding: 25px; border-radius: 8px; border: 1px solid #e1e4e8; box-shadow: 0 4px 12px rgba(0,0,0,0.05); }
.header { margin-bottom: 20px; border-bottom: 2px solid #eaecef; padding-bottom: 10px; }
.header h2 { margin: 0; color: #24292e; font-size: 20px; }
.alert-item { margin-bottom: 20px; padding: 15px; border-radius: 6px; border-left: 5px solid #d9534f; background-color: #fff9f9; }
.alert-item.resolved { border-left-color: #28a745; background-color: #f6ffed; }
.alert-item.warning { border-left-color: #ffc107; background-color: #fffdf5; }
.alert-title { font-size: 16px; font-weight: bold; margin-bottom: 10px; color: #24292e; }
.meta-table { width: 100%; border-collapse: collapse; margin-top: 10px; }
.meta-table td { padding: 4px 0; vertical-align: top; font-size: 13px; }
.meta-label { width: 100px; font-weight: bold; color: #586069; }
.meta-val { color: #24292e; }
.description-box { margin-top: 10px; padding: 10px; background: #f6f8fa; border-radius: 4px; font-size: 13px; border: 1px solid #e1e4e8; }
.footer { margin-top: 30px; font-size: 12px; color: #6a737d; text-align: center; border-top: 1px solid #eaecef; padding-top: 15px; }
.btn { display: inline-block; padding: 6px 12px; font-size: 13px; font-weight: bold; color: #ffffff !important; background-color: #0366d6; text-decoration: none; border-radius: 4px; }
.btn-runbook { background-color: #28a745; margin-left: 10px; }
</style>
</head>
<body>
<div class="container">
<div class="header">
<h2>🇨🇦 Canada EKS Production Alerts</h2>
</div>
{{ range .Alerts }}
<div class="alert-item {{ if eq .Status "resolved" }}resolved{{ else }}{{ .Labels.severity }}{{ end }}">
<div class="title alert-title">
[{{ .Status | toUpper }}] {{ .Labels.alertname }}
</div>
<table class="meta-table">
<tr>
<td class="meta-label">Severity:</td>
<td class="meta-val"><span style="text-transform: capitalize; font-weight: bold; color: {{ if eq .Labels.severity "critical" }}#d9534f{{ else if eq .Labels.severity "warning" }}#ffc107{{ else }}#28a745{{ end }}">{{ .Labels.severity }}</span></td>
</tr>
{{ if .Labels.namespace }}
<tr>
<td class="meta-label">Namespace:</td>
<td class="meta-val"><code>{{ .Labels.namespace }}</code></td>
</tr>
{{ end }}
{{ if .Labels.pod }}
<tr>
<td class="meta-label">Pod:</td>
<td class="meta-val"><code>{{ .Labels.pod }}</code></td>
</tr>
{{ end }}
{{ if .Labels.container }}
<tr>
<td class="meta-label">Container:</td>
<td class="meta-val"><code>{{ .Labels.container }}</code></td>
</tr>
{{ end }}
{{ if .Annotations.summary }}
<tr>
<td class="meta-label">Summary:</td>
<td class="meta-val">{{ .Annotations.summary }}</td>
</tr>
{{ end }}
</table>
{{ if .Annotations.description }}
<div class="description-box">
<strong>Description:</strong><br/>
{{ .Annotations.description }}
</div>
{{ end }}
<div style="margin-top: 15px;">
<a class="btn" href="{{ .GeneratorURL }}" target="_blank">View in Prometheus</a>
{{ if .Annotations.runbook_url }}
<a class="btn btn-runbook" href="{{ .Annotations.runbook_url }}" target="_blank">Runbook</a>
{{ end }}
</div>
</div>
{{ end }}
<div class="footer">
<p>Alertmanager: <a href="{{ .ExternalURL }}">View Active Alerts Console</a></p>
<p style="font-size: 10px;">Sent automatically by Canada EKS Monitoring System.</p>
</div>
</div>
</body>
</html>
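One detail of the template is worth checking before relying on the colors: the alert box gets the class resolved, or else the raw severity label, but the stylesheet only defines rules for resolved and warning. The Python sketch below mirrors that selection logic (it is an illustration, not part of the template) and shows which border color each combination ends up with.

```python
# Modifier classes that the stylesheet actually defines, with their border colors.
DEFINED_MODIFIERS = {"resolved": "#28a745", "warning": "#ffc107"}
# Border color of the base .alert-item rule, used when no modifier rule matches.
DEFAULT_BORDER = "#d9534f"


def alert_css_class(status, severity):
    """Mirror the template: 'resolved' for resolved alerts, else the severity label."""
    modifier = "resolved" if status == "resolved" else severity
    return f"alert-item {modifier}"


def border_color(status, severity):
    """Resolve the left border color the way the stylesheet would."""
    modifier = "resolved" if status == "resolved" else severity
    return DEFINED_MODIFIERS.get(modifier, DEFAULT_BORDER)


if __name__ == "__main__":
    for status, severity in [("firing", "critical"), ("firing", "warning"),
                             ("firing", "info"), ("resolved", "critical")]:
        print(status, severity, alert_css_class(status, severity), border_color(status, severity))
```

Critical alerts are red only because they fall through to the default rule, and any other severity, such as info, gets the same red border. Add a matching CSS rule if such alerts should look different.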
Step 4: Automated Deployment Script (deploy.ps1)
Automate the process of adding the charts repository, updating dependencies, and running the deployment using this PowerShell script:
# deploy.ps1
$namespace = "monitoring"
$releaseName = "kube-prometheus-stack"
$helmRepoName = "prometheus-community"
$helmRepoUrl = "https://prometheus-community.github.io/helm-charts"
Write-Host "=== Starting EKS Monitoring Deployment ===" -ForegroundColor Green
# 1. Verify kubectl / helm
if (!(Get-Command helm -ErrorAction SilentlyContinue)) { Write-Error "Helm is missing."; exit 1 }
if (!(Get-Command kubectl -ErrorAction SilentlyContinue)) { Write-Error "kubectl is missing."; exit 1 }
# 2. Namespace
kubectl create namespace $namespace --dry-run=client -o yaml | kubectl apply -f -
# 3. Helm Repository
helm repo add $helmRepoName $helmRepoUrl
helm repo update $helmRepoName # Refreshes only this repository, so other locally configured repos cannot break the run
# 4. Deploy
helm upgrade --install $releaseName "$helmRepoName/kube-prometheus-stack" `
    --namespace $namespace `
    --values values.yaml `
    --values values-secrets.yaml
if ($LASTEXITCODE -eq 0) {
    Write-Host "=== Deployment completed successfully! ===" -ForegroundColor Green
} else {
    Write-Error "=== Deployment failed! ==="
    exit 1
}
To run:
.\deploy.ps1
Conclusion
By grouping our Ingress routes behind a single AWS Application Load Balancer, we pay for one ALB instead of three. Keeping credentials in a separate values file keeps them out of Git, and a custom HTML template gives your on-call engineers alerts that are readable, color-coded, and actionable. Happy helming!