Goal: page for user pain, not random metric spikes. Burn-rate alerts do that by measuring how fast you’re spending the error budget for your SLO.
We alert on two windows to catch both “oh-no right now” and “quietly getting worse.”
The threshold K you pick says: "if we sustained this burn rate for the whole window, we'd consume X% of the total budget." For a 28-day window (672 hours), the mapping is simple:

\[ K = \frac{\text{budget fraction to spend in the window}}{\text{window hours}/672} \]
Examples (28-day window):
| Window | Share of total error budget spent | K (burn-rate threshold) |
|---|---|---|
| 1h | 2% | 13.44 |
| 2h | 2% | 6.72 |
| 6h | 5% | 5.60 |
| 12h | 5% | 2.80 |
| 24h | 10% | 2.80 |
A common pair for 28d is (2h, K ≈ 6.72) and (6h, K ≈ 5.6). (If you use a 30-day window, these become ≈ 7.2 and ≈ 6, respectively; the often-quoted 14.4 is the 1-hour/2% threshold for a 30-day window.)
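As a quick sanity check, plug the 2h and 6h rows into the formula above, then redo the 2h case for a 30-day (720-hour) window:

\[ K_{2h} = \frac{0.02}{2/672} = 6.72, \qquad K_{6h} = \frac{0.05}{6/672} = 5.6, \qquad K_{2h}^{30d} = \frac{0.02}{2/720} = 7.2 \]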
Assume:

- `errors_total` = `requests_total{status=~"5.."}`
- `requests_total` includes all (or just user-visible) requests

# Helper: error ratio over 2h and 6h
slo:error_ratio:2h = increase(errors_total[2h]) / increase(requests_total[2h])
slo:error_ratio:6h = increase(errors_total[6h]) / increase(requests_total[6h])
# Error budget fraction (constant): 1 - 0.995 = 0.005
# Burn rate = observed error ratio / error-budget ratio
slo:burn_rate:2h = slo:error_ratio:2h / 0.005
slo:burn_rate:6h = slo:error_ratio:6h / 0.005
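These helpers are pseudocode; as actual Prometheus recording rules they could look like the sketch below (assuming the metric names above; the `sum(...)` aggregation is an assumption, adjust it to your label scheme). Rules in a group evaluate sequentially, so the burn-rate rules can build on the ratio rules above them.

```yaml
groups:
  - name: slo-availability-recording
    rules:
      # Error ratios over the last 2h and 6h, aggregated across all series
      - record: slo:error_ratio:2h
        expr: sum(increase(errors_total[2h])) / sum(increase(requests_total[2h]))
      - record: slo:error_ratio:6h
        expr: sum(increase(errors_total[6h])) / sum(increase(requests_total[6h]))
      # Burn rate = observed error ratio / error-budget ratio (0.005 for a 99.5% SLO)
      - record: slo:burn_rate:2h
        expr: slo:error_ratio:2h / 0.005
      - record: slo:burn_rate:6h
        expr: slo:error_ratio:6h / 0.005
```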
Alerting rules (use two windows, two severities):
groups:
  - name: slo-availability
    rules:
      - alert: SLOAvailabilityFastBurn
        expr: |
          (increase(errors_total[2h]) / increase(requests_total[2h]))
          > (6.72 * 0.005)  # burn-rate threshold * budget fraction = 0.0336
        for: 5m
        labels:
          severity: page
          slo: "availability-99.5-28d"
        annotations:
          summary: "SLO fast burn (2h) - spending error budget too fast"
          runbook_url: "https://your.site/runbooks/slo-availability"
          description: >
            Error ratio over 2h is above the fast-burn threshold for a 99.5% SLO.
            Check deploys, dependency status, and queue lag.
      - alert: SLOAvailabilitySlowBurn
        expr: |
          (increase(errors_total[6h]) / increase(requests_total[6h]))
          > (5.6 * 0.005)  # burn-rate threshold * budget fraction = 0.028
        for: 15m
        labels:
          severity: ticket
          slo: "availability-99.5-28d"
        annotations:
          summary: "SLO slow burn (6h) - sustained error budget burn"
          runbook_url: "https://your.site/runbooks/slo-availability"
You can also compute `slo:burn_rate:*` as recording rules (as sketched above) and alert on those to simplify expressions and dashboards.
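With recording rules in place, the availability alert collapses to a one-liner, e.g. (a sketch reusing the `slo:burn_rate:2h` rule from the earlier block):

```yaml
- alert: SLOAvailabilityFastBurn
  expr: slo:burn_rate:2h > 6.72  # burn rate is already normalized to the error budget
  for: 5m
  labels:
    severity: page
    slo: "availability-99.5-28d"
  annotations:
    summary: "SLO fast burn (2h) - spending error budget too fast"
    runbook_url: "https://your.site/runbooks/slo-availability"
```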
Let the SLO be p95 < 250ms. We need a good-events vs all-events ratio:
Use the histogram buckets `http_request_duration_seconds_bucket{le=...}` with `le="0.25"` (≤ 250ms):

slo:good_ratio:2h = (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[2h]))
  /
  sum(rate(http_request_duration_seconds_count[2h]))
)
# Latency error ratios (named *_latency to avoid clashing with the availability ratios above)
slo:error_ratio_latency:2h = 1 - slo:good_ratio:2h

slo:error_ratio_latency:6h = 1 - (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[6h]))
  /
  sum(rate(http_request_duration_seconds_count[6h]))
)

# Burn rate (same idea; error budget = 1 - SLO target)
# If the target is "95% under 250ms", the error budget is 0.05
slo:burn_rate_latency:2h = slo:error_ratio_latency:2h / 0.05
slo:burn_rate_latency:6h = slo:error_ratio_latency:6h / 0.05
Alerting:
- alert: SLOLatencyFastBurn
  expr: slo:burn_rate_latency:2h > 6.72
  for: 5m
  labels: { severity: page, slo: "latency-p95-250ms-28d" }
  annotations:
    summary: "Latency SLO fast burn (2h)"
    description: "p95 tail spending error budget too fast."
- alert: SLOLatencySlowBurn
  expr: slo:burn_rate_latency:6h > 5.6
  for: 15m
  labels: { severity: ticket, slo: "latency-p95-250ms-28d" }
Operational tips:

- Route `severity=ticket` alerts to business hours; keep `severity=page` 24/7.
- Require a minimum traffic level, e.g. `sum(rate(requests_total[2h])) > X`, to avoid flapping at low QPS (see the sketch below); for scale-to-zero services, `min-instances` can help.
- Every burn alert must link to a 1-page runbook with the exact checks to run.
Aim to resolve or route in 10 minutes. Burn-rate alerts should be that actionable.
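A minimal sketch of that low-QPS gate, assuming the `slo:burn_rate:2h` recording rule from earlier and an arbitrary placeholder floor of 1 request/second:

```yaml
- alert: SLOAvailabilityFastBurn
  expr: |
    slo:burn_rate:2h > 6.72
    and
    sum(rate(requests_total[2h])) > 1  # placeholder: require non-trivial traffic before paging
  for: 5m
  labels:
    severity: page
    slo: "availability-99.5-28d"
```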
Outcome: alert volume ↓, actionability ↑, weekend peace restored.