DR. ATABAK KH
Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient Cloud platforms.
Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator
Goal: page for user pain, not random metric spikes. Burn-rate alerts do that by measuring how fast you’re spending the error budget for your SLO.
We alert on two windows to catch both “oh-no right now” and “quietly getting worse.”
The threshold K you pick says: "if we sustained this burn for the whole alert window, we'd consume X% of the total budget." For a 28-day (672-hour) window, the mapping is simple:

\[
K \;=\; \frac{\text{budget fraction to spend in the short window}}{\text{window hours}/672}
\]
Examples (28-day window):
| Window | Share of total error budget spent | K (burn-rate threshold) |
|---|---|---|
| 1h | 2% | 13.44 |
| 2h | 2% | 6.72 |
| 6h | 5% | 5.60 |
| 12h | 5% | 2.80 |
| 24h | 10% | 2.80 |
A common pair for 28d is (2h, K ≈ 6.72) and (6h, K ≈ 5.6). (With a 30-day window, the same budget fractions give ≈ 7.2 and ≈ 6, respectively.)
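As a sanity check, here is the 2h / 2% row worked out (28 days = 672 hours):

\[
K \;=\; \frac{0.02}{2/672} \;=\; 0.02 \times 336 \;=\; 6.72
\]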
Assume:
- `errors_total` counts failed requests, i.e. `requests_total{status=~"5.."}`
- `requests_total` includes all (or just user-visible) requests

# Helper: error ratio over 2h and 6h
slo:error_ratio:2h = increase(errors_total[2h]) / increase(requests_total[2h])
slo:error_ratio:6h = increase(errors_total[6h]) / increase(requests_total[6h])
# Error budget fraction (constant): 1 - 0.995 = 0.005
# Burn rate = observed error ratio / allowed error ratio (the budget fraction)
slo:burn_rate:2h = slo:error_ratio:2h / 0.005
slo:burn_rate:6h = slo:error_ratio:6h / 0.005
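To read these numbers: a burn rate of 1 spends the budget exactly as fast as the 28-day window allows, so a sustained burn rate of 6.72 would exhaust the entire budget in roughly 4.2 days:

\[
\frac{28\ \text{days}}{6.72} \;\approx\; 4.2\ \text{days}
\]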
Alerting rules (use two windows, two severities):
groups:
  - name: slo-availability
    rules:
      - alert: SLOAvailabilityFastBurn
        expr: |
          (increase(errors_total[2h]) / increase(requests_total[2h]))
          > (6.72 * 0.005)
        for: 5m
        labels:
          severity: page
          slo: "availability-99.5-28d"
        annotations:
          summary: "SLO fast burn (2h) - spending error budget too fast"
          runbook_url: "https://your.site/runbooks/slo-availability"
          description: >
            Error ratio over 2h is above the fast-burn threshold for a 99.5% SLO.
            Check deploys, dependency status, and queue lag.
      - alert: SLOAvailabilitySlowBurn
        expr: |
          (increase(errors_total[6h]) / increase(requests_total[6h]))
          > (5.6 * 0.005)
        for: 15m
        labels:
          severity: ticket
          slo: "availability-99.5-28d"
        annotations:
          summary: "SLO slow burn (6h) - sustained error budget burn"
          runbook_url: "https://your.site/runbooks/slo-availability"
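For reference, with a 0.5% error budget those burn-rate thresholds translate into concrete error ratios:

\[
6.72 \times 0.005 = 0.0336 \;\;(3.36\%\ \text{over 2h}),
\qquad
5.6 \times 0.005 = 0.028 \;\;(2.8\%\ \text{over 6h})
\]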
You can also compute `slo:burn_rate:*` as recording rules and alert on those to simplify expressions and dashboards.
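A minimal sketch of such a group, reusing the metric names above (the group name is arbitrary; rules in a group are evaluated in order, so the burn-rate rules can reference the ratio rules recorded just before them):

groups:
  - name: slo-availability-recording
    rules:
      - record: slo:error_ratio:2h
        expr: increase(errors_total[2h]) / increase(requests_total[2h])
      - record: slo:error_ratio:6h
        expr: increase(errors_total[6h]) / increase(requests_total[6h])
      # Burn rate = observed error ratio / error budget fraction (0.005 for 99.5%)
      - record: slo:burn_rate:2h
        expr: slo:error_ratio:2h / 0.005
      - record: slo:burn_rate:6h
        expr: slo:error_ratio:6h / 0.005

The alert expressions then collapse to `slo:burn_rate:2h > 6.72` and `slo:burn_rate:6h > 5.6`, and `promtool check rules <file>` catches syntax mistakes before you load the file.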
Let the SLO be p95 latency < 250 ms, i.e. 95% of requests complete within 250 ms. We need a good-events vs. all-events ratio:
We use the latency histogram `http_request_duration_seconds_bucket{le=...}`; the `le="0.25"` bucket (≤ 250 ms) counts "good" events.

slo:good_ratio:2h = (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[2h]))
  /
  sum(rate(http_request_duration_seconds_count[2h]))
)
slo:error_ratio_latency:2h = 1 - slo:good_ratio:2h

slo:error_ratio_latency:6h = 1 - (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[6h]))
  /
  sum(rate(http_request_duration_seconds_count[6h]))
)

# Burn rate (same idea; error budget = 1 - SLO target)
# If the target is "95% under 250ms", the budget fraction is 0.05
slo:burn_rate_latency:2h = slo:error_ratio_latency:2h / 0.05
slo:burn_rate_latency:6h = slo:error_ratio_latency:6h / 0.05
Alerting:
- alert: SLOLatencyFastBurn
  expr: slo:burn_rate_latency:2h > 6.72
  for: 5m
  labels: { severity: page, slo: "latency-p95-250ms-28d" }
  annotations:
    summary: "Latency SLO fast burn (2h)"
    description: "p95 tail spending error budget too fast."
- alert: SLOLatencySlowBurn
  expr: slo:burn_rate_latency:6h > 5.6
  for: 15m
  labels: { severity: ticket, slo: "latency-p95-250ms-28d" }
A few operational guardrails:

- Route `severity=ticket` alerts to business hours; keep `severity=page` 24/7.
- Gate alerts on traffic, e.g. `sum(rate(requests_total[2h])) > X`, to avoid flapping at low QPS (for services that scale down to near zero, `min-instances` can help); see the sketch below.
- Every burn alert must link a 1-page runbook with exact checks. Aim to resolve or route in 10 minutes; burn-rate alerts should be that actionable.
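For example, a sketch of the fast-burn alert with such a guard attached; the sums make the `and on()` join trivial, and the `> 1` (roughly 1 request per second) is only a placeholder for X that you should tune per service:

- alert: SLOAvailabilityFastBurn
  expr: |
    (
      sum(increase(errors_total[2h])) / sum(increase(requests_total[2h]))
      > (6.72 * 0.005)
    )
    and on()
    # Guard: suppress the alert when traffic is below ~1 req/s (placeholder for X).
    (sum(rate(requests_total[2h])) > 1)
  for: 5m
  labels:
    severity: page
    slo: "availability-99.5-28d"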
Outcome: alert volume ↓, actionability ↑, weekend peace restored.