DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Goal: page for user pain, not random metric spikes. Burn-rate alerts do that by measuring how fast you’re spending the error budget for your SLO.


1) Concepts in 60 seconds

  • SLO window (T): e.g., 28 days -> 672 hours.
  • Target (S): e.g., 99.5% availability -> error budget (EB) = 1 − S = 0.005 of all requests in T.
  • Burn rate (BR): how fast you’re spending EB, i.e., the observed error ratio over a window divided by the error budget fraction: BR = (errors / requests) / EB. If BR = 1 is sustained for the whole window T, you exhaust the budget exactly at the end of T; if BR = 10, you’re burning it 10× too fast (the budget is gone in T/10).
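
For example, with S = 99.5% (EB = 0.005): if 1% of requests fail, BR = 0.01 / 0.005 = 2, and at that pace the 28-day budget is gone in 14 days.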

We alert on two windows to catch both “oh-no right now” and “quietly getting worse.”


2) Pick two windows (28-day SLO)

  • Fast burn: 2h over 28d -> page on-call now
  • Slow burn: 6h over 28d -> ticket / notify during business hours

The threshold K you pick encodes: “if we sustain this burn rate for the whole alert window, we consume X% of the total error budget.” For a 28-day (672-hour) SLO window the mapping is simple: K = (budget fraction to spend in the alert window) / (window hours / 672).
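
For instance, allowing 2% of the budget to be spent in a 2-hour window gives K = 0.02 / (2 / 672) = 6.72, and the 1-hour / 2% row below works out to 0.02 / (1 / 672) = 13.44.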

Examples (28-day window):

Window   Spend of total EB   K (burn-rate threshold)
1h       2%                  13.44
2h       2%                  6.72
6h       5%                  5.60
12h      5%                  2.80
24h      10%                 2.80

A common pair for 28d is (2h, K~6.72) and (6h, K~5.6). (With a 30-day window the same pair becomes ~7.2 and ~6.0; the well-known 14.4 threshold from the Google SRE Workbook corresponds to a 1h window over 30 days.)


3) Availability SLO - Prometheus examples

Assume:

  • requests_total{status=~"5.."} = errors_total
  • requests_total includes all (or just user-visible) requests
  • SLO: 99.5% availability over 28 days
# Helper: error ratio over 2h and 6h (recording-rule shorthand)
slo:error_ratio:2h = increase(errors_total[2h]) / increase(requests_total[2h])
slo:error_ratio:6h = increase(errors_total[6h]) / increase(requests_total[6h])

# Error budget fraction (constant): 1 - 0.995 = 0.005
# Burn rate = observed error ratio / error budget fraction
slo:burn_rate:2h = slo:error_ratio:2h / 0.005
slo:burn_rate:6h = slo:error_ratio:6h / 0.005

Alerting rules (use two windows, two severities):

groups:
- name: slo-availability
  rules:
  - alert: SLOAvailabilityFastBurn
    expr: |
      (increase(errors_total[2h]) / increase(requests_total[2h]))
      > 6.72 * 0.005  # burn rate 6.72 on a 0.5% budget = error ratio above 3.36%
    for: 5m
    labels:
      severity: page
      slo: "availability-99.5-28d"
    annotations:
      summary: "SLO fast burn (2h) - spending error budget too fast"
      runbook_url: "https://your.site/runbooks/slo-availability"
      description: >
        Error ratio over 2h is above fast-burn threshold for a 99.5% SLO.
        Check deploys, dependency status, and queue lag.

  - alert: SLOAvailabilitySlowBurn
    expr: |
      (increase(errors_total[6h]) / increase(requests_total[6h]))
      > 5.6 * 0.005  # burn rate 5.6 on a 0.5% budget = error ratio above 2.8%
    for: 15m
    labels:
      severity: ticket
      slo: "availability-99.5-28d"
    annotations:
      summary: "SLO slow burn (6h) - sustained error budget burn"
      runbook_url: "https://your.site/runbooks/slo-availability"

You can also compute slo:burn_rate:* as recording rules and alert on those to simplify expressions and dashboards.
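
A minimal sketch of those recording rules, assuming the same errors_total / requests_total metrics and the 99.5%/28d target as above:

groups:
- name: slo-availability-recording
  rules:
  - record: slo:error_ratio:2h
    expr: increase(errors_total[2h]) / increase(requests_total[2h])
  - record: slo:error_ratio:6h
    expr: increase(errors_total[6h]) / increase(requests_total[6h])
  # Burn rate = observed error ratio / error budget fraction (1 - 0.995 = 0.005)
  - record: slo:burn_rate:2h
    expr: slo:error_ratio:2h / 0.005
  - record: slo:burn_rate:6h
    expr: slo:error_ratio:6h / 0.005

The alert expressions then collapse to slo:burn_rate:2h > 6.72 and slo:burn_rate:6h > 5.6.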


4) Latency SLO (p95) - Prometheus examples

Let the SLO be p95 < 250ms. We need a good-events vs all-events ratio:

  • Suppose histogram metrics: http_request_duration_seconds_bucket{le=...}
  • “Good” means le="0.25" (≤ 250ms); this requires a real bucket boundary at 0.25s in your histogram.
  • Error ratio for latency SLO = bad / total = 1 − good/total
slo:good_ratio_latency:2h = (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[2h]))
/
  sum(rate(http_request_duration_seconds_count[2h]))
)

slo:error_ratio_latency:2h = 1 - slo:good_ratio_latency:2h
slo:error_ratio_latency:6h = 1 - (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[6h]))
/
  sum(rate(http_request_duration_seconds_count[6h]))
)

# Burn rate (same idea): EB = 1 - SLO target
# Target is "95% of requests under 250ms", so EB = 0.05
slo:burn_rate_latency:2h = slo:error_ratio_latency:2h / 0.05
slo:burn_rate_latency:6h = slo:error_ratio_latency:6h / 0.05
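
Because the 0.25 boundary must exist as an actual bucket, a quick instant query (assuming the histogram name above) is worth running before you build rules on it:

# Should return a non-zero count; an empty result means 0.25 is not a bucket boundary
count(http_request_duration_seconds_bucket{le="0.25"})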

Alerting:

- alert: SLOLatencyFastBurn
  expr: slo:burn_rate_latency:2h > 6.72
  for: 5m
  labels: { severity: page, slo: "latency-p95-250ms-28d" }
  annotations:
    summary: "Latency SLO fast burn (2h)"
    description: "p95 tail spending error budget too fast."

- alert: SLOLatencySlowBurn
  expr: slo:burn_rate_latency:6h > 5.6
  for: 15m
  labels: { severity: ticket, slo: "latency-p95-250ms-28d" }

5) Tuning & good practices

  • Silence slow-burn at night and on weekends: route severity=ticket alerts to business hours only and keep severity=page 24/7 (see the Alertmanager sketch after this list).
  • Minimum traffic guard: add sum(rate(requests_total[2h])) > X to avoid flapping at low QPS (sketch after this list).
  • Per-endpoint vs service-level: start service-level, then carve out top endpoints that dominate the burn.
  • Warm-up suppression: ignore the first N minutes after deploy if you know cold starts cause noisy tails (Cloud Run min-instances can help).
  • Cost safety: pair burn alerts with budget spikes (FinOps) so you don’t scale into runaway cost during incidents.
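
One way to wire the traffic guard into the fast-burn alert; a sketch assuming the metrics from section 3, with the 1 req/s floor as a placeholder you would tune per service:

- alert: SLOAvailabilityFastBurn
  expr: |
    (
      sum(increase(errors_total[2h])) / sum(increase(requests_total[2h]))
      > 6.72 * 0.005
    )
    and
    (sum(rate(requests_total[2h])) > 1)  # traffic floor: skip below ~1 req/s
  for: 5m
  labels:
    severity: page

And a minimal Alertmanager routing sketch for the business-hours split (receiver names are placeholders; needs an Alertmanager version with time_intervals support):

route:
  receiver: oncall-pager
  routes:
  - matchers: [ 'severity="ticket"' ]
    receiver: ticket-queue
    active_time_intervals: [ business-hours ]
time_intervals:
- name: business-hours
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '09:00'
      end_time: '18:00'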

6) Runbook - keep it one page

Every burn alert must link to a one-page runbook with exact checks:

  1. What changed? recent deploy? config? autoscaling?
  2. Dependencies: DB/redis/queue health; error signatures; saturation.
  3. Queues: lag vs throughput; enable backoff / DLQ if retry storms.
  4. Roll back or scale out: if deploy-related, roll back; else temporarily raise replicas / min-instances.
  5. If cost spikes: check egress/partitions, BigQuery scans, unbounded logs.
  6. Close-out: open a ticket if slow-burn fired; attach graphs + timeline.

Aim to resolve or route in 10 minutes. Burn-rate alerts should be that actionable.


7) Why this reduces noise (and saves weekends)

  • Two windows catch both explosions and smolders while ignoring momentary blips.
  • Math matches the budget: alerts fire only when you’re consuming a non-trivial fraction of the month’s error budget.
  • Clear actions: the runbook ties alerts to concrete fixes (deploy/rollback, dependency, queue, scaling, cost guardrails).

Outcome: alert volume ↓, actionability ↑, weekend peace restored.

This is a personal blog. The views, thoughts, and opinions expressed here are my own and do not represent, reflect, or constitute the views, policies, or positions of any employer, university, client, or organization I am associated with or have been associated with.

© Copyright 2017-2025