DR. ATABAK KH

Cloud Platform Modernization Architect specializing in transforming legacy systems into reliable, observable, and cost-efficient Cloud platforms.

Certified: Google Professional Cloud Architect, AWS Solutions Architect, MapR Cluster Administrator

Goal: page for user pain, not random metric spikes. Burn-rate alerts do that by measuring how fast you’re spending the error budget for your SLO.


1) Concepts in 60 seconds

  • SLO window (T): e.g., 28 days -> 672 hours.
  • Target (S): e.g., 99.5% availability -> error budget (EB) = 1 − S = 0.005 of all requests in T.
  • Burn rate (BR): how fast you’re spending EB. Formally:

    \[
    \text{BR} \;=\; \frac{\text{observed error ratio over the window}}{\text{error budget fraction}} \;=\; \frac{\text{errors}/\text{requests}}{1 - S}
    \]

    If BR = 1, you’ll exhaust the budget exactly by the end of T (assuming steady traffic). If BR = 10, you’re burning it 10× too fast. A worked example follows.
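
Worked example, using the 99.5% / 28-day numbers from above: EB = 1 − 0.995 = 0.005. If the error ratio over the last 2 hours is 0.0336 (3.36% of requests failing), then

\[
\text{BR} = \frac{0.0336}{0.005} = 6.72
\]

Sustained for the full 28 days, that pace would overspend the budget 6.72×; sustained for just those 2 hours, it spends 6.72 × (2/672) ≈ 2% of the monthly budget, which is exactly the fast-burn threshold used below.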

We alert on two windows to catch both “oh-no right now” and “quietly getting worse.”


2) Pick two windows (28-day SLO)

  • Fast burn: 2h over 28d -> page on-call now
  • Slow burn: 6h over 28d -> ticket, handled during business hours

The threshold K you pick encodes “if this burn rate is sustained for the whole alert window, we consume X% of the total budget in that window.” For a 28-day window (672 hours), the mapping is simple:

\[
K \;=\; \frac{\text{budget fraction to spend in the alert window}}{\text{window hours}/672}
\]
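
For example, the 2-hour / 2% row of the table below works out to

\[
K = \frac{0.02}{2/672} = 0.02 \times 336 = 6.72
\]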

Examples (28-day window):

Window   Spend of total EB   K (burn-rate threshold)
1h       2%                  13.44
2h       2%                  6.72
6h       5%                  5.60
12h      5%                  2.80
24h      10%                 2.80

A common pair for 28d is (2h, K ≈ 6.72) and (6h, K ≈ 5.6). (The often-quoted 14.4 / 6 pair comes from a 30-day window with a 1h/2% fast-burn alert and a 6h/5% slow-burn alert; with a 30-day window, the 2h/2% alert works out to K = 7.2 and the 6h/5% alert to K = 6.)


3) Availability SLO - Prometheus examples

Assume:

  • requests_total{status=~"5.."} = errors_total
  • requests_total includes all (or just user-visible) requests
  • SLO: 99.5% availability over 28 days

# Helper: error ratio over 2h and 6h (shorthand; see the recording rules further down)
slo:error_ratio:2h = increase(errors_total[2h]) / increase(requests_total[2h])
slo:error_ratio:6h = increase(errors_total[6h]) / increase(requests_total[6h])

# Error budget fraction (constant): 1 - 0.995 = 0.005
# Burn rate = observed error ratio / error budget fraction
# (BR = 1 means you are on pace to spend exactly the whole budget over 28 days)
slo:burn_rate:2h = slo:error_ratio:2h / 0.005
slo:burn_rate:6h = slo:error_ratio:6h / 0.005

Alerting rules (use two windows, two severities):

groups:
- name: slo-availability
  rules:
  - alert: SLOAvailabilityFastBurn
    expr: |
      (increase(errors_total[2h]) / increase(requests_total[2h]))
      > 6.72 * 0.005   # burn rate > 6.72, i.e. 2h error ratio above 3.36%
    for: 5m
    labels:
      severity: page
      slo: "availability-99.5-28d"
    annotations:
      summary: "SLO fast burn (2h) - spending error budget too fast"
      runbook_url: "https://your.site/runbooks/slo-availability"
      description: >
        Error ratio over 2h is above fast-burn threshold for a 99.5% SLO.
        Check deploys, dependency status, and queue lag.

  - alert: SLOAvailabilitySlowBurn
    expr: |
      (increase(errors_total[6h]) / increase(requests_total[6h]))
      > 5.6 * 0.005    # burn rate > 5.6, i.e. 6h error ratio above 2.8%
    for: 15m
    labels:
      severity: ticket
      slo: "availability-99.5-28d"
    annotations:
      summary: "SLO slow burn (6h) - sustained error budget burn"
      runbook_url: "https://your.site/runbooks/slo-availability"

You can also compute slo:burn_rate:* as recording rules and alert on those to simplify expressions and dashboards.
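
A minimal sketch of those recording rules, assuming the same errors_total / requests_total metrics and the 99.5% / 28d target (the sum() drops per-series labels, so adjust if you record per service or endpoint):

groups:
- name: slo-availability-recording
  rules:
  # Error ratio over the two alert windows
  - record: slo:error_ratio:2h
    expr: sum(increase(errors_total[2h])) / sum(increase(requests_total[2h]))
  - record: slo:error_ratio:6h
    expr: sum(increase(errors_total[6h])) / sum(increase(requests_total[6h]))
  # Burn rate = error ratio / error budget fraction (0.005 for 99.5%)
  - record: slo:burn_rate:2h
    expr: slo:error_ratio:2h / 0.005
  - record: slo:burn_rate:6h
    expr: slo:error_ratio:6h / 0.005

The alert expressions above then reduce to slo:burn_rate:2h > 6.72 and slo:burn_rate:6h > 5.6.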


4) Latency SLO (p95) - Prometheus examples

Let the SLO be p95 < 250ms. We need a good-events vs all-events ratio:

  • Suppose histogram metrics: http_request_duration_seconds_bucket{le=...}
  • “Good” means le="0.25" (≤ 250ms).
  • Error ratio for latency SLO = bad / total = 1 − good/total
slo:good_ratio_latency:2h = (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[2h]))
  /
  sum(rate(http_request_duration_seconds_count[2h]))
)

# "Bad" ratio = 1 - good ratio. The _latency suffix keeps these from
# colliding with the availability series in section 3.
slo:error_ratio_latency:2h = 1 - slo:good_ratio_latency:2h
slo:error_ratio_latency:6h = 1 - (
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[6h]))
  /
  sum(rate(http_request_duration_seconds_count[6h]))
)

# Burn rate (same idea; EB = 1 - SLO target)
# If the target is "95% of requests under 250ms", then EB = 0.05
slo:burn_rate_latency:2h = slo:error_ratio_latency:2h / 0.05
slo:burn_rate_latency:6h = slo:error_ratio_latency:6h / 0.05
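
The shorthand above isn’t rule-file syntax on its own; to make slo:burn_rate_latency:* available to the alerts below, register them as recording rules. A minimal sketch, assuming the histogram actually has a 0.25s bucket:

groups:
- name: slo-latency-recording
  rules:
  - record: slo:burn_rate_latency:2h
    expr: |
      (1 - (
        sum(rate(http_request_duration_seconds_bucket{le="0.25"}[2h]))
        /
        sum(rate(http_request_duration_seconds_count[2h]))
      )) / 0.05
  - record: slo:burn_rate_latency:6h
    expr: |
      (1 - (
        sum(rate(http_request_duration_seconds_bucket{le="0.25"}[6h]))
        /
        sum(rate(http_request_duration_seconds_count[6h]))
      )) / 0.05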

Alerting:

- alert: SLOLatencyFastBurn
  expr: slo:burn_rate_latency:2h > 6.72
  for: 5m
  labels: { severity: page, slo: "latency-p95-250ms-28d" }
  annotations:
    summary: "Latency SLO fast burn (2h)"
    description: "p95 tail spending error budget too fast."

- alert: SLOLatencySlowBurn
  expr: slo:burn_rate_latency:6h > 5.6
  for: 15m
  labels: { severity: ticket, slo: "latency-p95-250ms-28d" }

5) Tuning & good practices

  • Silence slow burn on nights/weekends: route severity=ticket to business hours only; keep severity=page 24/7.
  • Minimum traffic guard: add sum(rate(requests_total[2h])) > X to avoid flapping at low QPS (see the sketch after this list).
  • Per-endpoint vs service-level: start service-level, then carve out top endpoints that dominate the burn.
  • Warm-up suppression: ignore the first N minutes after deploy if you know cold starts cause noisy tails (Cloud Run min-instances can help).
  • Cost safety: pair burn-rate alerts with budget/spend-spike alerts (FinOps) so you don’t scale into runaway cost during incidents.
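
A sketch of the first two bullets: the fast-burn alert from section 3 with a traffic guard added, plus time-based routing. The 1 req/s floor, receiver names, and business-hours definition are placeholders to tune; the Alertmanager part assumes a version that supports active_time_intervals.

# Fast-burn alert with a minimum-traffic guard. Both sides aggregate to a
# single label-less series, so the "and" matches cleanly.
- alert: SLOAvailabilityFastBurn
  expr: |
    (
      sum(increase(errors_total[2h])) / sum(increase(requests_total[2h]))
      > 6.72 * 0.005
    )
    and
    sum(rate(requests_total[2h])) > 1
  for: 5m
  labels:
    severity: page

# Alertmanager: page 24/7, tickets only during business hours
# (receivers omitted for brevity)
route:
  receiver: default
  routes:
  - matchers: [ 'severity="page"' ]
    receiver: oncall-pager
  - matchers: [ 'severity="ticket"' ]
    receiver: ticket-queue
    active_time_intervals: [ business-hours ]

time_intervals:
- name: business-hours
  time_intervals:
  - weekdays: ['monday:friday']
    times:
    - start_time: '09:00'
      end_time: '18:00'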

6) Runbook - keep it one page

Every burn-rate alert must link to a one-page runbook with exact checks:

  1. What changed? Recent deploy, config change, autoscaling event?
  2. Dependencies: DB/Redis/queue health; error signatures; saturation.
  3. Queues: lag vs throughput; enable backoff / DLQ if retry storms.
  4. Roll back or scale out: if deploy-related, roll back; else temporarily raise replicas / min-instances.
  5. If cost spikes: check egress/partitions, BigQuery scans, unbounded logs.
  6. Close-out: open a ticket if slow-burn fired; attach graphs + timeline.

Aim to resolve or route in 10 minutes. Burn-rate alerts should be that actionable.


7) Why this reduces noise (and saves weekends)

  • Two windows catch both explosions and smolders while ignoring momentary blips.
  • Math matches the budget: alerts fire only when you’re consuming a non-trivial fraction of the month’s error budget.
  • Clear actions: the runbook ties alerts to concrete fixes (deploy/rollback, dependency, queue, scaling, cost guardrails).

Outcome: alert volume ↓, actionability ↑, weekend peace restored.
