SLO-Based Alerting and Burn Rates

Traditional alerting fires when error rate crosses a static threshold, like 'alert if errors > 1%'. What's wrong with that approach, and how would you set up SLO-based alerting instead?

Sample answer

Static threshold alerts have two problems. First, they fire too often on things that don't matter. A brief 2% error spike for 30 seconds barely dents your monthly SLO, but a static alert pages someone at 3am for it. Second, they miss slow burns. If your error rate sits at 0.5% for days, that looks fine against a 1% threshold, but against a 99.9% SLO it is consuming error budget five times faster than sustainable.

SLO-based alerting fixes both problems by alerting on burn rate: how fast you're consuming your error budget relative to your SLO window. A burn rate of 1x spends exactly the whole budget over the window. The approach uses multi-window, multi-burn-rate alerts. You set up two kinds:

1. A fast-burn alert for acute incidents: "At this rate, you'll exhaust your entire 30-day error budget in about 20 hours." This is a 36x burn rate (burning 36 times faster than sustainable; 720 hours / 36 = 20 hours). You check this over a short window (5 minutes) and a longer confirmation window (1 hour). This pages someone immediately.
2. A slow-burn alert for gradual degradation: "At this rate, you'll exhaust your budget in 3 days." This is a 10x burn rate (720 hours / 10 = 72 hours), checked over a 30-minute short window and a 6-hour long window. This creates a ticket, not a page.

The two-window check prevents false positives. The long window confirms the problem is sustained and not just a blip, while the short window confirms it is still happening right now. Both must be true before the alert fires. This follows the multi-window, multi-burn-rate approach described in Google's SRE Workbook, and it works well in practice because you only get paged for things that actually threaten your SLO.
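The arithmetic behind a burn rate is short enough to sketch. Assuming a 30-day (720-hour) window and a 99.9% SLO, the helper functions below convert a burn-rate multiplier into the error-rate threshold to alert on and the time until the budget is gone. The function names and defaults are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and defaults are illustrative assumptions.

# Error-rate threshold for a given burn rate: burn_rate * (1 - SLO)
burn_threshold() {
  local burn_rate=$1 slo=$2
  awk -v b="$burn_rate" -v s="$slo" 'BEGIN { printf "%.4f\n", b * (1 - s) }'
}

# Hours until the budget is exhausted: window_hours / burn_rate
hours_to_exhaustion() {
  local burn_rate=$1 window_hours=${2:-720}
  awk -v b="$burn_rate" -v w="$window_hours" 'BEGIN { printf "%.1f\n", w / b }'
}

burn_threshold 36 0.999     # 0.0360 (alert when the error ratio exceeds 3.6%)
hours_to_exhaustion 36      # 20.0
burn_threshold 10 0.999     # 0.0100
hours_to_exhaustion 10      # 72.0 (three days)
```

Working the numbers this way is a quick sanity check that the human-readable message on an alert matches the multiplier actually used in the rule.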

Why this matters

This is a strong mid-to-senior question because it tests whether the candidate has actually operated SLO-based systems or just read about them. The multi-window burn rate concept trips up a lot of people. Listen for whether they understand why two windows are needed (short window alone gives false positives, long window alone alerts too late). Candidates who have done this in practice will mention specific burn rate numbers and the tradeoff between alert sensitivity and noise.

Code examples

Prometheus alerting rules for multi-window burn rate

yaml

groups:
  - name: slo-burn-rate-alerts
    rules:
      # Fast burn: 36x burn rate -- pages immediately
      # Will exhaust 30-day budget in ~20 hours (720h / 36)
      - alert: HighErrorBudgetBurn_Critical
        expr: |
          (
            1 - (sum(rate(http_requests_total{service="checkout", status!~"5.."}[5m]))
            / sum(rate(http_requests_total{service="checkout"}[5m])))
          ) > (36 * 0.001)
          and
          (
            1 - (sum(rate(http_requests_total{service="checkout", status!~"5.."}[1h]))
            / sum(rate(http_requests_total{service="checkout"}[1h])))
          ) > (36 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Checkout error budget burning fast (36x rate)"
          description: "At current rate, 30-day error budget exhausts in ~20 hours."
          runbook: "https://runbooks.internal/slo-budget-burn"

      # Slow burn: 10x burn rate -- creates a ticket
      # Will exhaust 30-day budget in ~3 days
      - alert: HighErrorBudgetBurn_Warning
        expr: |
          (
            1 - (sum(rate(http_requests_total{service="checkout", status!~"5.."}[30m]))
            / sum(rate(http_requests_total{service="checkout"}[30m])))
          ) > (10 * 0.001)
          and
          (
            1 - (sum(rate(http_requests_total{service="checkout", status!~"5.."}[6h]))
            / sum(rate(http_requests_total{service="checkout"}[6h])))
          ) > (10 * 0.001)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Checkout error budget burn elevated (10x rate)"
          description: "At current rate, 30-day error budget exhausts in ~3 days."
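The two rules above differ only in their windows and their multiplier. As a rough sketch, the expression can be generated from those parameters, which makes the shared pattern explicit and helps avoid copy-paste drift between the four ratio queries. The function names here are hypothetical; the metric and label names are taken from the rules above.

```shell
#!/usr/bin/env bash
# Hypothetical generator; metric and label names follow the rules above.

# Error ratio over one window: 1 - good / total
error_ratio() {
  local service=$1 window=$2
  printf '1 - (sum(rate(http_requests_total{service="%s", status!~"5.."}[%s])) / sum(rate(http_requests_total{service="%s"}[%s])))' \
    "$service" "$window" "$service" "$window"
}

# Full alert condition: both windows must exceed burn_rate * budget
burn_rate_expr() {
  local service=$1 short=$2 long=$3 burn_rate=$4 budget=$5
  printf '(%s) > (%s * %s) and (%s) > (%s * %s)\n' \
    "$(error_ratio "$service" "$short")" "$burn_rate" "$budget" \
    "$(error_ratio "$service" "$long")" "$burn_rate" "$budget"
}

burn_rate_expr checkout 5m 1h 36 0.001    # fast-burn condition
burn_rate_expr checkout 30m 6h 10 0.001   # slow-burn condition
```

The printed strings can be pasted into the `expr` field of a rule. A templating tool would do the same job; the point is only that windows, multiplier, and budget are the three knobs.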

Calculate current burn rate from Prometheus

bash

# Query current 1-hour error rate
ERROR_RATE=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=1 - (sum(rate(http_requests_total{service="checkout", status!~"5.."}[1h])) / sum(rate(http_requests_total{service="checkout"}[1h])))' \
  | jq -r '.data.result[0].value[1]')

# Bail out if there is no usable sample (empty result, or NaN when there was no traffic)
case "$ERROR_RATE" in
  ""|null|NaN) echo "No error-rate data for the last hour" >&2; exit 1 ;;
esac

# Calculate burn rate relative to SLO budget
SLO_BUDGET=0.001  # 1 - 0.999
BURN_RATE=$(echo "scale=2; $ERROR_RATE / $SLO_BUDGET" | bc)

echo "Current error rate: $ERROR_RATE"
echo "Burn rate: ${BURN_RATE}x"
# 720 = hours in the 30-day window; skip the division when nothing is burning
if [ "$(echo "$BURN_RATE > 0" | bc)" -eq 1 ]; then
  echo "Time to budget exhaustion: $(echo "scale=1; 720 / $BURN_RATE" | bc) hours"
fi
# Example output:
# Current error rate: 0.0035
# Burn rate: 3.50x
# Time to budget exhaustion: 205.7 hours

Common mistakes to avoid

Setting up SLO alerts but still keeping the old static threshold alerts running alongside them. This leads to alert fatigue because you get paged twice for the same incident from different systems.
Using only a single window for burn rate alerts. A short window alone gives false positives on brief spikes. A long window alone means you don't catch fast-moving incidents until significant budget is already gone.
Forgetting that burn rate alerts need enough request volume to be meaningful. For low-traffic services, a single failed request can show a 100x burn rate over a 5-minute window.
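The low-traffic problem is easy to see with raw counts. The sketch below computes a burn rate from failed and total requests in a window and refuses to report one below a minimum request count. The function name and the 1000-request default floor are assumptions for illustration, not recommended values.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the minimum-request floor is an assumed value to tune.

# Burn rate from raw counts in a window, or "insufficient" below the floor.
burn_rate_from_counts() {
  local failed=$1 total=$2 slo=$3 min_requests=${4:-1000}
  awk -v f="$failed" -v t="$total" -v s="$slo" -v m="$min_requests" 'BEGIN {
    if (t == 0 || t < m) { print "insufficient"; exit }
    printf "%.1f\n", (f / t) / (1 - s)
  }'
}

burn_rate_from_counts 1 10 0.999 1      # 100.0 (one failure in ten requests)
burn_rate_from_counts 1 10 0.999        # insufficient (below the default floor)
burn_rate_from_counts 50 100000 0.999   # 0.5
```

In a Prometheus rule, the equivalent guard would be an extra condition on request volume in the same window, so the alert only evaluates when there is enough traffic to trust the ratio.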

Likely follow-ups

Why do you need both a short window and a long window? What goes wrong if you only use one?
How would you tune these burn rates for a service that has natural traffic spikes, like an e-commerce site during sales events?
What's the relationship between burn rate multiplier and the time to budget exhaustion? How do you pick the right multipliers?
How do you handle SLO alerting for services with very low traffic where statistical significance is a problem?
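For the follow-up on picking multipliers, one common way to reason about it is to decide what fraction of the budget you are willing to lose before the alert fires, then derive the multiplier from the long window. The sketch below assumes a 30-day (720-hour) SLO period; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; assumes a 30-day (720-hour) SLO period by default.

# Burn rate that spends budget_fraction of the budget within alert_window_hours:
#   burn_rate = budget_fraction * period_hours / alert_window_hours
burn_rate_for_target() {
  local budget_fraction=$1 alert_window_hours=$2 period_hours=${3:-720}
  awk -v f="$budget_fraction" -v w="$alert_window_hours" -v p="$period_hours" \
    'BEGIN { printf "%.1f\n", f * p / w }'
}

burn_rate_for_target 0.02 1   # 14.4 (2% of budget in 1 hour)
burn_rate_for_target 0.05 6   # 6.0  (5% of budget in 6 hours)
burn_rate_for_target 0.05 1   # 36.0 (5% of budget in 1 hour)
```

Read this way, a 36x threshold on a 1-hour window means roughly 5% of the monthly budget is already spent by the time the page goes out, which is the sensitivity-versus-noise tradeoff a candidate should be able to articulate.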

