Error Budget Burn Investigation
It's Monday morning. You check the dashboard and see that your service burned 80% of its monthly error budget over the weekend. Walk me through how you'd investigate this and what you'd do next.
Answer out loud first, then check yourself against the model answer.
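Before you answer, it can help to put numbers on the scenario. The sketch below is only an illustration: the question does not state an SLO target, so it assumes a 99.9% availability SLO over a 30-day window and treats the weekend as 48 hours. The function names are hypothetical, not part of any library. Under those assumptions the full budget is roughly 43.2 minutes, 80% of it is roughly 34.6 minutes, and the weekend burn rate works out to about 12 times the sustainable pace.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Total allowed 'bad' minutes in the SLO window (time-based availability)."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(budget_fraction_used: float, elapsed_hours: float, window_days: float) -> float:
    """How many times faster than an even, exactly-on-budget pace the budget was spent.

    A burn rate of 1.0 means the budget would run out exactly at the end of the window.
    """
    window_hours = window_days * 24
    return budget_fraction_used / (elapsed_hours / window_hours)


if __name__ == "__main__":
    # Assumed values: the SLO target and window are NOT given in the question.
    slo = 0.999
    window_days = 30
    weekend_hours = 48
    fraction_burned = 0.80

    budget = error_budget_minutes(slo, window_days)
    rate = burn_rate(fraction_burned, weekend_hours, window_days)

    print(f"Total budget:   {budget:.1f} minutes")
    print(f"Burned:         {budget * fraction_burned:.1f} minutes")
    print(f"Weekend burn rate: {rate:.1f}x the sustainable pace")
```

A burn rate this far above 1 over a two-day stretch is the kind of number worth stating early in your answer, because it frames whether you are looking at one sharp incident or a slow, sustained degradation.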
More SRE interview questions
Also worth your time on this topic
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
A step-by-step checklist for defining service level objectives, picking the right service level indicators, and using error budgets to make better decisions about reliability vs. feature velocity.
45-90 minutes
Error Budget Management
Your service has a 99.9% availability SLO over a 30-day window. How much downtime does that give you, and what do you actually do with that error budget day-to-day?
mid
SLOs, SLIs, and Error Budgets: A Practical Implementation Guide
Your service went down at 2 AM and nobody could agree on whether it was "bad enough" to page someone. SLOs, SLIs, and error budgets fix that. Here is how to define, measure, and act on them with real Prometheus queries and alerting rules.