How to Build an On-Call Rotation People Don’t Hate
On-call is a product. If it’s painful, it’s because the system is broken — not your engineers.
Most teams treat on-call as a staffing problem. It’s not. It’s an engineering design problem: alert quality, ownership boundaries, and the speed of mitigation.
The three failure modes
1. Paging for noise
If an alert doesn’t map to user impact, it’s not an alert — it’s a notification. Noise burns trust fast: a responder trained to ignore pages will eventually ignore the real one.
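What does “maps to user impact” look like in practice? Here’s a minimal sketch of burn-rate paging in Python. The 99.9% SLO and the 14.4x fast-burn threshold are assumptions for the example (14.4x over an hour burns roughly 2% of a 30-day budget, a commonly cited fast-burn value); none of the names belong to a real monitoring system.

```python
# Sketch: page on symptoms, not causes, using an error-budget burn rate.
# SLO_TARGET and the 14.4x threshold are assumptions for illustration.

SLO_TARGET = 0.999                 # assumed 99.9% availability SLO
ERROR_BUDGET = 1 - SLO_TARGET      # fraction of requests allowed to fail

def burn_rate(errors: int, requests: int) -> float:
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(errors: int, requests: int, fast_burn: float = 14.4) -> bool:
    """Page only on a fast burn: 14.4x sustained for an hour eats about 2%
    of a 30-day budget. Slower burns become tickets, not pages."""
    return burn_rate(errors, requests) >= fast_burn

# 0.5% errors against a 0.1% budget is a 5x burn: worth a ticket, not a page.
print(should_page(errors=50, requests=10_000))    # False
# 2% errors is a 20x burn: users are hurting right now, page.
print(should_page(errors=200, requests=10_000))   # True
```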
2. No fast mitigation path
If the only fix is “deploy a patch,” you’re doing incident response on hard mode. You need safe switches: levers an operator can flip in seconds without shipping code.
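Here’s roughly what a safe switch looks like in code: a minimal sketch assuming a hypothetical recommendations service with a flaky backend. The flag store is an in-memory dict standing in for whatever config or feature-flag system you actually run, and the auto-trip is deliberately crude.

```python
# Sketch of a kill switch, assuming a hypothetical recommendations service.
# FLAGS is an in-memory stand-in for your real config/feature-flag store.

FLAGS = {"recs_enrichment": True}

def call_risky_backend(user_id: str) -> list[str]:
    # Stand-in for the expensive, flaky integration behind the switch.
    raise TimeoutError("upstream overloaded")

def fetch_recommendations(user_id: str) -> list[str]:
    fallback = ["popular-item-1", "popular-item-2"]   # cheap, safe default
    if not FLAGS["recs_enrichment"]:
        return fallback                # switch is off: skip the risky path
    try:
        return call_risky_backend(user_id)
    except TimeoutError:
        # Crude auto-trip. A real circuit breaker would add a half-open
        # retry timer instead of staying off until a human resets it.
        FLAGS["recs_enrichment"] = False
        return fallback

print(fetch_recommendations("u123"))   # trips the switch, serves the fallback
print(FLAGS["recs_enrichment"])        # False: later calls skip the backend
```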
3. Undefined ownership
When everyone is responsible, nobody is. Incidents drag, handoffs get messy, and root causes never land.
The minimal on-call blueprint
- Service ownership map — every service has an owning team and an escalation path (see the sketch after this list).
- SLO-linked alerts — alert on symptoms that correlate with user pain (latency, error rate, saturation).
- Kill switches — feature flags or circuit breakers for risky integrations and expensive workflows.
- Runbooks — the top 5 alerts each get a one-page runbook covering meaning, first checks, mitigation, and escalation.
- Post-incident loop — one owner, one RCA, one follow-up list, tracked to completion.
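To make the first item concrete, here’s the ownership map as plain data; team and service names are made up. The point is that “who owns this?” becomes a lookup, and a missing entry becomes a visible defect instead of a Slack thread.

```python
# A service ownership map as plain data. Service and team names are
# hypothetical; in practice this lives in a repo and is reviewed like code.

OWNERSHIP = {
    "checkout-api": {
        "team": "payments",
        "escalation": ["oncall-payments", "payments-lead", "eng-director"],
    },
    "search-index": {
        "team": "discovery",
        "escalation": ["oncall-discovery", "discovery-lead", "eng-director"],
    },
}

def escalation_path(service: str) -> list[str]:
    entry = OWNERSHIP.get(service)
    if entry is None:
        # An unowned service is a finding in itself: file it, assign an owner.
        raise LookupError(f"no owner recorded for {service!r}")
    return entry["escalation"]

print(escalation_path("checkout-api"))
# ['oncall-payments', 'payments-lead', 'eng-director']
```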
A rule that changes everything
If an engineer gets paged for something they cannot fix within 15 minutes, the system is misdesigned. Either:
- the alert is wrong, or
- mitigation is missing, or
- ownership is unclear.
Fix that, and on-call stops feeling like punishment.
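One way to operationalize the rule: tag every slow page with the design gap it exposed, and file exactly one fix. A minimal sketch; the Page record and its fields are illustrative, not any paging tool’s schema.

```python
# Sketch: turn the 15-minute rule into a triage tag so every slow page
# yields exactly one design fix. The Page record and fields are illustrative.

from dataclasses import dataclass

@dataclass
class Page:
    user_impact: bool         # did the alert map to real user pain?
    had_mitigation: bool      # was there a switch or rollback to reach for?
    owner_known: bool         # did the responder know whose service it was?
    minutes_to_mitigate: int

def design_gap(page: Page) -> str | None:
    if page.minutes_to_mitigate <= 15:
        return None                    # within budget; nothing to file
    if not page.user_impact:
        return "wrong alert"           # delete or demote the alert
    if not page.had_mitigation:
        return "missing mitigation"    # build the kill switch or rollback
    if not page.owner_known:
        return "unclear ownership"     # fix the ownership map
    return "investigate"               # slow despite good design: dig deeper

print(design_gap(Page(user_impact=True, had_mitigation=False,
                      owner_known=True, minutes_to_mitigate=40)))
# missing mitigation
```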