The first time you get paged at 2 a.m. for a major outage, your brain goes straight into panic mode. Customers are complaining, the CTO is asking for updates, and every second feels like a lifetime.
If you’re not prepared, theory doesn’t matter. You need a clear, actionable checklist and the muscle memory to execute it.
Looking back on my first major incident, I wish I’d had this exact DevOps checklist taped above my monitor. Today, you can.
1. Map Your Dependencies
Why? Because production rarely fails in isolation.
Before any deploy, you should know:
Which databases your service depends on
Which external APIs you call
Which queues, caches, or pub/sub systems you use
What depends on you downstream
Practical tip: maintain a simple dependencies.md in your repo:
If you can’t write this down in 2 minutes, you’re already flying blind.
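One way to keep that file from going stale is to generate it from a small data structure that lives next to the code. Here is a minimal Python sketch of that idea. The service name and every dependency listed are hypothetical placeholders, and the output layout is just one reasonable choice.

```python
# Hypothetical dependency map for an imaginary "checkout-service";
# every name below is a placeholder, not taken from a real system.
DEPENDENCIES = {
    "Databases": ["orders-postgres (primary)"],
    "External APIs": ["payments-provider REST API"],
    "Queues, caches, pub/sub": ["redis-cache", "order-events queue"],
    "Downstream consumers": ["email-notifier", "analytics-pipeline"],
}


def render_dependencies(service: str, deps: dict[str, list[str]]) -> str:
    """Render the dependency map as plain text suitable for dependencies.md."""
    lines = [f"Dependencies for {service}", ""]
    for category, items in deps.items():
        lines.append(f"{category}:")
        # An empty category is still listed so gaps are visible in review.
        lines.extend(f"- {item}" for item in items or ["(none recorded)"])
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_dependencies("checkout-service", DEPENDENCIES), end="")
```

Redirect the output into dependencies.md and review the diff in every pull request that adds a new integration.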
2. Define “Healthy” for Your Service
Why? If you don’t define “healthy,” you can’t measure brokenness.
Set:
Normal latency (e.g., p95 under 400ms)
Acceptable error rates (e.g., <0.5%)
SLOs with clear boundaries
Prometheus alert example:
No SLO, no clear alerting → no clue what to fix.
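To make "healthy" concrete, it can help to write the definition as code you can run against a sample of recent requests. The sketch below uses the example thresholds from above (p95 under 400ms, error rate under 0.5%). The function names and the nearest-rank percentile method are my own assumptions, not something your monitoring stack necessarily uses.

```python
# Thresholds taken from the examples in the text; tune them per service.
P95_LATENCY_LIMIT_MS = 400.0
ERROR_RATE_LIMIT = 0.005  # 0.5%


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the latency samples."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_healthy(latencies_ms: list[float], errors: int, requests: int) -> bool:
    """True when both the latency and the error-rate targets are met."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    error_rate = errors / requests
    return p95(latencies_ms) < P95_LATENCY_LIMIT_MS and error_rate < ERROR_RATE_LIMIT


if __name__ == "__main__":
    sample = [120.0] * 95 + [650.0] * 5
    print("healthy:", is_healthy(sample, errors=2, requests=1000))
```

Your metrics backend may compute percentiles differently (for example from histogram buckets), so treat this as a way to pin down the definition rather than a replacement for real alerting.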
3. Write Runbooks Before You Need Them
You will forget everything under stress. A good runbook is priceless.
What to include:
✅ Steps to restart the service
✅ Where logs live
✅ Where metrics live
✅ Escalation contacts
✅ Known gotchas (like tricky feature flags)
Example snippet in a runbook.md:
4. Plan for Rollbacks
Never deploy without a way back.
Terraform rollback example:
Before you terraform apply:
If it goes wrong:
Ansible rollback example:
If deploy fails:
Rule of thumb: If you haven’t practiced the rollback, you don’t have a rollback.
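The simplest rollback to practice is "copy the file before you touch it, copy it back if things break." The Python sketch below does exactly that for any local config or state file. It is a generic file copy, not a feature of Terraform or Ansible, and the file and directory names in the usage example are hypothetical. Keep in mind that restoring a file does not by itself undo changes already applied to real infrastructure.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(path: Path, backup_dir: Path) -> Path:
    """Copy `path` into `backup_dir` under a UTC-timestamped name."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{path.name}.{stamp}.bak"
    shutil.copy2(path, target)  # copy2 also preserves modification times
    return target


def restore(backup: Path, path: Path) -> None:
    """Put a previously taken snapshot back in place of `path`."""
    shutil.copy2(backup, path)


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever your file actually lives.
    config = Path("nginx.conf")
    if config.exists():
        saved = snapshot(config, Path("backups"))
        print(f"snapshot written to {saved}")
```

Run the restore path in staging at least once, so the first time you use it is not during a real outage.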
5. Test Alerts Like You Test Code
False alarms? Dead alerts? Both are equally dangerous.
Test example with Prometheus:
Verify:
✅ Did your alert fire?
✅ Did it page the right team?
✅ Did the escalation policy work?
No surprises in production.
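Firing a real test alert is the only way to prove the whole path works, but a cheap static check can catch dead ends before that. The sketch below assumes you can export two mappings: alert name to owning team, and team to escalation steps. All names in it are hypothetical, and the "at least two steps" rule is my own assumption about what counts as a usable escalation policy.

```python
# Hypothetical alert-to-team routing and escalation data; in practice this
# would be loaded from wherever your alerting and paging config lives.
ALERT_ROUTES = {
    "HighErrorRate": "payments-oncall",
    "HighLatencyP95": "payments-oncall",
    "QueueBacklog": "platform-oncall",
}
ESCALATION_POLICIES = {
    "payments-oncall": ["primary", "secondary", "engineering-manager"],
    "platform-oncall": ["primary"],
}


def find_routing_gaps(routes: dict[str, str], policies: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means no gaps found."""
    problems = []
    for alert, team in sorted(routes.items()):
        steps = policies.get(team)
        if steps is None:
            problems.append(f"{alert}: team '{team}' has no escalation policy")
        elif len(steps) < 2:
            problems.append(f"{alert}: team '{team}' has no backup responder")
    return problems


if __name__ == "__main__":
    for problem in find_routing_gaps(ALERT_ROUTES, ESCALATION_POLICIES):
        print(problem)
```

A check like this fits nicely in CI, so a renamed team or deleted rotation fails a build instead of silently swallowing a page.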
6. Make Logs Actually Useful
During a P1, your logs are your best friend. But only if they’re readable:
✅ JSON or structured logs
✅ Timestamps in UTC
✅ Consistent field names
✅ Centralized in a tool like Loki, Elasticsearch, or Datadog
Example with Serilog (.NET):
Searchable, parseable logs = faster triage.
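The same idea works in any language. Here is a minimal sketch using only Python's standard logging module: one JSON object per line, UTC timestamps, and consistent field names. It is not Serilog's API, and the logger name and the order_id and region fields are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object with a UTC timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        # Fields passed via `extra={"context": {...}}` land on the record.
        context = getattr(record, "context", None)
        payload = dict(context) if isinstance(context, dict) else {}
        # Core fields are written last so context cannot overwrite them.
        payload.update({
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })
        return json.dumps(payload, default=str)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("payment failed", extra={"context": {"order_id": "A-1001", "region": "eu"}})
```

With fields like order_id at the top level of each line, a query in your log tool becomes a simple field filter instead of a regex hunt.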
7. Practice a Real Incident Drill
Humans panic under pressure. Simulation is the fix.
✅ Schedule a “chaos hour”
✅ Break something intentionally in staging
✅ Run through your runbook
✅ Validate comms in Slack / Teams
✅ Record time to recover
Think of it as a fire drill for your service.
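Recording time to recover is easier if the scribe notes milestones with timestamps during the drill and you do the arithmetic afterwards. A small sketch, assuming a timeline of ISO-8601 UTC timestamps. The milestone names and the sample times are hypothetical.

```python
from datetime import datetime

# Hypothetical drill timeline: (UTC ISO-8601 timestamp, milestone).
DRILL_TIMELINE = [
    ("2024-05-14T14:00:00+00:00", "fault_injected"),
    ("2024-05-14T14:04:30+00:00", "alert_fired"),
    ("2024-05-14T14:21:00+00:00", "service_recovered"),
]


def minutes_between(timeline: list[tuple[str, str]], start: str, end: str) -> float:
    """Minutes elapsed from milestone `start` to milestone `end`."""
    stamps = {name: datetime.fromisoformat(ts) for ts, name in timeline}
    return (stamps[end] - stamps[start]).total_seconds() / 60


if __name__ == "__main__":
    detect = minutes_between(DRILL_TIMELINE, "fault_injected", "alert_fired")
    recover = minutes_between(DRILL_TIMELINE, "fault_injected", "service_recovered")
    print(f"time to detect: {detect:.1f} min, time to recover: {recover:.1f} min")
```

Track these two numbers from drill to drill; the trend tells you more than any single run.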
8. Plan Incident Command Roles
During an incident, roles clarify chaos. Define:
Incident commander (makes final decisions)
Communicator (posts updates to execs, status pages, Slack)
Scribe (documents timelines)
Operators (technical responders)
Example Slack message to assign roles fast:
Simple, repeatable.
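If you want the role announcement to look the same every time, a tiny helper can build the text for you. The sketch below only produces the message; it does not call Slack, and the handles are hypothetical. The role names come from the list above.

```python
ROLES = ("Incident commander", "Communicator", "Scribe", "Operators")


def role_message(incident: str, assignments: dict[str, list[str]]) -> str:
    """Build a plain-text role announcement; unfilled roles are flagged."""
    lines = [f"Incident: {incident}"]
    for role in ROLES:
        people = assignments.get(role, [])
        lines.append(f"{role}: {', '.join(people) if people else 'UNASSIGNED'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(role_message(
        "checkout unavailable",
        {"Incident commander": ["@alice"], "Operators": ["@bob", "@carol"]},
    ))
```

Flagging unfilled roles as UNASSIGNED makes the gap obvious in the channel, which is usually enough to get someone to volunteer.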
9. Understand Business Impact
Every stakeholder will ask:
Who is affected?
How much revenue is at risk?
Is there data loss or privacy risk?
Keep a one-liner ready in your runbook:
“This outage blocks checkout for all EU customers, potential revenue loss of ~$2,000/hour.”
No guessing.
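The arithmetic behind that one-liner is simple enough to script, so nobody does mental math at 2 a.m. A minimal sketch, assuming you already know a rough hourly revenue figure for the affected feature. The function name and parameters are hypothetical.

```python
def impact_summary(feature: str, audience: str, revenue_per_hour: float, minutes_down: float) -> str:
    """One-line impact statement with hourly and cumulative revenue at risk."""
    if revenue_per_hour < 0 or minutes_down < 0:
        raise ValueError("inputs must be non-negative")
    lost_so_far = revenue_per_hour * minutes_down / 60
    return (
        f"This outage blocks {feature} for {audience}; "
        f"potential revenue loss of ~${revenue_per_hour:,.0f}/hour "
        f"(~${lost_so_far:,.0f} after {minutes_down:.0f} minutes)."
    )


if __name__ == "__main__":
    print(impact_summary("checkout", "all EU customers", 2000, 45))
```

The hourly figure is the part worth agreeing on ahead of time with finance or product, since the rest is multiplication.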
10. Remember Humans Matter
Last but not least:
✅ Communicate clearly and regularly
✅ Be honest about unknowns
✅ Ask for help if you’re overwhelmed
✅ Stay calm and treat your teammates with respect
Incidents are stressful, but how you show up shapes your credibility and how much your team trusts you.
Real DevOps Scripts to Keep Handy
To make this checklist even more actionable, copy these to your personal toolbox:
✅ Terraform state snapshot before deploy
✅ Ansible NGINX config backup
✅ Failover test for Postgres
✅ Simple Slack comms snippet
✅ Chaos testing in Kubernetes
Final Thoughts
Production outages are inevitable. Panicking is optional.
This checklist is what I wish someone had handed me before my first real incident: practical, battle-tested, no fluff. Tape it to your monitor, print it for your team, build it into your onboarding.
Because when your pager goes off at 2 a.m., the only thing that matters is being ready.