Cerebrix

Monday, January 13, 2025

The DevOps Checklist I Wish I’d Had Before My First Outage

Louisa Medina

The first time you get paged at 2 a.m. for a major outage, your brain goes straight into panic mode. Customers are complaining, the CTO is asking for updates, and every second feels like a lifetime.

When you're unprepared, theory won't save you. You need a clear, actionable checklist and the muscle memory to execute it.

Looking back on my first major incident, I wish I’d had this exact DevOps checklist taped above my monitor. Today, you can.

1. Map Your Dependencies

Why? Because production rarely fails in isolation.

Before any deploy, you should know:

Which databases your service depends on
Which external APIs you call
Which queues, caches, or pub/sub systems you use
What depends on you downstream

Practical tip: maintain a simple dependencies.md in your repo that lists all four.

If you can’t write this down in 2 minutes, you’re already flying blind.
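The same map can also be kept in a form a script can query. The following is a minimal sketch, assuming a hand-maintained dictionary; the service names are hypothetical and stand in for whatever your dependencies.md lists. It answers the "what depends on you downstream" question by walking the map in reverse.

```python
from collections import defaultdict

# Hypothetical service names, for illustration only.
DEPENDS_ON = {
    "checkout-api": ["orders-db", "payments-api", "events-queue"],
    "orders-worker": ["orders-db", "events-queue"],
    "storefront": ["checkout-api", "session-cache"],
}


def dependents_of(target: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that depends on `target`, directly or transitively."""
    # Invert the map: dependency -> services that call it.
    reverse = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(service)

    # Walk upward from the failing component to everything that relies on it.
    affected, stack = set(), [target]
    while stack:
        current = stack.pop()
        for service in reverse[current]:
            if service not in affected:
                affected.add(service)
                stack.append(service)
    return affected


if __name__ == "__main__":
    print(sorted(dependents_of("orders-db", DEPENDS_ON)))
```

In this made-up map, a failure in orders-db reaches storefront through checkout-api, which is exactly the kind of indirect dependency that is easy to miss at 2 a.m.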

2. Define “Healthy” for Your Service

Why? If you don’t define “healthy,” you can’t measure brokenness.

Set:

Normal latency (e.g., p95 under 400ms)
Acceptable error rates (e.g., <0.5%)
SLOs with clear boundaries

Prometheus alert example:

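groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Fire when 5xx responses exceed 0.5% of all requests over 5 minutes (threshold taken from the example above)
        # Assumes the status label stores the class ("5xx"); use status=~"5.." if it stores raw codes
        expr: sum(rate(http_requests_total{status="5xx"}[5m])) / sum(rate(http_requests_total[5m])) > 0.005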

No SLO, no clear alerting → no clue what to fix.
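To make the definition of "healthy" concrete, here is a small sketch that checks a window of request data against the example thresholds from this section (p95 under 400ms, error rate under 0.5%). The function names and the nearest-rank percentile method are choices made for this illustration, not part of any monitoring tool.

```python
import math

P95_LATENCY_MS = 400    # example latency threshold from this section
MAX_ERROR_RATE = 0.005  # example error-rate threshold (0.5%)


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def is_healthy(latencies_ms: list[float], error_count: int, request_count: int) -> bool:
    """True when both the latency and the error-rate thresholds are met."""
    error_rate = error_count / request_count if request_count else 0.0
    return p95(latencies_ms) < P95_LATENCY_MS and error_rate < MAX_ERROR_RATE


if __name__ == "__main__":
    sample = [120.0] * 97 + [850.0] * 3
    print(is_healthy(sample, error_count=2, request_count=1000))
```

Writing the thresholds down as constants forces the conversation about what the numbers should be, which is most of the value.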

3. Write Runbooks Before You Need Them

You will forget everything under stress. A good runbook is priceless.

What to include:

✅ Steps to restart the service
✅ Where logs live
✅ Where metrics live
✅ Escalation contacts
✅ Known gotchas (like tricky feature flags)

Keep all of it in a single runbook.md.

4. Plan for Rollbacks

Never deploy without a way back.

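Terraform rollback: before you terraform apply, snapshot the current state, so that if it goes wrong you have a known-good copy to restore.

Ansible rollback: back up the config you are about to change (an NGINX config, for example), so that if the deploy fails you can put the previous version back.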

Rule of thumb: If you haven’t practiced the rollback, you don’t have a rollback.
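The pattern behind any rollback is the same: save a known-good copy, apply the change, validate, and restore if validation fails. The sketch below shows that pattern in plain Python rather than Terraform or Ansible. The `validate` callable is a placeholder for whatever check fits your service (for an NGINX config, that might be a wrapper around `nginx -t`).

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def backup(config: Path) -> Path:
    """Copy `config` to a timestamped sibling file and return the backup path."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    target = config.with_name(f"{config.name}.{stamp}.bak")
    shutil.copy2(config, target)
    return target


def deploy_with_rollback(config: Path, new_content: str,
                         validate: Callable[[Path], bool]) -> bool:
    """Write new_content; restore the backup if validate(config) returns False."""
    saved = backup(config)
    config.write_text(new_content)
    if validate(config):
        return True
    shutil.copy2(saved, config)  # roll back to the known-good copy
    return False
```

Running this against a throwaway file in staging is a cheap way to practice the rollback path before you need it.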

5. Test Alerts Like You Test Code

False alarms? Dead alerts? Both are equally dangerous.

Fire a test alert through Prometheus on purpose, then verify:

✅ Did your alert fire?
✅ Did it page the right team?
✅ Did the escalation policy work?

No surprises in production.
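One way to exercise the paging path without breaking anything is to push a short-lived synthetic alert straight into Alertmanager through its v2 HTTP API. The sketch below builds such a payload; the label names (`team`, `severity`, `synthetic`) and the default URL are assumptions and need to match your own routing config. Note that this tests routing and escalation, not the rule expression itself; for that, `promtool test rules` is the usual tool.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def synthetic_alert(name: str, team: str, minutes: int = 5) -> list[dict]:
    """Build an Alertmanager v2 payload for a short-lived test alert."""
    now = datetime.now(timezone.utc)
    return [{
        # Label names here are assumptions; use the ones your routes match on.
        "labels": {"alertname": name, "team": team,
                   "severity": "page", "synthetic": "true"},
        "annotations": {"summary": "Synthetic alert for a routing test, safe to ignore"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send(alerts: list[dict], base_url: str = "http://localhost:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send(synthetic_alert("HighErrorRate", "payments"))` against a staging Alertmanager should page whoever is on call for that team; if nobody's phone rings, you have found the problem before production did.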

6. Make Logs Actually Useful

During a P1, your logs are your best friend. But only if they’re readable:

✅ JSON or structured logs
✅ Timestamps in UTC
✅ Consistent field names
✅ Centralized in a tool like Loki, Elasticsearch, or Datadog

In .NET, Serilog can emit logs in this shape.

Searchable, parseable logs = faster triage.
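For comparison, here is a minimal sketch of the same idea using only Python's standard `logging` module, as a stand-in rather than a Serilog example: one JSON object per line, UTC timestamps, and consistent field names. Passing extra fields through `extra={"fields": {...}}` is a convention of this sketch, not a standard-library feature.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a UTC timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.error("payment failed", extra={"fields": {"order_id": "A-1001", "status": 502}})
```

Because every line is valid JSON, a log tool can filter on `status` or `order_id` directly instead of relying on fragile text matching.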

7. Practice a Real Incident Drill

Humans panic under pressure. Simulation is the fix.

✅ Schedule a “chaos hour”
✅ Break something intentionally in staging
✅ Run through your runbook
✅ Validate comms in Slack / Teams
✅ Record time to recover

Think of it as a fire drill for your service.

8. Plan Incident Command Roles

During an incident, roles clarify chaos. Define:

Incident commander (makes final decisions)
Communicator (posts updates to execs, status pages, Slack)
Scribe (documents timelines)
Operators (technical responders)

Keep a template Slack message ready so you can assign roles fast.

Simple, repeatable.
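A role-assignment message can be generated from a template so nobody has to compose it under pressure. The sketch below is one possible shape, with hypothetical handles; it refuses to produce a message while any of the four roles is unassigned, and wraps the text in the `{"text": ...}` body that Slack incoming webhooks accept.

```python
import json

ROLES = ("Incident commander", "Communicator", "Scribe", "Operators")


def role_message(incident: str, assignments: dict[str, str]) -> str:
    """Format a role announcement; raise if any role is left unassigned."""
    missing = [role for role in ROLES if role not in assignments]
    if missing:
        raise ValueError(f"unassigned roles: {', '.join(missing)}")
    lines = [f"Incident: {incident}"]
    lines += [f"{role}: {assignments[role]}" for role in ROLES]
    return "\n".join(lines)


def webhook_body(text: str) -> str:
    """JSON body in the shape Slack incoming webhooks accept."""
    return json.dumps({"text": text})


if __name__ == "__main__":
    # Handles below are made up for the example.
    print(role_message("Checkout errors in EU", {
        "Incident commander": "@ana",
        "Communicator": "@raj",
        "Scribe": "@li",
        "Operators": "@sam, @kim",
    }))
```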

9. Understand Business Impact

Every stakeholder will ask:

Who is affected?
How much revenue is at risk?
Is there data loss or privacy risk?

Keep a one-liner ready in your runbook:

“This outage blocks checkout for all EU customers, potential revenue loss of ~$2,000/hour.”

No guessing.
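The one-liner is easier to keep honest if the running total is computed rather than estimated in your head. A tiny sketch, using the example figure above; the function name and wording are illustrative only.

```python
def impact_line(what: str, who: str, revenue_per_hour: float, minutes_down: int) -> str:
    """Format a business-impact summary with the revenue at risk so far."""
    at_risk = revenue_per_hour * minutes_down / 60
    return (
        f"This outage blocks {what} for {who}, potential revenue loss of "
        f"~${revenue_per_hour:,.0f}/hour (~${at_risk:,.0f} after {minutes_down} minutes)."
    )


if __name__ == "__main__":
    print(impact_line("checkout", "all EU customers", 2000, 45))
```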

10. Remember Humans Matter

Last but not least:

✅ Communicate clearly and regularly
✅ Be honest about unknowns
✅ Ask for help if you’re overwhelmed
✅ Stay calm and treat your teammates with respect

Incidents are stressful, but how you show up will define your credibility and trust with the team.

Real DevOps Scripts to Keep Handy

To make this checklist even more actionable, keep a script for each of these in your personal toolbox:

✅ Terraform state snapshot before deploy
✅ Ansible NGINX config backup
✅ Failover test for Postgres
✅ Simple Slack comms snippet
✅ Chaos testing in Kubernetes

Final Thoughts

Production outages are inevitable. Panicking is optional.

This checklist is what I wish someone had handed me before my first real incident: practical, battle-tested, no fluff. Tape it to your monitor, print it for your team, build it into your onboarding.

Because when your pager goes off at 2 a.m., the only thing that matters is being ready.


July 24, 2025
