Monday, January 13, 2025

The DevOps Checklist I Wish I’d Had Before My First Outage

The first time you get paged at 2 a.m. for a major outage, your brain goes straight into panic mode. Customers are complaining, the CTO is asking for updates, and every second feels like a lifetime.

If you’re not prepared, theory doesn’t matter. You need a clear, actionable checklist — and the muscle memory to execute it.

Looking back on my first major incident, I wish I’d had this exact DevOps checklist taped above my monitor. Today, you can.

1. Map Your Dependencies

Why? Because production rarely fails in isolation.

Before any deploy, you should know:

  • Which databases your service depends on

  • Which external APIs you call

  • Which queues, caches, or pub/sub systems you use

  • What depends on you downstream

Practical tip: maintain a simple dependencies.md in your repo:
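One way to keep that file honest is to generate it from a small structured map that lives next to the code. Here is a minimal sketch in Python; the service name and every dependency listed are hypothetical placeholders, not a recommendation.

```python
# Hypothetical dependency map for an example "checkout" service; replace with your own.
DEPENDENCIES = {
    "Databases": ["orders-postgres (primary, read/write)"],
    "External APIs": ["payments provider API"],
    "Queues, caches, pub/sub": ["redis session cache", "order-events queue"],
    "Downstream consumers": ["notification service", "reporting jobs"],
}


def render_dependencies(service: str, deps: dict[str, list[str]]) -> str:
    """Render the dependency map as Markdown suitable for dependencies.md."""
    lines = [f"# Dependencies: {service}", ""]
    for category, items in deps.items():
        lines.append(f"## {category}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_dependencies("checkout", DEPENDENCIES))
```

The four categories mirror the four questions above, so an empty section is an immediate signal that something hasn't been mapped yet.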


If you can’t write this down in 2 minutes, you’re already flying blind.

2. Define “Healthy” for Your Service

Why? If you don’t define “healthy,” you can’t measure brokenness.

Set:

  • Normal latency (e.g., p95 under 400ms)

  • Acceptable error rates (e.g., <0.5%)

  • SLOs with clear boundaries
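To make those numbers concrete, here is a small Python sketch that checks a window of request data against the example thresholds above. The function names are made up for illustration, and the nearest-rank percentile is just one reasonable choice.

```python
# Thresholds taken from the examples above; tune them to your own SLOs.
P95_LATENCY_MS = 400
MAX_ERROR_RATE = 0.005  # 0.5%


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math, 1-based
    return ordered[rank - 1]


def is_healthy(latencies_ms: list[float], errors: int, total: int) -> bool:
    """True when both the latency and the error-rate thresholds are met."""
    error_rate = errors / total if total else 0.0
    return p95(latencies_ms) < P95_LATENCY_MS and error_rate < MAX_ERROR_RATE
```

Whatever the exact numbers, the point is that "healthy" becomes a yes/no answer you can compute rather than a feeling.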

Prometheus alert example:

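groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses over 5 minutes; 0.005 is the 0.5% example above.
        # Assumes `status` holds numeric codes such as 500; adjust the matcher otherwise.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.005
        for: 5m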

No SLO, no clear alerting → no clue what to fix.

3. Write Runbooks Before You Need Them

You will forget everything under stress. A good runbook is priceless.

What to include:

✅ Steps to restart the service
✅ Where logs live
✅ Where metrics live
✅ Escalation contacts
✅ Known gotchas (like tricky feature flags)

Example snippet in a runbook.md:
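A runbook only helps if it is complete, so it can be worth checking that in CI. The sketch below is a Python check that a runbook.md has a heading for each item in the list above; the heading names and the sample runbook are assumptions for illustration.

```python
# Section titles mirror the checklist above; the exact headings are an assumption.
REQUIRED_SECTIONS = [
    "Restart steps",
    "Logs",
    "Metrics",
    "Escalation contacts",
    "Known gotchas",
]

# Hypothetical runbook skeleton; angle brackets mark what you fill in yourself.
SAMPLE_RUNBOOK = """# Runbook: checkout service
## Restart steps
1. <command that restarts the service>
## Logs
<where to find them>
## Metrics
<dashboard link>
## Escalation contacts
<primary on-call, then secondary>
## Known gotchas
<feature flags that change behavior>
"""


def missing_sections(runbook_text: str) -> list[str]:
    """Return required sections that have no matching Markdown heading."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
        if line.startswith("#")
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
```

An empty result means every section exists. It says nothing about whether the content is any good, which is what the drill in step 7 is for.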


4. Plan for Rollbacks

Never deploy without a way back.

Terraform rollback example:

Before you terraform apply:

If it goes wrong:
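Both steps can be sketched as a thin Python wrapper around the Terraform CLI. It relies on `terraform state pull` and `terraform state push`, which are real subcommands; the function names, the backup directory, and the injectable `run` parameter are assumptions of this sketch.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_state(backup_dir: Path, run=subprocess.run) -> Path:
    """Save the current Terraform state to a timestamped file before applying."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"terraform-{stamp}.tfstate"
    # `terraform state pull` prints the current state to stdout.
    result = run(["terraform", "state", "pull"],
                 capture_output=True, text=True, check=True)
    target.write_text(result.stdout)
    return target


def restore_state(snapshot: Path, run=subprocess.run) -> None:
    """Push a saved snapshot back as the current state (a last resort)."""
    run(["terraform", "state", "push", str(snapshot)], check=True)
```

Two caveats. Terraform refuses to overwrite a newer state with an older one unless you pass `-force`, so the restore above is deliberately left without it. And restoring state does not change real infrastructure; in most cases the actual rollback is re-applying the last known-good configuration, with the snapshot as a safety net.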


Ansible rollback example:


If deploy fails:
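Ansible's copy and template modules have a `backup` option that keeps the previous file around. The underlying backup-then-restore pattern looks roughly like this in plain Python; this is a sketch of the idea rather than Ansible itself, and `validate` is a hypothetical callback (for NGINX it might wrap `nginx -t`).

```python
import shutil
from pathlib import Path
from typing import Callable


def backup_config(path: Path) -> Path:
    """Copy the live config next to itself with a .bak suffix before changing it."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    return backup


def deploy_with_rollback(path: Path, new_content: str,
                         validate: Callable[[Path], bool]) -> bool:
    """Write the new config, run `validate`, and restore the backup if it fails."""
    backup = backup_config(path)
    path.write_text(new_content)
    if validate(path):
        return True
    shutil.copy2(backup, path)  # roll back to the last known-good file
    return False
```

The important property is that the rollback path runs automatically and uses a file you already know worked.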


Rule of thumb: If you haven’t practiced the rollback, you don’t have a rollback.

5. Test Alerts Like You Test Code

False alarms? Dead alerts? Both are equally dangerous.

Test example with Prometheus:
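One way to exercise the paging path end to end is to push a synthetic alert straight into Alertmanager through its v2 API (`POST /api/v2/alerts`) and watch where it lands. A minimal Python sketch follows; the `team`, `severity`, and `synthetic` labels are assumptions, so use whatever labels your routing tree actually matches on.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname: str, team: str, minutes: int = 5) -> list[dict]:
    """Build a payload for Alertmanager's v2 alerts endpoint."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": alertname,
            "team": team,            # assumed routing label
            "severity": "page",      # assumed routing label
            "synthetic": "true",     # marks the alert as a test
        },
        "annotations": {"summary": "Synthetic alert for a paging test; safe to ignore"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(base_url: str, payload: list[dict]) -> int:
    """POST the payload to <base_url>/api/v2/alerts and return the HTTP status."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

This tests routing and paging, not the alert expression itself. For the expression, Prometheus ships `promtool test rules`, which evaluates rules against synthetic series.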


Verify:

✅ Did your alert fire?
✅ Did it page the right team?
✅ Did the escalation policy work?

No surprises in production.

6. Make Logs Actually Useful

During a P1, your logs are your best friend. But only if they’re readable:

✅ JSON or structured logs
✅ Timestamps in UTC
✅ Consistent field names
✅ Centralized in a tool like Loki, Elasticsearch, or Datadog

Example with Serilog (.NET):


Searchable, parseable logs = faster triage.
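The same properties are available in most languages. As a minimal sketch using Python's standard `logging` module rather than Serilog, a formatter can emit one JSON object per line with a UTC timestamp. Passing extra fields under a `fields` key is a convention of this sketch, not a library feature.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object with a UTC timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields arrive via `extra={"fields": {...}}` (this sketch's convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would look like `get_logger("checkout").error("payment failed", extra={"fields": {"order_id": "A123", "region": "eu"}})`, which produces a single line you can filter by `order_id` or `region` in your log tool.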

7. Practice a Real Incident Drill

Humans panic under pressure. Simulation is the fix.

✅ Schedule a “chaos hour”
✅ Break something intentionally in staging
✅ Run through your runbook
✅ Validate comms in Slack / Teams
✅ Record time to recover
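Recording time to recover is easier if the scribe notes three timestamps during the drill. A small Python sketch, assuming ISO 8601 timestamps and hypothetical field names:

```python
from datetime import datetime


def drill_summary(started: str, detected: str, recovered: str) -> dict[str, float]:
    """Return minutes to detect and minutes to recover from ISO 8601 timestamps."""
    t0, t1, t2 = (datetime.fromisoformat(t) for t in (started, detected, recovered))
    return {
        "minutes_to_detect": (t1 - t0).total_seconds() / 60,
        "minutes_to_recover": (t2 - t0).total_seconds() / 60,
    }
```

Tracking both numbers across drills shows whether the runbook and alerts are actually getting better.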

Think of it as a fire drill for your service.

8. Plan Incident Command Roles

During an incident, roles clarify chaos. Define:

  • Incident commander (makes final decisions)

  • Communicator (posts updates to execs, status pages, Slack)

  • Scribe (documents timelines)

  • Operators (technical responders)

Example Slack message to assign roles fast:
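As a rough sketch, the message can be built from the four roles above and sent through a Slack incoming webhook, which accepts a JSON body with a `text` field. The handles and the incident title are placeholders, and the webhook URL is whatever your workspace issues.

```python
import json
import urllib.request


def roles_message(incident: str, commander: str, communicator: str,
                  scribe: str, operators: list[str]) -> str:
    """Build a one-glance role assignment message for the incident channel."""
    lines = [
        f":rotating_light: Incident: {incident}",
        f"Incident commander: {commander}",
        f"Communicator: {communicator}",
        f"Scribe: {scribe}",
        f"Operators: {', '.join(operators)}",
    ]
    return "\n".join(lines)


def post_to_slack(webhook_url: str, text: str) -> int:
    """Send the message through a Slack incoming webhook; returns the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, `roles_message("checkout down in EU", "@alice", "@bob", "@carol", ["@dave", "@erin"])` gives five lines that answer "who is doing what" before anyone has to ask.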


Simple, repeatable.

9. Understand Business Impact

Every stakeholder will ask:

  • Who is affected?

  • How much revenue is at risk?

  • Is there data loss or privacy risk?

Keep a one-liner ready in your runbook:

“This outage blocks checkout for all EU customers, potential revenue loss of ~$2,000/hour.”
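If you keep the hourly figure per feature and region, that sentence can be generated rather than improvised. A minimal Python sketch with hypothetical parameter names; the revenue figure is something only your business can supply.

```python
def impact_summary(what: str, who: str, revenue_per_hour: float,
                   hours_elapsed: float = 0.0) -> str:
    """Build the business-impact one-liner, optionally with the loss so far."""
    line = (f"This outage blocks {what} for {who}, "
            f"potential revenue loss of ~${revenue_per_hour:,.0f}/hour.")
    if hours_elapsed > 0:
        line += f" Estimated loss so far: ~${revenue_per_hour * hours_elapsed:,.0f}."
    return line
```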

No guessing.

10. Remember Humans Matter

Last but not least:

✅ Communicate clearly and regularly
✅ Be honest about unknowns
✅ Ask for help if you’re overwhelmed
✅ Stay calm and treat your teammates with respect

Incidents are stressful, but how you show up shapes your credibility and the trust your team places in you.

Real DevOps Scripts to Keep Handy

To make this checklist even more actionable, copy these to your personal toolbox:

Terraform state snapshot before deploy

Ansible NGINX config backup


Failover test for Postgres
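A failover test mostly comes down to one question: after promotion, which node thinks it is the primary? The Python sketch below asks each node via `psql` and the built-in `pg_is_in_recovery()` function. Host names are placeholders, and credentials are assumed to come from the environment or a `.pgpass` file.

```python
import subprocess


def is_replica(host: str, run=subprocess.run) -> bool:
    """Ask a Postgres node whether it is in recovery (acting as a standby)."""
    result = run(
        # -A: unaligned output, -t: rows only, so stdout is just "t" or "f".
        ["psql", "-h", host, "-At", "-c", "SELECT pg_is_in_recovery()"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "t"


def primaries(hosts: list[str], run=subprocess.run) -> list[str]:
    """Return the hosts that currently accept writes (not in recovery)."""
    return [h for h in hosts if not is_replica(h, run=run)]
```

After a failover drill you would expect `primaries(...)` to return exactly one host, and for it to be the node you promoted. Zero or two results are both worth an incident of their own.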

Simple Slack comms snippet


Chaos testing in Kubernetes
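The simplest chaos experiment is deleting one pod and watching whether the service notices. Here is a minimal Python sketch that wraps `kubectl`; the namespace and label selector are placeholders, and it should only ever be pointed at staging.

```python
import random
import subprocess


def pick_pod(namespace: str, selector: str, run=subprocess.run) -> str:
    """Pick one random pod (as 'pod/<name>') matching the label selector."""
    result = run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "name"],
        capture_output=True, text=True, check=True,
    )
    pods = result.stdout.split()
    if not pods:
        raise RuntimeError("no pods matched the selector")
    return random.choice(pods)


def kill_pod(namespace: str, pod: str, run=subprocess.run) -> None:
    """Delete the chosen pod; the controller should schedule a replacement."""
    run(["kubectl", "delete", pod, "-n", namespace], check=True)
```

A call such as `kill_pod("staging", pick_pod("staging", "app=web"))` removes one replica. The interesting part is what your dashboards and alerts do in the following minutes.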

Final Thoughts

Production outages are inevitable. Panicking is optional.

This checklist is what I wish someone had handed me before my first real incident: practical, battle-tested, no fluff. Tape it to your monitor, print it for your team, build it into your onboarding.

Because when your pager goes off at 2 a.m., the only thing that matters is being ready.

About Cerebrix

Smarter Technology Journalism.

Explore the technology shaping tomorrow with Cerebrix — your trusted source for insightful, in-depth coverage of engineering, cloud, AI, and developer culture. We go beyond the headlines, delivering clear, authoritative analysis and feature reporting that helps you navigate an ever-evolving tech landscape.

From breaking innovations to industry-shifting trends, Cerebrix empowers you to stay ahead with accurate, relevant, and thought-provoking stories. Join us to discover the future of technology — one article at a time.

2025 © CEREBRIX. Design by FRANCK KENGNE.
