Managing Your Team Through a Major Outage: The Leadership Test Nobody Prepares You For

It's 2 AM, and Everything Is Down

Your phone is buzzing. Multiple alerts. Teams is exploding. Your senior engineer just called - the kind of call that means "this is bad."

Your data center network just went down. Hard. Multiple sites are offline. Revenue-generating systems are dark. Customers are impacted, and your CEO is about to wake up to very angry messages.

Six months ago, you would have logged in and started troubleshooting.

Now you're a manager. Your job is different, and nobody actually prepared you for this moment.

This isn't about fixing the technical problem - your team will handle that. This is about leading people through a crisis, managing organizational chaos, communicating effectively under pressure, and ensuring that you learn from the experience without compromising morale.

Let me walk you through what actually matters when you're managing your team through a major outage. Not the theory from incident management frameworks, but the reality of what it's like when everything is on fire, and you're responsible for both fixing it and leading people through it.

The First 30 Minutes: When Everything Is Chaos

Your Job Is NOT to Fix the Problem

The instinct:

You want to dive in. Start troubleshooting. Check configs. Review logs. DO something technical to fix this.

Resist that instinct.

Your job as a manager during the first 30 minutes:

  1. Understand the scope - What's actually down? What's still working? How many customers/sites/users are impacted?

  2. Ensure the right people are engaged - Who's already working on this? Who else needs to be looped in? Do we have the expertise we need?

  3. Establish communication - Where is the team coordinating? Slack channel? Conference bridge? War room?

  4. Start the timeline - Someone needs to document what's happening and when. If your team is small and everyone is troubleshooting, this might be you.

  5. Begin stakeholder communication - Leadership needs to know this is happening and that it's being handled.

What I Learned:

The first time I managed through a major outage, I jumped into troubleshooting with my team. Felt productive. Felt useful.

But nobody was coordinating. Nobody was communicating with leadership. Nobody was documenting. Nobody was thinking about the big picture.

When we finally resolved the issue 4 hours later, leadership was furious - not about the outage, but about the information vacuum. They heard about the problem from customers before they heard from us.

The Hard Truth:

Your team doesn't need another troubleshooter. They need a leader who's managing the chaos so they can focus on fixing the problem.

Setting Up the War Room

Physical or Virtual:

Get your team in one place - whether that's a conference room, a Zoom call, or a dedicated Slack channel. You need centralized coordination.

Who Needs to Be There:

  • Engineers actively troubleshooting

  • Subject matter experts for affected systems

  • Someone documenting (even if it's rough notes)

  • You (managing coordination and communication)

Who Should NOT Be There:

  • People who can't contribute to the resolution

  • Leaders who only want updates (brief them separately)

  • Spectators (they're distracting)

The Structure:

Every 15-30 minutes, do a quick sync:

  • Current status update

  • What we've tried

  • What we're trying next

  • What we've ruled out

  • Any blockers or needs

This keeps everyone aligned without constant interruption.

Your Tone Matters

What Your Team Is Feeling:

  • Stress

  • Pressure

  • Maybe panic

  • Possibly guilt if they think they caused this

  • Fear of consequences

What They Need From You:

Calm. Confidence. Support. Focus.

Not:

  • Panic

  • Blame

  • Micromanagement

  • Adding to the pressure

The Example:

When that 2 AM outage hit, and I saw the scope, my first internal reaction was "oh shit, we're screwed, this is bad, leadership is going to lose it."

My first words to the team: "Okay, we've got this. Let's figure out what happened and get systems back online. What do we know so far?"

Fake it if you have to. Your team takes emotional cues from you. If you're calm and focused, they'll be calmer and more focused.

Managing stress and team dynamics under pressure relates to what I discussed in Both Sides of the Desk: Burnout - crisis management is a fast track to exhaustion if not handled carefully.

During the Outage: Leading While Others Fix

Managing Communication Upward

Leadership wants updates. Frequently. Even when you don't have new information.

The Template That Works:

Every 30-60 minutes, send a brief update:

"Network outage update as of [time]:

  • Current status: [what's down, what's working]

  • Impact: [customers/sites/users affected]

  • Root cause: [if known, or 'still investigating']

  • ETA for resolution: [best estimate or 'unknown at this time']

  • Next update: [timeframe]"

What NOT to Do:

Don't go silent. Don't wait until you have "complete information." Leadership sitting in the dark is worse than getting updates saying "still investigating."

The Uncomfortable Reality:

You might not know what's wrong yet. You might not have an ETA. That's okay. Say that explicitly rather than making up answers.

Bad: "We should have this fixed in 30 minutes." (When you have no idea)

Good: "We're actively troubleshooting. We've ruled out X and Y, investigating Z now. Will update in 30 minutes regardless of status."

Protecting Your Team From Organizational Chaos

What Happens During Outages:

Everyone wants updates. Sales is getting hammered by customers. Support is drowning in tickets. Executives are demanding answers. Other teams are trying to understand impact.

Your Job:

Be the filter. Take the incoming chaos and translate it into useful information for your team without distracting them.

The Shield:

"Hey team, FYI: leadership is asking about ETA. I told them we're investigating and will update when we know more. You all focus on resolution, I'll handle stakeholder communication."

What Your Team Needs:

To focus on fixing the problem without being interrupted every 10 minutes by someone asking "what's the status?"

What You Provide:

That focus. You absorb the organizational pressure and redirect it away from the people actually solving the problem.

Making Real-Time Decisions

You'll be asked:

"Should we fail over to backup site?" "Should we roll back the change?" "Should we engage vendor support?" "Should we wake up [senior person]?"

Your Job:

Make decisions based on input from your team. You might not know the technical details, but you can assess risk, impact, and trade-offs.

The Framework:

  • What's the risk of this action?

  • What's the risk of NOT taking this action?

  • What's the potential impact?

  • Do we have the expertise to do this safely?

  • Can we reverse this if it doesn't work?

Don't Pretend to Know:

"I'm not sure. [Engineer], what's your assessment? What do you recommend and why?"

Then make the call based on their input.

When to Overrule:

If the recommended action seems too risky given business impact, or if your team is so deep in troubleshooting they're not thinking about broader consequences.

"I hear you that we could try X, but given that we're already impacting revenue, I'm not comfortable with the additional risk. Let's try Y first."

The Human Element: Taking Care of Your Team

Monitoring for Burnout in Real-Time

Long outages are exhausting. Mentally, physically, emotionally.

Watch for:

  • People who've been troubleshooting for 4+ hours straight

  • Frustration turning into defeated resignation

  • Tunnel vision (focused on one theory and can't see alternatives)

  • Mistakes that indicate fatigue

Your Intervention:

"Hey [engineer], you've been at this for 5 hours. Take a 15-minute break. Walk around. Get food. Clear your head. We've got coverage."

They'll resist: "I'm fine, I want to keep working on this."

Your response: "I know. Take the break anyway. You'll be more effective after 15 minutes away from it."

The Reality:

Exhausted engineers make mistakes that extend outages. Forcing breaks is sometimes the fastest path to resolution.

Managing the "I Think I Caused This" Moment

What Often Happens:

Mid-incident, someone realizes their change from earlier might have caused or contributed to the outage.

The Emotional Load:

Guilt. Fear. Shame. Panic about consequences.

Your Response (Immediately):

"Okay, that's useful information. Let's focus on resolution right now. We'll do a full review later, where we look at process and systemic issues, not individual blame. Right now, I just need you focused on fixing this."

What This Does:

  • Acknowledges the information without making it about blame

  • Keeps focus on resolution, not fault

  • Signals that post-mortem will be constructive, not punitive

  • Allows the person to stay focused rather than spiraling

What NOT to Do:

"What?! Why did you make that change without testing?!"

That person is now useless for the rest of the incident and is possibly looking for a new job.

Managing Your Own Stress

You're human too. Major outages are stressful for managers.

You're Feeling:

  • Responsibility for the impact

  • Pressure from leadership

  • Concern for your team

  • Anxiety about consequences

  • Exhaustion (especially if it's 3 AM)

You Still Have to Lead:

Your team needs you to be calm and focused. Your stress management isn't optional.

What Helps:

  • Breathe (sounds dumb, but actually helps)

  • Remind yourself that you've handled crises before

  • Focus on what you can control (coordination, communication, support)

  • Accept that some things are outside your control

  • Remember that this will end

When It's Really Long:

Tag in another manager or senior leader to give yourself a break. You can't effectively lead on hour 8 of no sleep.

Managing your own stress while supporting your team is something I explored in Both Sides of the Desk: Burnout (Manager's Perspective) - you can't pour from an empty cup.

Communication During Crisis: What Actually Works

The Internal Updates

Your Team Needs:

Regular, honest updates about the organizational context that affects them.

Example:

"Quick update: Customers are impacted, and escalations are happening, but leadership understands we're working on it. No one is questioning your competence or effort. Focus on resolution, I'll keep handling the organizational side."

Why This Matters:

Your team is probably worried about consequences. Knowing that leadership isn't attacking them helps them focus.

The Leadership Updates

What Leadership Actually Wants:

  • Current status (even if it's "still broken")

  • Impact scope (how bad is this?)

  • What you're doing about it (even if it's "investigating")

  • When they'll hear from you next (predictability)

What They Don't Want:

  • Technical details they don't understand

  • Excuses

  • Silence

  • Surprises

The Format That Works:

Short, structured, frequent updates in business language.

Bad: "We're seeing BGP route flapping due to MTU mismatch in our MPLS core that's causing OSPF adjacencies to fail."

Good: "Network connectivity between sites is down. We've identified the technical cause and are implementing a fix. ETA 30 minutes."

The Customer Communication (If That's Your Role)

Some organizations have customer success or support handle this. If you're involved:

Be Honest:

Don't pretend it's not happening, and don't hide behind "we're investigating" when you already know what's broken.

Be Specific About Impact:

"Our network infrastructure is experiencing an outage affecting [specific services]. We're working on a resolution."

Don't Promise What You Can't Deliver:

"We expect resolution within 30 minutes" when you have no idea, sets you up for failure.

Update Proactively:

Don't make customers ask for updates. Give them updates whether the status has changed or not.

Managing the "Why Isn't It Fixed Yet?" Questions

You'll Get These:

From leadership, from other teams, from frustrated stakeholders who don't understand why it's taking so long.

The Response:

"We're working through systematic troubleshooting. Complex systems don't always have obvious failures. We've ruled out [X, Y, Z] and are investigating [A]. The alternative is guessing randomly, which would likely extend the outage."

What You're Doing:

Educating them on the reality of troubleshooting while defending your team's approach.

After Resolution: The Work Isn't Done

The Immediate Aftermath

Resolution happens. Systems are back. Crisis is over.

Your First Priority:

Thank your team. Publicly and specifically.

"Thank you all for the response. [Engineer] for identifying root cause. [Engineer] for implementing the fix. [Engineer] for managing vendor escalation. Excellent work under pressure."

Then:

"Everyone, get some rest. We'll do a full review tomorrow, but right now, take care of yourselves."

Don't:

Jump immediately into "what went wrong" or "how do we prevent this." Your team is exhausted. Give them recovery time.

Managing Leadership Expectations for Post-Mortem

Leadership often wants:

  • Immediate explanation of what happened

  • Guarantees it won't happen again

  • Someone to blame

  • A completed post-mortem document by tomorrow

Your Job:

Set realistic expectations.

"We'll conduct a thorough post-mortem. That takes time to do properly. I'll have an initial summary by [reasonable timeframe], and a complete review by [timeframe]. The goal is learning and improvement, not blame."

Why This Matters:

Rushing the post-mortem produces shallow analysis and often results in blaming individuals rather than fixing systemic issues.

The Post-Mortem Process

Timeline:

Give your team at least 24-48 hours to recover before doing the post-mortem. Exhausted people don't remember details accurately or think clearly about systemic issues.

Who Should Participate:

  • People directly involved in the response

  • Subject matter experts for affected systems

  • You (as facilitator and leader)

  • Potentially a facilitator from outside the team (for objectivity)

The Ground Rules (State These Upfront):

  1. This is a learning exercise, not a blame exercise

  2. We're looking for systemic issues and process improvements

  3. People made decisions with the information available at the time

  4. Mistakes happen - what matters is what we learn

  5. Everything discussed here is about improving systems and processes

The Framework:

Timeline of Events: What happened and when? (Factual, not interpretive)

Contributing Factors: What conditions allowed this to happen?

  • Technical factors (architecture, configuration, monitoring gaps)

  • Process factors (change management, testing, documentation)

  • Organizational factors (pressure, staffing, training)

What Went Well: Yes, really. What worked in your response?

  • Quick identification of the issue

  • Effective coordination

  • Good communication

  • Creative problem-solving

What Could Be Better: Where did the response have gaps?

  • Communication delays

  • Missing tools or access

  • Documentation gaps

  • Coordination challenges

Action Items: Specific, owned, measurable improvements

  • NOT: "Be more careful."

  • YES: "Implement automated testing for config changes before deployment - Owner: [Name], Due: [Date]."

The post-mortem process relates to managing technical debt - outages often reveal accumulated shortcuts and deferred maintenance, something I explored in Technical Debt: What Engineers Wish Managers Understood.

The Blameless Culture (And What That Actually Means)

"Blameless" doesn't mean "accountability-free."

It means:

  • We don't punish people for making mistakes during crisis response

  • We look for systemic issues that allowed mistakes to have a large impact

  • We assume people were doing their best with the available information

  • We focus on learning and prevention, not punishment

It doesn't mean:

  • Repeatedly making the same mistake is okay

  • Gross negligence has no consequences

  • Process violations are ignored

  • Individual growth and learning aren't addressed

The Example:

An engineer made a change that contributed to the outage. The blameless approach:

DON'T: "You caused the outage. This is your fault."

DO: "The change you made had unintended consequences. Let's talk about what testing could have caught this. What support do you need to prevent this in the future? What process changes would help?"

The Outcome:

The engineer learns, the process improves, and they're not paralyzed by fear of making future changes.

Managing the Engineer Who Feels Responsible

Even in blameless cultures, people blame themselves.

The 1-on-1 Conversation:

"Hey, I know you're probably replaying this in your head. Let me be clear: this wasn't about one person making one mistake. This was about systemic gaps that allowed a mistake to have this impact. Yes, your change was involved, but the real issues are a lack of testing, monitoring gaps, and an unclear change process. Those are organizational failures, not individual failures. I need you focused on learning and improvement, not beating yourself up."

What This Does:

Separates individual actions from systemic issues. Provides perspective. Prevents the guilt spiral that destroys confidence.

Follow Up:

Check in with them over the next few weeks. Major incidents can have a lasting psychological impact.

What Leadership Is Actually Evaluating

Here's what nobody tells you:

Leadership is watching how you handle outages. Not just whether they get resolved, but how you lead through them.

What They're Looking For:

1. Calm Under Pressure

Did you panic or did you lead? Your tone, communication, and decision-making under stress reveal your leadership capability.

2. Communication Competence

Did they have to chase you for updates, or did you proactively keep them informed? Could you translate technical issues into business impact?

3. Team Leadership

Did your team function effectively? Was the coordination good? Did people know what to do? That's a reflection of your leadership.

4. Post-Incident Learning

Are you extracting lessons and implementing improvements, or is the same thing going to happen again in 6 months?

5. Accountability Without Blame

Are you taking responsibility for your area while maintaining team morale and focusing on systemic improvement?

The Reality:

How you handle a major incident does more for your credibility (positive or negative) than months of normal operations.

Practical Tips That Actually Help

Before the Outage (Preparation)

Have a Plan:

Not a 50-page incident response document nobody reads. A simple, accessible plan that answers a few questions (a filled-in sketch follows the list):

  • How do we coordinate? (Teams channel? Conference bridge?)

  • Who gets notified? (Escalation paths)

  • Who communicates to leadership? (Usually you)

  • Where do we document? (Timeline tracking)
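
Here's what that one-pager might look like filled in. The channel names, roles, and cadence are placeholders, not a prescription - adapt them to your own organization:

Incident quick reference (example):

  • Coordination: a dedicated incident channel plus a standing conference bridge, with the bridge link pinned in the channel

  • Notification: on-call engineer → team lead → me; vendor support if hardware is suspected; wake the director for multi-site or customer-facing impact

  • Leadership communication: I send the structured update every 30-60 minutes; nobody else has to

  • Documentation: a shared timeline doc created at the start of the incident; whoever isn't hands-on keeps it current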

Practice:

Run incident response drills. Not full-scale disaster simulations - just "let's pretend there's an incident, how do we coordinate?"

Know Your Escalation Paths:

When do you wake up your boss? When do you engage vendors? When do you loop in other teams? Know this before 2 AM.

During the Outage

Document Everything:

Even rough notes. Timestamps. What you tried. What failed. What worked. You'll need this for the post-mortem.
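
The notes don't have to be polished. Something like this - with times and details invented purely for illustration - is enough to reconstruct the incident later:

02:04 - Monitoring alerts on site-to-site connectivity; on-call engineer paged

02:12 - Bridge opened; core links confirmed down at two sites

02:25 - Rolled back last night's config change; no improvement

02:50 - Vendor case opened; suspected hardware fault on the core switch

03:40 - Failed over to backup path; services recovering

04:05 - All sites confirmed online; monitoring for stability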

Stay Hydrated and Fed:

Sounds basic, but exhausted, hungry people make worse decisions. Make sure your team (and you) are taking care of basic needs during long incidents.

Resist the "One More Thing" Temptation:

At hour 6 of an outage, someone will suggest, "Let's also try this completely different theory." Sometimes that's valid. Often it's exhaustion making people grasp at straws. Step back and assess whether the new direction makes sense or if you're just flailing.

After the Outage

Write the Timeline While Memory Is Fresh:

Don't wait a week. Capture what happened within 24-48 hours while people remember details.

Share Lessons Learned Widely:

Not just within your team. Share with other teams, leadership, and the organization. Learning compounds when it's shared.

Track Action Items Relentlessly:

Post-mortem action items have a way of disappearing into the backlog. Don't let that happen. These are your incident prevention investments.

The Bottom Line: It's Not If, It's When

Major outages will happen. That's not a failure - it's the reality of complex systems.

What matters is how you lead through them:

During the outage:

  • Stay calm and focused

  • Coordinate effectively

  • Communicate proactively

  • Protect your team from chaos

  • Make thoughtful decisions under pressure

After the outage:

  • Conduct blameless post-mortems

  • Extract real lessons

  • Implement actual improvements

  • Support team members who need it

  • Share learning across the organization

Your credibility as a manager is built (or destroyed) in these moments.

Not in the day-to-day operations when everything works.

In the crisis moments when everything is broken, and people are looking to you for leadership.

The good news: You get better at this. Each incident teaches you something if you're willing to learn.

The reality: You're going to make mistakes. Your first major incident as a manager probably won't go perfectly. That's okay. Learn from it and do better next time.

The responsibility: Your team is counting on you to lead well during these moments. To stay calm when they're stressed. To make good decisions under pressure. To protect them from organizational chaos. To ensure they learn without being blamed.

That's what leadership actually looks like.

Not the meetings and one-on-ones and performance reviews.

The moments when everything is on fire, and you have to lead people through it.

📧 Want monthly insights on technical leadership and management? Subscribe to my newsletter for practical perspectives on leading through challenges, building resilient teams, and the real lessons from the trenches. First Tuesday of every month. Sign up here

How do you handle major incidents as a manager? What's worked for your team? What lessons did you learn the hard way? Share your experiences in the comments or connect with me on LinkedIn - we're all learning this together.
