Your Engineer Made a Mistake That Cost Money: Now What?

Apr 14

The Call You Never Want to Get

It's 2:47 PM on a Tuesday. Your phone rings. It's your senior engineer, and you can hear the stress in their voice before they even speak.

"We have a problem. The e-commerce site is down. Customer-facing applications are offline. It's affecting all regions."

Your stomach drops. "What happened?"

"Configuration change. We're working on it."

"How long?"

"We've been down for 20 minutes. Working on rollback now."

You open your laptop. The alerts are everywhere. Your email is filling with automated notifications. Slack is exploding with messages from various teams asking what's happening.

45 minutes later, service is restored.

Total outage time: 65 minutes. The e-commerce site was down during peak afternoon traffic. Customer support is drowning in calls. Estimated revenue impact: $180,000. Reputational damage: harder to quantify but real.

The post-mortem reveals what happened:

A junior engineer was implementing a routing change. They made a configuration error that created a routing loop. The change went through testing in the lab (where it worked fine because the lab didn't replicate a specific edge case in production). When deployed to production, it brought down a critical path.

It's 5 PM. The engineer who made the mistake is sitting outside your office. They look like they're about to be fired. Or quit. Or both.

What you do in the next 30 minutes will determine:

Whether that engineer learns and grows or becomes paralyzed by fear
Whether your team feels safe enough to be honest about mistakes or starts hiding problems
Whether you build a culture of accountability or a culture of blame
Your credibility as a leader who handles difficult situations well

This is one of the hardest moments in management.

Let's talk about how to actually handle it - what works, what makes it worse, and how to balance accountability with psychological safety.

The First 30 Minutes: What NOT to Do

Before talking about what works, let's be clear about what makes this worse:

Mistake 1: The Immediate Blame Response

What it looks like:

"What were you thinking? How did you let this happen? Do you realize how much this cost us?"

Why it's destructive:

The engineer already knows they messed up. They're probably devastated. Piling on blame adds nothing productive.

What it creates:

Fear of taking any action in the future
Defensive behavior instead of honest reflection
Reluctance to admit future mistakes (they'll hide problems)
Damaged relationship with you as their manager
Team members watching and learning that mistakes = attacks

The reality:

You're venting your own stress and frustration. That's human, but it's not leadership.

Mistake 2: Dismissing It as "No Big Deal"

What it looks like:

"Don't worry about it. These things happen. It's fine."

Why it's wrong:

It's NOT fine. $180K in lost revenue isn't "no big deal." Dismissing it minimizes the seriousness and prevents real learning.

What it creates:

Confusion about what's actually important
Impression that you don't take serious mistakes seriously
No accountability for impact
Missed opportunity for growth

The balance:

You can be supportive without minimizing the impact.

Mistake 3: The Public Callout

What it looks like:

In a team meeting or on Slack: "This outage happened because [engineer] made a configuration error."

Why it's destructive:

Public blame destroys psychological safety. People won't take risks. They'll hide mistakes. Innovation dies.

What happens:

The engineer is humiliated
The team learns not to try anything challenging
Future mistakes get hidden until they become catastrophic
You look like a manager who throws people under the bus

The rule:

Praise publicly, address mistakes privately.

Mistake 4: Focusing Only on the Individual

What it looks like:

"You made a mistake. You need to be more careful. This can't happen again."

Why it's incomplete:

Mistakes rarely happen in a vacuum. There are usually systemic issues that allow an individual error to have such a large impact.

What gets missed:

Why didn't testing catch this?
Why was one person's mistake able to cause a complete outage?
What process failures enabled this?
What can the organization do differently?

The better approach:

Understand the individual's actions within the systemic context.

Mistake 5: No Conversation at All

What it looks like:

Avoiding the conversation entirely. Hoping it just blows over. Never addressing it directly with the person.

Why it fails:

The engineer doesn't know where they stand
No learning happens
Ambiguity creates anxiety
No clarity on what needs to change
Sets poor precedent for how mistakes are handled

The necessity:

This conversation needs to happen. How you have it matters enormously.

Having difficult conversations is fundamental to management, something explored in Delegation for Control Freaks - avoiding hard conversations doesn't make them go away.

What Actually Works: The Framework

Here's a framework for handling this conversation that balances accountability with growth:

Step 1: Check Your Own Emotional State First

Before you have the conversation:

Take 15 minutes. Process your own stress and frustration. Don't have this conversation while you're still angry about the outage.

Ask yourself:

Am I calm enough to have a productive conversation?
Am I approaching this as a developmental moment or a punishment?
What outcome do I actually want here?

The outcome you should want:

An engineer who learns from this, improves their practices, and becomes more careful, not an engineer who's paralyzed by fear or looking for a new job.

Step 2: Start With Concern, Not Blame

How to open:

"Thanks for coming in. I know today was really difficult. How are you doing?"

Not:

"We need to talk about what happened."

Why this works:

You're acknowledging them as a human being who just had a really bad day. You're starting with empathy.

What they might say:

"I feel terrible. I can't believe I caused that."

Your response:

"I can see that. Let's talk through what happened and figure out how to move forward."

Step 3: Understand What Happened (Facts First)

Ask them to walk through it:

"Walk me through what you were trying to do and what happened."

Listen without interrupting. Let them tell the full story.

What you're looking for:

Their understanding of what went wrong
Whether they understand the impact
What their thinking was at the time
Whether there were process failures or just individual errors

This isn't interrogation - it's understanding.

Step 4: Acknowledge the Impact Honestly

Don't minimize:

"This had a significant impact. The outage cost approximately $180K in lost revenue. Customer support was overwhelmed. Other teams were affected. This was serious."

Then immediately add:

"And I can see you understand that. You don't need me to pile on - you're already feeling the weight of it."

Why both parts matter:

Acknowledging impact = accountability

Acknowledging they already know = empathy

You're not letting them off the hook, but you're also not crushing them.

Step 5: Separate the Person from the Mistake

What to say:

"You made a mistake. That doesn't define you as an engineer. I still have confidence in your abilities. We're going to figure out how to prevent this from happening again."

Why this matters:

People catastrophize after major mistakes. They think: "I'm a terrible engineer. I shouldn't be doing this job."

Your job is to provide perspective:

This was a bad mistake. It doesn't mean they're a bad engineer.

Step 6: Focus on Systems, Not Just Individual Behavior

The question to ask:

"Beyond being more careful, what systemic changes would have prevented this?"

This shifts from blame to improvement:

Should testing have caught this? (Testing process improvement)
Should someone have reviewed the change before implementation? (Peer review process)
Should deployment have been phased? (Deployment process)
Should monitoring have caught it faster? (Monitoring improvement)
Should there be safeguards preventing this type of error? (Architectural improvement)

What this does:

Shows that mistakes often reveal systemic weaknesses, not just individual failure.
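
To make the last question on that list concrete, here is a minimal sketch of one possible safeguard: a pre-deployment check that walks a proposed next-hop table and flags routing loops. Everything in it is an assumption for illustration: the router names, the `next_hop` table shape (one destination, each router mapped to its next hop, `None` meaning the traffic is delivered), and the function name are hypothetical, not taken from any real tool or from the incident described here. It also only catches loops that are visible in the data you feed it, so it complements lab testing rather than replacing it.

```python
from typing import Optional

def find_routing_loop(next_hop: dict[str, Optional[str]], start: str) -> Optional[list[str]]:
    """Follow next hops from `start`; return the looping path, or None if no loop is found."""
    path: list[str] = []
    seen: set[str] = set()
    node: Optional[str] = start
    while node is not None:
        if node in seen:
            # We came back to a router we already visited:
            # everything from its first appearance onward is the loop.
            return path[path.index(node):] + [node]
        seen.add(node)
        path.append(node)
        # A missing entry or None is treated as "delivered here" (an assumption).
        node = next_hop.get(node)
    return None

# Hypothetical tables for a single destination prefix.
current = {"edge1": "core1", "core1": "core2", "core2": None}
proposed = {"edge1": "core1", "core1": "core2", "core2": "core1"}

for name, table in (("current", current), ("proposed", proposed)):
    loop = find_routing_loop(table, "edge1")
    if loop is None:
        print(f"{name}: no loop found")
    else:
        print(f"{name}: routing loop {' -> '.join(loop)}")
```

A check like this is cheap enough to run automatically before every routing change, which is the point: the safeguard doesn't depend on any one person remembering to be careful.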

The Accountability Conversation

Now comes the harder part - making sure this doesn't happen again without creating fear:

Distinguishing Between Mistake Types

Not all mistakes are the same. How you respond should depend on what kind of mistake this was:

Type 1: Honest Mistake Despite Following Process

What it looks like:

The engineer followed the proper process. Tested the change. Had it reviewed. Deployed carefully, but there was an edge case nobody anticipated.

How to handle:

Focus entirely on systemic improvements. This isn't about individual accountability - it's about improving processes to catch edge cases.

The conversation:

"You followed the process. This edge case wasn't something we anticipated. That's a gap in our testing, not a gap in your work. Let's talk about how to improve testing to catch cases like this."

Type 2: Mistake Due to Rushing or Carelessness

What it looks like:

The engineer knew the proper process but skipped steps. Didn't test thoroughly. Rushed because of time pressure. Made assumptions instead of verifying.

How to handle:

This requires direct conversation about behavior change.

The conversation:

"Help me understand why you skipped [step]. What was the pressure or thinking that led to that decision?"

Listen to their answer. There might be systemic pressures (unrealistic timelines, understaffing) that contribute to rushing.

Then:

"Going forward, I need you to commit to following the full process even when there's time pressure. If the timeline doesn't allow proper process, we escalate and adjust the timeline - we don't skip critical steps."

Type 3: Repeated Mistakes of the Same Type

What it looks like:

This isn't the first time they've made this type of error. Pattern of similar mistakes.

How to handle:

This is where accountability gets more serious.

The conversation:

"This is the third time we've had an issue with [specific pattern]. We talked about this after the last incident. What's preventing the change we discussed from happening?"

This might reveal:

They need more training
They need closer supervision
They're overwhelmed and need a workload adjustment
They're not taking this seriously enough

The harder truth:

If repeated mistakes continue after clear conversations and support, this might be a performance issue requiring a formal performance improvement plan or other consequences.

Type 4: Violation of Clear Policy or Safety Rule

What it looks like:

The engineer violated a clear, known rule designed to prevent exactly this type of problem.

How to handle:

This is the most serious type and might require formal disciplinary action.

The conversation:

"The rule about [policy] exists specifically to prevent this type of incident. You were aware of this rule. Help me understand why you chose not to follow it."

Depending on severity and circumstances, this might involve HR.

What to Ask For Going Forward

After understanding what happened and acknowledging the impact, focus on the future:

The Commitment You Need

"Here's what I need from you going forward:"

1. Specific behavior change:

"I need you to commit to [specific action - thorough testing, peer review, whatever was missed] on every change, even when the timeline is tight."

2. Proactive communication:

"If you're ever uncertain about a change or feel pressured to skip steps, I need you to come to me before making that call."

3. Contributing to process improvement:

"I'd like you to help develop the improved testing process we discussed. You understand this failure mode better than anyone now."

Why #3 matters:

Turning them into part of the solution gives them agency and demonstrates trust.

The Support You're Providing

"Here's what I'm going to do:"

1. Process improvements:

"I'm going to work on [specific process improvement] so this type of error gets caught before production."

2. Training or resources:

"If you need additional training on [area], let's get that scheduled."

3. Continued trust:

"I'm not taking you off critical work. I still trust you. But I'm going to check in more frequently for the next few weeks as you rebuild confidence."

Why this matters:

Shows you're not punishing them by taking away their responsibility. You're supporting their growth.

The Team Conversation

After the private conversation, you need to address this with the broader team (without the individual present):

What to Say in the Team Meeting

Don't:

"This outage happened because [person] made a mistake."

Do:

"Yesterday's outage was caused by a configuration error that made it through our testing process. We've identified several process improvements:

Enhanced testing to catch this class of errors
Additional peer review on routing changes
Improved monitoring to detect this failure mode faster

These improvements will reduce the likelihood of similar incidents regardless of who's implementing changes."

What this does:

Addresses the incident without naming individuals
Focuses on systemic improvements
Shows the team that mistakes lead to learning, not blame
Sets the cultural tone

Building a culture where mistakes lead to learning aligns with Managing Your Team Through a Major Outage: blameless post-mortems are critical to improvement.

Creating Psychological Safety While Maintaining Accountability

The balance:

Psychological safety: People feel safe admitting mistakes, asking questions, and raising concerns.

Accountability: People take responsibility for impact and commit to improvement.

These aren't opposites - you need both.

What psychological safety looks like:

Engineer reports: "I made a change, and now I'm seeing weird behavior. I'm not sure if it's related, but I wanted to flag it."
Response: "Good call raising it. Let's investigate together."

What accountability looks like:

Same engineer, after investigation: "It was my change. Here's what went wrong."
Response: "Okay, let's understand what happened and how to prevent it. What did you learn?"

What kills psychological safety:

Punishing people for honest mistakes or for raising concerns early.

What kills accountability:

Never having direct conversations about impact or never asking for behavior change.

When Mistakes Reveal Bigger Problems

Sometimes an individual mistake exposes organizational dysfunction:

The Pressure to Rush

What you discover:

Engineers rushed because they were told the change had to be done by the end of the day. No time for proper testing.

The real problem:

Unrealistic timelines are being set without consulting the people doing the work.

Your job as manager:

Address the timeline pressure with leadership. "We're creating conditions where people feel they have to choose between doing it right and doing it fast. That creates incidents."

The Knowledge Gap

What you discover:

The engineer didn't know the proper process existed or didn't understand why certain steps mattered.

The real problem:

Training gap or unclear documentation.

Your job:

Fix the training and documentation, not just blame the person for not knowing.

The Understaffing Problem

What you discover:

The engineer was handling too many things at once. Overloaded. Made a mistake due to cognitive overload.

The real problem:

The team is underwater. Mistakes will continue until the workload is addressed.

Your job:

Advocate for resources or a reduced scope. You can't prevent mistakes by telling exhausted, overloaded people to "be more careful."

The Process That Doesn't Work

What you discover:

The "official process" is so cumbersome that everyone bypasses it. The engineer used the workaround everyone uses, which has gaps.

The real problem:

Process is broken. People are finding workarounds because the official process is unusable.

Your job:

Fix the process so following it is the path of least resistance, not something to work around.

The Follow-Up Weeks

The conversation doesn't end after 30 minutes. Here's what happens next:

Week 1: Frequent Check-Ins

What to do:

Check in daily or every other day.

"How are you doing? How are you feeling about work right now?"

Why:

They might be spiraling. Questioning whether they should be in this role. You need to provide stability and reassurance.

Watch for:

Excessive caution (paralyzed by fear)
Risk avoidance (not wanting to touch anything)
Confidence issues
Withdrawal from the team

Address these directly if you see them.

Weeks 2-4: Monitoring and Support

What to do:

Continue regular check-ins, but less frequently.

Ensure they're implementing the changes you discussed.

Watch for both overcorrection (too cautious) and undercorrection (not taking it seriously enough).

Give them meaningful work:

Don't sideline them. Give them real work. Show continued trust.

Months 2-3: Return to Normal

What to do:

Return to normal check-in cadence.

Acknowledge growth: "I've seen real improvement in [specific behavior]. That's exactly what I was hoping for."

Close the loop:

"I consider this incident closed. You've learned from it, we've improved processes, and I have full confidence in you going forward."

Why this matters:

Explicit closure prevents this from hanging over them indefinitely.

When the Same Mistake Happens Again

This is the hard scenario:

You had the supportive conversation. You provided resources. You clarified expectations.

And it happens again.

Now what?

The Second Conversation Is Different

First time: Developmental. Supportive. Learning-focused.

Second time: More serious. Pattern discussion. Potential performance issue.

What to say:

"We're here again. Same type of issue. We talked about this after the last incident. We put support in place. Help me understand what's happening."

Listen to their explanation.

Then:

"I need to be direct: this pattern can't continue. We've provided support and clarity on expectations. If this happens again, we'll need to have a formal performance discussion and consider whether this role is the right fit."

Why this is necessary:

Repeated mistakes after clear support and expectations suggest either:

They're not capable of the role
They're not taking it seriously
There's something else going on (personal issues, burnout, etc.)

Any of these needs to be addressed directly.

When Termination Might Be Necessary

Nobody wants to fire someone over a mistake. But sometimes repeated mistakes despite support indicate it's not working.

Signs it might be time:

Multiple serious incidents despite clear conversations and support
No improvement in behavior or practice
Other performance issues beyond just mistakes
Resistance to feedback or accountability

Before termination:

Document everything
Involve HR
Ensure you've provided clear expectations and support
Consider whether a formal performance improvement plan is warranted first

The hard truth:

Sometimes the kindest thing is acknowledging that this role isn't the right fit and helping someone find work that better matches their capabilities.

What This Reveals About You as a Manager

How you handle mistakes reveals your leadership character:

Managers Who Blame

What they do:

Focus on the fault. Make examples of people. Create fear.

What they get:

People who hide mistakes
Risk-averse teams
Lack of innovation
High turnover
Culture of CYA

Managers Who Ignore

What they do:

Minimize mistakes. Avoid hard conversations. Never establish accountability.

What they get:

Repeated mistakes
Lack of learning
Unclear expectations
Resentment from high performers who see low performance being tolerated

Managers Who Develop

What they do:

Address mistakes directly but developmentally. Balance accountability with support. Focus on systems and growth.

What they get:

Teams that learn and improve
Psychological safety with accountability
Engineers who take intelligent risks
Culture of continuous improvement
Trust and respect

Which manager are you?

The mistake isn't the defining moment - how you handle it is.

The Bottom Line: Mistakes Are Leadership Opportunities

Here's what becomes clear after handling these situations multiple times:

Mistakes will happen. You manage humans implementing complex changes in production environments. Perfection is impossible.

How you handle mistakes defines your culture more than anything else you do.

If you blame and punish, people hide problems until they're catastrophic.

If you dismiss and ignore, people don't learn, and mistakes repeat.

If you address developmentally, people learn, grow, and take smart risks.

The framework that works:

Check your own emotional state before the conversation
Start with empathy, not blame
Understand what happened factually
Acknowledge impact honestly
Separate the person from the mistake
Focus on systemic improvements, not just individual behavior
Be clear about what needs to change
Provide support for that change
Follow up over the weeks to ensure change happens
Close the loop explicitly when it's resolved

The balance:

Psychological safety (people feel safe admitting mistakes) + Accountability (people take responsibility and change behavior) = a high-performing team that learns and improves.

What repeated mistakes after support might mean:

Not the right role, not taking it seriously, or external factors affecting performance. Address directly.

Your job isn't preventing every mistake.

Your job is to create an environment where:

Mistakes are caught early
When they happen, people are honest about them
Learning happens systematically
People grow from the experience
The same mistakes aren't repeated

The engineer is sitting outside your office, waiting to talk about their mistake.

How you handle the next 30 minutes will show them - and everyone watching - what kind of leader you are.

Make it count.

How do you handle it when your engineer makes a costly mistake? What's worked for you? What have you seen go wrong? Share your experiences in the comments or connect with me on LinkedIn - we've all been on both sides of this conversation.

Pat Allen
