The Recent Outages: Context, Not Outrage

Let's address the elephant in the room: Both AWS and Azure had significant outages recently, and if you're a manager responsible for infrastructure decisions, you've probably heard some version of "this is why we shouldn't trust the cloud" from someone on your team or in leadership.

Here's what actually happened: major cloud providers experienced service disruptions that impacted thousands of customers. Services went down. Businesses lost money. Engineers scrambled. It wasn't good.

But before we use this as ammunition in the cloud vs. on-prem debate, let's add some context.

Cloud providers are relatively transparent about major outages. They typically publish post-incident reviews that describe what broke and why. When was the last time your on-prem infrastructure came with that level of visibility into failures?

Outages happen everywhere. Your on-prem data center isn't immune to failures. The difference is that when AWS goes down, it makes headlines. When your local data center has issues, it's just a really bad Tuesday that nobody outside your organization hears about.

Scale matters for context. AWS and Azure run at a scale most organizations will never approach. They're handling edge cases and failure scenarios that your three-server rack will never encounter. That doesn't excuse outages, but it does mean comparing their uptime to your on-prem environment isn't apples-to-apples.

Here's my point: using recent cloud outages as the primary reason to avoid cloud migration is like refusing to fly because you heard about a plane crash while ignoring the daily car accidents on your commute. The question isn't "is the cloud perfect?" - it's "what's the right infrastructure strategy for your specific situation?"

Why This Isn't a Simple Cloud vs. On-Prem Debate

The problem with how this conversation usually goes:

Someone proposes moving to the cloud. Someone else points to recent outages and says, "See, we should stay on-prem." Then it becomes a philosophical debate about control, reliability, and trust instead of a practical discussion about business needs.

This framing is broken for several reasons.

First, "the cloud" isn't one thing. AWS, Azure, Google Cloud, Oracle Cloud, and smaller regional providers - they all have different architectures, SLAs, availability zones, and failure modes. Saying "the cloud is unreliable" because AWS had an outage is like saying "cars are unreliable" because one manufacturer had a recall.

Second, on-prem isn't one thing either. Are we talking about a state-of-the-art data center with redundant everything, or a server closet that doubles as storage for old office furniture? The reliability of on-prem infrastructure varies wildly based on investment, expertise, and maintenance.

Third, this ignores hybrid approaches. Most mature organizations aren't choosing between "100% cloud" and "100% on-prem." They're running hybrid environments where different workloads live in different places based on their specific requirements.

Fourth, availability isn't the only factor. Cost, scalability, skill requirements, security, compliance, disaster recovery, and operational complexity all matter. Making infrastructure decisions based solely on uptime percentages misses most of what actually impacts your business.

The real question isn't "cloud or on-prem?" The real question is "where should each workload live to best balance reliability, cost, capabilities, and operational reality?"

The Real Questions Managers Should Ask

Instead of debating cloud philosophy, here are the practical questions that actually matter when making infrastructure decisions:

What's Your Blast Radius Tolerance?

The Question: When something goes wrong, how much of your business can you afford to be impacted at once?

Why This Matters: Cloud outages tend to be "big bang" events. When an AWS region has a major incident, many of the services you depend on in that region can go down together. On-prem failures are often more localized - one server, one storage array, one network segment.

The Trade-off: Cloud gives you geographic distribution and redundancy that's expensive to build on-prem. But it also means you're sharing fate with every other customer in that region. A typical on-prem failure has a smaller blast radius (one application fails) than a typical cloud failure (an entire region is affected), but the cloud gives you better tools for building multi-region redundancy.

Questions to Ask Your Team:

Can we architect for multi-region redundancy if needed?
What's our RTO (Recovery Time Objective) for different services?
Do we have the expertise to build and maintain geographic redundancy on-prem?
What's the actual business impact of different failure scenarios?

Real Talk: If you're running critical services in a single AWS region with no failover plan, you're not really using the cloud's redundancy capabilities - you're just paying someone else to host your single point of failure. The cloud gives you options for resilience, but you have to actually architect for them.
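
To put rough numbers on that, here is a minimal sketch of how a second region changes expected downtime. The 99.9% figure is an illustrative assumption, not any provider's published SLA, and the math assumes regions fail independently and that failover works instantly. Real outages sometimes violate both assumptions, so treat the result as a best case.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments is enough."""
    return 1 - (1 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical availability for a single-region deployment.
single_region = 0.999
two_regions = combined_availability(single_region, copies=2)

print(f"One region:  {downtime_hours_per_year(single_region):.2f} hours/year")
print(f"Two regions: {downtime_hours_per_year(two_regions):.4f} hours/year")
```

Under these assumptions, one region works out to roughly 8.8 hours of downtime a year and two regions to well under a minute. The gap between that ideal and what you actually get is your failover design, which is exactly the part you have to architect yourself.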

What's Your Team's Skillset?

The Honest Assessment: Look at your current team. What are they actually good at? What do they enjoy working on? What skills are you realistically going to be able to develop or hire for?

Cloud Skills Are Different: Managing cloud infrastructure requires different expertise than managing on-prem infrastructure. It's not just "networking but in AWS" - it's infrastructure as code, cloud-native architectures, understanding service-specific limitations, managing costs through automation, and constantly learning new services.

The On-Prem Reality: Maintaining on-prem infrastructure requires deep knowledge of physical hardware, data center operations, vendor relationships, capacity planning, and lifecycle management. If your team has spent 15 years building this expertise, throwing it away for the cloud might not make sense.

Questions to Ask:

Does my team want to learn cloud technologies, or are they interested in deepening on-prem expertise?
Can we attract and retain people with the skills we need for our chosen direction?
What's the learning curve cost in terms of mistakes, time, and team stress?
Do we have the capacity to manage both environments during a transition?

The Uncomfortable Truth: Sometimes the right technical decision (cloud migration) conflicts with your team's reality (people who are great at on-prem but don't want to become cloud engineers). You can't ignore the human factor in infrastructure decisions. I discuss this a bit in my "What I Look for When Hiring Network Engineers" post.

I've seen organizations force cloud migrations on teams that fought it every step of the way. The result? Half-implemented migrations, ongoing operational issues, and eventual team turnover. The "right" technical decision failed because it ignored team dynamics.

What's Your Budget Reality?

The Cloud Cost Myth: "Cloud is cheaper than on-prem" is marketing, not truth. Cloud can be cheaper for certain workloads and usage patterns. It can also be spectacularly more expensive.

Where Cloud Costs Less:

Variable workloads (spin up resources when needed, shut down when not)
Rapid scaling requirements (grow from 10 to 1000 servers quickly)
Avoiding capital expenditure (opex vs. capex accounting)
Eliminating data center overhead (power, cooling, physical security, staff)
Development and test environments (create and destroy frequently)

Where Cloud Costs More:

Steady-state workloads running 24/7 (paying a premium for flexibility you don't use)
High data egress (moving data out of the cloud is expensive)
Storage-intensive applications (cloud storage costs add up fast)
Lack of cost optimization (easy to overprovision and forget)
Complexity tax (managing cloud costs requires tools and expertise)
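
The steady-state versus variable distinction is easier to see with a break-even calculation. This is a simplified sketch: every number is hypothetical, the function names are mine, and it ignores egress, storage, reserved-instance discounts, and staff time, all of which can move the answer in either direction.

```python
HOURS_PER_MONTH = 730  # approximate average hours in a month


def cloud_monthly_cost(hourly_rate: float, hours_used: float) -> float:
    """On-demand cloud cost: you pay only for the hours you run."""
    return hourly_rate * hours_used


def onprem_monthly_cost(hardware_cost: float, lifetime_months: int,
                        monthly_overhead: float) -> float:
    """On-prem cost: hardware amortized over its life, plus power/cooling/etc."""
    return hardware_cost / lifetime_months + monthly_overhead


def break_even_hours(hourly_rate: float, onprem_monthly: float) -> float:
    """Hours per month at which cloud and on-prem cost the same."""
    return onprem_monthly / hourly_rate


# Hypothetical inputs for one server-sized workload.
rate = 0.50                                          # dollars per hour, on-demand
onprem = onprem_monthly_cost(9000, 36, 50)           # dollars per month

print(f"On-prem:          ${onprem:.2f}/month")
print(f"Cloud, 24/7:      ${cloud_monthly_cost(rate, HOURS_PER_MONTH):.2f}/month")
print(f"Cloud, 160 hours: ${cloud_monthly_cost(rate, 160):.2f}/month")
print(f"Break-even:       {break_even_hours(rate, onprem):.0f} hours/month")
```

With these made-up inputs, the workload that runs around the clock is cheaper on-prem, while the one that runs only during business hours is far cheaper in the cloud. Plug in your own quotes and utilization before drawing any conclusion.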

The Real Budget Question: Can you afford the upfront capital expense of on-prem infrastructure? Can you afford the ongoing operational cost of cloud? Can you afford the expertise required for whichever direction you choose?

What Managers Often Miss: Cloud shifts spending from lumpy capital expenses to steady operational expenses. This impacts budgeting, forecasting, and how you're measured financially. Make sure you understand your organization's financial preferences before deciding. Check out my "Making the Business Case for Network Modernization" post for more on this.

What's Your Business Criticality?

Not Everything Matters Equally: Your email server and your revenue-generating e-commerce platform have different criticality levels. Your internal wiki and your customer-facing application have different uptime requirements.

The Criticality Matrix:

High Criticality + High Change Rate: Consider cloud. You need both reliability and agility. Cloud gives you tools for rapid iteration with built-in redundancy options.

High Criticality + Low Change Rate: Consider on-prem or colo. These stable, critical workloads are where on-prem shines - predictable performance, full control, and potentially lower cost.

Low Criticality + High Change Rate: Cloud is probably better. These are your experimental, development, or internal tools that benefit from flexibility without needing ultimate reliability.

Low Criticality + Low Change Rate: Honestly? Either works. Choose based on operational convenience and cost.
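
If you're working through a long application list, the matrix above can be written down as a simple lookup. This is only a sketch of the four quadrants as described here. The function name is hypothetical, and reducing criticality and change rate to yes/no is a deliberate simplification; the output is a starting point for discussion, not a decision.

```python
def placement_hint(high_criticality: bool, high_change_rate: bool) -> str:
    """Return the starting-point recommendation for one quadrant of the matrix."""
    matrix = {
        (True, True): "consider cloud",
        (True, False): "consider on-prem or colo",
        (False, True): "cloud is probably better",
        (False, False): "either works; choose on convenience and cost",
    }
    return matrix[(high_criticality, high_change_rate)]


# Hypothetical examples: (application, high criticality?, high change rate?)
applications = [
    ("e-commerce platform", True, True),
    ("core ERP", True, False),
    ("internal prototype", False, True),
    ("internal wiki", False, False),
]

for name, critical, changing in applications:
    print(f"{name}: {placement_hint(critical, changing)}")
```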

The Compliance Factor: Some industries have regulatory requirements that practically mandate on-prem or private cloud. Healthcare, finance, government - these often have constraints that override purely technical considerations.

Questions to Ask:

What's the actual business impact of downtime for each application?
Which applications are stable vs. rapidly evolving?
What compliance or regulatory requirements apply?
Where do we need to innovate quickly vs. maintain stability?

Decision Framework: When Cloud, When On-Prem, When Hybrid

After working through those questions, here's a practical framework for making infrastructure decisions:

Choose Cloud When:

You need rapid scalability. Your business is growing fast or has unpredictable traffic patterns. Building that scalability on-prem is expensive and slow.

You want to focus on applications, not infrastructure. You'd rather have your team building features than maintaining servers. Cloud lets you offload infrastructure management (at a cost premium).

You're building new, cloud-native applications. Starting fresh? Cloud-native architecture gives you access to managed services, serverless options, and modern development practices.

You need geographic distribution. You have global customers and want low-latency access everywhere. Cloud makes multi-region deployment feasible for normal-sized companies.

You have variable workloads. Development environments, seasonal traffic, batch processing - workloads that aren't 24/7 can see real cost savings in the cloud.

Choose On-Prem When:

You have stable, predictable workloads. If your applications run at consistent capacity 24/7 for years, on-prem economics often win over time.

You have existing infrastructure investment. If you just bought new servers and storage, sweating those assets might make more financial sense than cloud migration.

Your team has deep on-prem expertise. Your team knows on-prem inside and out, likes working on it, and you can attract talent with those skills.

You have specific compliance requirements. Some regulatory environments effectively require on-prem or private cloud infrastructure.

You have high data volumes with low egress needs. If you're storing and processing massive amounts of data internally, cloud storage costs can be prohibitive.

Choose Hybrid When:

You're in transition. Most cloud migrations take years. Hybrid is often the reality during migration, not a permanent strategy.

Different workloads have different needs. Your stable ERP system might live on-prem while your customer-facing web apps run in the cloud.

You need disaster recovery options. On-prem primary with cloud backup (or vice versa) gives you geographic redundancy without full migration.

You want flexibility. Hybrid lets you move workloads between environments based on cost, performance, or changing requirements.

The Warning: Hybrid sounds great in theory, but is operationally complex. You're maintaining expertise in multiple environments, managing connectivity between them, handling security across boundaries, and dealing with the worst of both worlds operationally. Don't choose hybrid as a way to avoid making decisions - choose it when it genuinely serves specific workloads better.

What I'm Doing in My Organization

Here's where we are:

We're currently in the evaluation phase of cloud migration. Not "should we move to the cloud?" but "what should move to the cloud, when, and why?"

The Process:

First, we inventoried everything. Every application, every server, every service. For each one, we documented:

Business criticality
Current resource utilization
Dependencies on other systems
Compliance or security requirements
Change frequency
Team expertise required
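
For anyone who wants to keep this inventory in something more structured than a spreadsheet, here is a minimal sketch of one record per workload. The field names mirror the list above; the class name, the sample entries, and the 20% utilization threshold are all assumptions for illustration, not what we actually use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workload:
    name: str
    business_criticality: str            # e.g. "high" or "low"
    avg_utilization_pct: float           # current resource utilization
    dependencies: List[str] = field(default_factory=list)
    compliance_requirements: List[str] = field(default_factory=list)
    change_frequency: str = "low"        # e.g. "high" or "low"
    team_expertise_required: List[str] = field(default_factory=list)


def underutilized(workloads: List[Workload], threshold_pct: float = 20.0) -> List[str]:
    """Names of workloads whose hardware sits mostly idle."""
    return [w.name for w in workloads if w.avg_utilization_pct < threshold_pct]


# Hypothetical sample entries.
inventory = [
    Workload("core-erp", "high", 65.0, dependencies=["database"],
             team_expertise_required=["ERP administration"]),
    Workload("dev-test-cluster", "low", 8.0, change_frequency="high"),
]

print(underutilized(inventory))
```

Even a list this simple makes it easy to ask questions such as "which workloads are mostly idle?" before anyone starts arguing about platforms.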

Second, we assessed our team. We took an honest look at our current skillsets, learning appetite, and capacity for change. Some people are excited about the cloud. Some are deeply skeptical. Some just want stability regardless of where the workloads live. This type of assessment can be tough. Check out my "5 Things I Wish I Knew Before Becoming a Manager" post for more context on this.

Third, we're building business cases for specific workloads. Not "move everything to cloud" but "move this development environment to AWS because X, Y, Z." Each workload gets evaluated on its own merits.

What We're Finding:

Some things are clear cloud candidates:

Development and test environments (we're wasting money on idle hardware)
New customer-facing applications (cloud-native from the start)
Disaster recovery (cloud backup is cheaper than maintaining a second data center)

Some things are staying on-prem for now:

Core ERP system (stable, predictable, and we just refreshed the hardware)
High-volume data processing (egress costs would kill us)
Applications with specific compliance requirements our team understands in our current environment

The Likely Outcome:

We're headed toward hybrid. Not because it's trendy, but because different workloads genuinely have different optimal homes. Our stable, predictable core infrastructure stays on-prem. Our variable, customer-facing, and development workloads move to the cloud.

The Timeline:

This is a multi-year process. We're not rushing it. We're learning cloud technologies through small, non-critical migrations first. Building expertise. Making mistakes on things that don't matter much. Then we'll tackle bigger migrations when we actually know what we're doing. Learning patience and the nuances of long-term projects is something I touch on in my "From Network Engineer to Network Engineering Manager" post.

What I'm Learning:

The AWS and Azure outages didn't change our strategy because our strategy isn't based on "cloud never fails." It's based on "which workloads benefit from cloud capabilities, and how do we architect for resilience regardless of where things run?"

If we put something in AWS, we're building it to survive a region failure. If we keep something on-prem, we're building it with redundancy in our infrastructure. The question isn't "which platform is more reliable?" It's "what level of reliability do we need, and how do we architect for it?"

The Bottom Line

Should you move to the cloud after recent AWS and Azure outages?

The real answer: It depends on your specific workloads, team capabilities, budget reality, and business requirements. Recent outages are data points, not decisions.

What doesn't help: Using cloud outages as emotional ammunition in infrastructure religious wars. "See, cloud is bad!" or "On-prem would have failed too!" Neither helps you make good decisions.

What does help:

Understanding your blast radius tolerance for different workloads
Honestly assessing your team's capabilities and learning capacity
Calculating real costs for your actual usage patterns
Prioritizing applications by business criticality and change requirements
Making workload-by-workload decisions instead of all-or-nothing commitments

For Fellow Managers:

Don't let recent headlines pressure you into hasty decisions in either direction. Cloud outages don't mean you should cancel migration plans. They also don't mean you should ignore reliability concerns.

Your job is to make infrastructure decisions based on your organization's specific reality, not on what works for companies with completely different requirements, resources, and constraints.

And honestly? Most of us are going to end up in hybrid environments, whether we plan for it or not. The question is whether we get there through intentional strategy or through chaos.

I'm still figuring this out myself. Four months into management, I'm learning that infrastructure decisions are less about technology and more about understanding business needs, team capabilities, and organizational readiness for change.

What's your experience with cloud migration decisions? Are you moving forward, staying put, or somewhere in between? I'd love to hear what's working (or not working) for others navigating this.

Should You Move to the Cloud? A Manager's Perspective After the AWS and Azure Outages
