Rethinking Resilience: Three Steps To Save Millions Of Dollars

ericwoodell
Jan 23, 2024

Updated: Apr 4, 2024

As an executive, have you ever asked yourself why the quest for resiliency has turned into a redundancy “arms race,” in which organizations build layer upon layer of failover capability in the form of additional data centers, whether owned or rented via the Data Centers as a Service (DCaaS) market? Are you wasting millions of dollars on it?

Wouldn’t it be amazing if you didn’t need to spend all that money and still got your resilience? Do you want to know how, through improved maintenance practices, organizations can achieve a more sustainable balance that not only reduces complexity but also enhances overall resilience?

If your answer is yes, read every word in this article and be amazed by how simple the solution really is.

In this article I will examine the crux of IT resiliency and demonstrate that a lack of proper maintenance is THE significant driver behind the unnecessary complexity and decreased sustainability in data center operations.

The Strategy of Over-Redundancy of Data Centers as the Catch-All Solution To Resiliency Is Misguided

The current landscape of enterprise IT uses the principle of redundancy as a key strategy to achieve system resiliency.

The theoretical logic is straightforward: by duplicating critical components, systems or even entire sites, an organization can ensure that if one part fails, another can take over and maintain service continuity.
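As a rough illustration of that logic, the sketch below computes the chance that at least one of several redundant sites is available. It assumes site failures are completely independent, which real-world cascade failures often violate, and the 99% per-site figure is a hypothetical input, not a measurement from any facility.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumption: failures are independent. Shared causes (common power
    feeds, shared software, poor maintenance everywhere) break this.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # All sites must be down at once for service to be lost.
    return 1.0 - (1.0 - site_availability) ** sites


if __name__ == "__main__":
    # Hypothetical per-site availability of 99%.
    for n in (1, 2, 3, 4):
        print(f"{n} site(s): {combined_availability(0.99, n):.8f}")
```

On paper, a second site takes 99% to 99.99% and a third to 99.9999%, so each added site buys a smaller absolute improvement than the last, and only as long as the independence assumption holds.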

Airbus A380

Imagine being a passenger on an Airbus A380. It is designed to fly safely even if one of its four engines fails. This is possible due to the aircraft's design, which allows the remaining engines to compensate for the loss of one, ensuring the plane can bring you safely to your destination.

In the same way, IT resiliency is about creating systems that can continue to operate even when parts fail. Just as your A380 doesn’t need all four engines to fly, a resilient IT system doesn’t need every component to be functional at all times to maintain service. In both scenarios engineers recognize that failures will occur, and so they include in the design the capability to endure failures without catastrophic outcomes.

Now think about it: the majority of organizations, probably including yours, use redundancy as a quick fix to mitigate the risks associated with system or building failures.

In many cases, building multiple redundant data centers becomes the de facto approach because it offers a tangible form of risk mitigation that can be easily understood and communicated to stakeholders.

It's an old and proven strategy.

And guess what? When the redundancy strategy goes too far, driven by the imperative to keep systems running at all costs, especially given the high stakes of downtime in terms of lost revenue, customer trust, and market reputation, that strategy ends up causing more problems than it solves.

When Redundancy Goes TOO Far...

We all know that with every new data center comes a significant environmental and economic cost. Data centers use massive amounts of electrical power and water; their exponential growth is unsustainable from a resource consumption standpoint. Also, from a carbon footprint perspective, the continual power usage for maintaining operational readiness, the embodied carbon in the manufacturing and maintenance of the equipment, and the eventual disposal of electronic waste are critical factors. The energy sources powering these sites are also a major consideration; reliance on non-renewable energy sources significantly increases the carbon footprint.

And now, let’s be honest: the biggest driver pushing enterprise IT organizations to overbuild is the fear that a data center, or several data centers within a geographic region, will fail. This fear is driving the trend among those organizations toward excessive redundancy, a “more is better” approach, if you will.

But WHY would we add more engines to an airplane than it is designed to handle? Each new engine (or in the case of IT, each new redundant system) only adds more complexity. With IT, this means additional data centers, more failover mechanisms and more complex disaster recovery plans.

Thus our biggest reason to rethink resilience is “The Paradox of Complexity”

Excessive complexity itself becomes a cause of failures. Put another way, when systems are overly intricate, it becomes increasingly challenging to predict how they will behave in unforeseen situations. Troubleshooting and debugging become more complex, potentially leading to prolonged downtime. So while the intention behind complexity is to enhance resilience, it can, if unchecked, result in fragility.

How many examples of this very thing have you seen recently, such as failures in banking transaction systems, outages at cloud providers, or the nationwide telecommunications outage in Australia?

WHY do you think these failures keep occurring, despite increasing levels of redundancy?

Could it be the Complexity Paradox in action?

If Redundancy (and Complexity) Is Not THE Solution, What Is?

Let’s Weigh Together the Need For New Fail-Over Sites Against Proper Maintenance Of Existing Sites.

We all know that, just as changing the oil is crucial for keeping a car running, maintenance operations play a critical role in data center reliability. Yet maintenance is often overshadowed by the appeal of building spiffy new facilities.

And to state it plainly, enterprise IT management typically views critical facilities maintenance as, at best, an annoyance to be tolerated or, at worst, something to be scheduled only when convenient. It is the “if it ain’t broke, don’t fix it” mentality.

But I can’t overstate this key point: preventative maintenance is THE cornerstone of operational efficiency and longevity.

Preventative maintenance ensures that the current infrastructure operates at peak efficiency, addresses potential issues before they escalate into system-wide failures, and keeps hardware and software up to date with the latest security measures.

You want to change the tire on your car before it blows on the road, right?

Data Center Design Vs. Maintenance

Of course, data center design is so much more exciting and intriguing (just like buying a fancy new car) that it receives almost exclusive attention in this arena. However, data center design isn’t particularly challenging for experienced people in the industry. There are a limited number of technological approaches to the problems of building a highly reliable and efficient data center, and a limited number of vendors that can provide the equipment. The principles and guidelines for design are widely known and disseminated by organizations such as the Uptime Institute.

But think about it: we’re focusing on only one side of the coin and utterly ignoring the other.

Imagine being an airline passenger again. You probably know that Airbus and Boeing are in the business of designing, building, and selling the most efficient commercial aircraft in the world. And while they issue maintenance recommendations to the owners of those aircraft, they are not in the business of maintaining them. That’s up to the owners.

Those airplanes depend on superb design, but for them to operate safely and reliably throughout their service lives, they must be maintained to an equally superb standard (monitored via government oversight). You definitely don’t want to fly on an airplane that has substandard maintenance, unless for some reason you want to end your life prematurely.

Here is a small secret: airplanes and data centers are built using similar philosophies. Anything man-made can, and will, fail at some point. Both are designed to accommodate any single worst-case failure and still operate to a successful conclusion: reaching the destination, for an airplane, or maintaining service levels, for a data center.

The big secret is that while excellent design will accommodate a single failure, excellent maintenance prevents a single failure from turning into a CASCADE failure.

HERE ARE THE THREE STEPS TO SAVE MILLIONS OF DOLLARS

Step 1 - Maintenance: The Proactive Path to Resiliency

A proactive approach to resiliency emphasizes rigorous maintenance regimes that keep existing facilities in prime condition. This includes:

Scheduled Inspections: Regular inspections can catch issues that, if left unaddressed, could lead to significant downtime.
Predictive Maintenance: Leveraging data analytics and AI to predict when systems might fail, allowing for preemptive intervention.
Component Upgrades: Keeping technology up to date is not just a matter of capacity but also of stability and security.
Disaster Recovery Drills: Regular testing of failover mechanisms ensures that when a switch to backup is necessary, it happens seamlessly.
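To make the predictive maintenance item above more concrete, here is a minimal sketch that fits a straight-line trend to daily sensor readings and estimates how many days remain before an alarm threshold is reached. It is plain trend extrapolation, not AI, and real predictive maintenance tools use far richer models. The function name, the readings, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def days_until_threshold(readings: Sequence[float],
                         threshold: float) -> Optional[float]:
    """Estimate days until a rising reading reaches `threshold`.

    Assumes one reading per day, oldest first. Fits a least-squares
    line and extrapolates it. Returns None when the trend is flat or
    falling, and 0.0 when the fitted line has already crossed.
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")

    # Ordinary least-squares fit of reading against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    slope = sxy / sxx
    if slope <= 0:
        return None  # not trending toward the threshold

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Count from the most recent reading (day index n - 1).
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    # Hypothetical daily temperature readings (degrees C) from one unit.
    temps = [41.0, 41.5, 42.2, 42.8, 43.5]
    print(days_until_threshold(temps, threshold=50.0))
```

A result of a few weeks gives the maintenance team time to schedule an inspection before the component becomes the first link in a cascade failure.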

Step 2 - Audit Programs: The Key To Verifying Maintenance Is Being Performed

Whether your company owns your data centers, relies on Data Centers as a Service (DCaaS), or uses a mix, it is absolutely vital to perform periodic audits of the critical facilities maintenance program.

Understand, whether you own or lease data centers, trusting a service provider to police themselves exposes your company to improper maintenance practices, unrecognized risks, and a higher risk of unplanned outages.

Equally important to understand is that trusting your DCaaS vendor to properly maintain its sites, on the strength of a SOC-2 certification alone, is unwise. To be blunt, a SOC-2 certification isn’t worth the paper it’s not written on. Seriously.

Consider: as a passenger on that A380 airliner, would you bet your life on the certification from an accountant, of all things, that the airplane has been properly maintained? I know I wouldn’t!

The truth is that maintenance audits are only effective when they are performed by highly experienced professionals who have decades of experience with the equipment and records being audited. And those audits can easily save millions of dollars that would otherwise be wasted on additional data centers.

It's a beautiful benefit: having a detailed maintenance audit program to ensure availability will reduce unplanned outages.

At this point, you’re probably thinking I’m talking nonsense. HOW can I say such a thing?

Simple: I'm speaking from experience! I performed over 800 audits globally over a period of six years, with the overall result that there wasn’t a single unplanned outage in the portfolio while under my audit program. Not one, anywhere in the world.

What was my personal secret to such a stunning performance?

When you have decades of experience in your respective field, not only do you get very good at what you do, you also gain a unique ability to recognize hidden trends and issues, and translate those issues into meaningful results on a holistic level.

Many times I’ve found defects (single failures) that the local facilities management had failed to recognize, hadn’t adequately addressed to remove operational risk, or had intentionally tried to hide. Exposing those defects, and the underlying risk levels they represented, to the stakeholders allowed for rapid remediation. This rapid detection of single failures prevented them from “hiding” and later becoming a link in the chain that results in a cascade failure.

Put another way, every “weak link” in the chain was quickly detected, repaired or replaced, making the chain strong again.

I will cut to the chase: If you rely on Data Center providers, whether internal OR external, it’s vital you have an unbiased auditor to monitor the preventative maintenance process with ongoing, quarterly audits.

An independent audit program is the only vehicle to protect your interests regarding availability. Any defects are naturally the responsibility of the vendor to mitigate, meaning no additional costs are incurred by the IT client. The program has the key knock-on effect of forcing those DCaaS vendors with substandard maintenance to improve their operations, becoming better in every way. It also gives you an ongoing pulse of the vendors’ operations and keeps decision-makers informed of all risk elements. The best part for vendors is that their personnel come to appreciate the results. Indeed, many a DCaaS manager told me they were at first annoyed with my audits, but after they’d improved their processes, they would “breeze through” every other audit that came their way. In other words, they got a LOT better at their jobs.

Step 3 - Realign Priorities To Put Maintenance Before Expansion

It is important for IT managers to understand that they should realign their priorities, putting maintenance before expansion. A well-maintained data center can offer robustness comparable to multiple failover sites, often at a fraction of the cost and complexity.

This shift not only ensures better allocation of financial resources but also encourages a "mission-critical culture" of maintenance that inherently reduces the necessity for failover occurrences.

IT managers must also adopt the “trust but verify” approach with regard to the maintenance effectiveness of the DCaaS companies they hire.

Here’s another secret: only about 25% of the SOC-2 certified DCaaS vendors I have audited actually “deliver the goods” consistently, quarter over quarter, year over year. If you don’t have an audit program in place, you’re running on blind faith.

By verifying that your sites are being properly maintained, you significantly reduce the risk of a site outage, which in turn reduces the need for additional failover sites and improves both resiliency and sustainability.

Conclusion

For IT managers, the key to achieving true resilience in data center operations lies not in constructing a web of failover sites but in championing the cause of maintenance. It is through meticulous maintenance that we can preempt the need for constant failover readiness, reduce the environmental impact of our operations, and allocate our resources more efficiently.

For CEOs and CFOs, prioritizing maintenance to optimize existing assets, ensure security, and strengthen the backbone of your IT infrastructure is not just a technical decision; it's a strategic imperative for sustainable growth and resilience, and a tremendous saving for you in the long run.
