
The 2024 Uptime Institute Annual Outage Analysis, and Why Data Centers Aren't as Reliable as You Think

Dive into "The 2024 Uptime Institute Annual Outage Analysis, and Why Data Centers Aren't as Reliable as You Think," an eye-opening exploration of the unsettling realities of data center reliability. This essay unpacks the latest findings from the Uptime Institute, revealing a concerning trend of increasing opacity in outage reporting, a slow decline in outage frequency against a backdrop of rising cyber incidents, and the climbing costs of outages.

With a staggering 74% of outages attributed to mission-critical infrastructure failures, the analysis highlights the stark discrepancy between expected and actual performance in Tier-III data centers, emphasizing the urgent need for a shift towards more rigorous maintenance and operational integrity. For C-suite executives in enterprise IT, this essay is a clarion call to reevaluate and enhance the resilience of their data centers in an era where traditional assurances no longer suffice.


The executive summary of the Uptime Institute Annual Outage Analysis for 2024 came out last Friday; it is available as a PDF and is also covered in a webinar.  The key findings are:


  • Outage information is increasingly opaque, unreliable

  • Outage frequency and severity are falling slowly, but cyber incidents are rising

  • Outage costs continue to rise slowly

  • Power causes the most impactful outages, but networking issues cause more outages overall

  • Four in five say the most recent impactful outage was preventable

In other words, no real surprises from Uptime, at least on the surface...

But when you carefully examine the report, the results are nothing short of stunning.


UPTIME NUMBERS

















Restating the above information, 18.33% of data centers have outages each year, or a rate of 1 in 5.5 per year, and the odds of a “severe” outage each year are 1 in 74.


Tier-III data centers (the standard in the industry) have a minimum requirement of 99.982% uptime per year, which means they have a 0.018% chance of downtime, or 0.00018 as a probability.  Let's calculate the odds using this probability:







The odds of an event are its probability divided by the probability that it does not occur: Odds = P / (1 - P). Here, the probability of the event (an unplanned outage) is P = 0.00018. We'll compute the odds of an unplanned outage occurring in any given year for a Tier-III data center.


Odds = 0.00018 / (1 - 0.00018) ≈ 0.000180032


To convert to “1 in X” format, we take the reciprocal of the odds:


1 / Odds = 1 / 0.000180032 ≈ 5,555


Therefore, the odds of a Tier-III data center having an unplanned outage in any given year are approximately "1 in 5,555," which translates to once every 5,555 years on average. This process highlights the high level of reliability and availability expected from Tier-III data centers.
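
For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the probability-to-odds conversion. The function and constant names are my own, chosen for illustration, and the sketch carries over this article's assumption that 0.00018 can be treated as the annual probability of an unplanned outage.

```python
def odds_from_probability(p: float) -> float:
    """Return the odds of an event with probability p, i.e. p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p)


def one_in_x(odds: float) -> float:
    """Express odds in '1 in X' form by taking the reciprocal."""
    if odds <= 0.0:
        raise ValueError("odds must be positive")
    return 1.0 / odds


# Assumption carried over from the article: 0.018% is treated as the
# annual probability of an unplanned outage at a Tier-III facility.
TIER_III_OUTAGE_PROBABILITY = 0.00018

tier_iii_odds = odds_from_probability(TIER_III_OUTAGE_PROBABILITY)
tier_iii_one_in = one_in_x(tier_iii_odds)

print(f"Odds: {tier_iii_odds:.9f}")
print(f"1 in {tier_iii_one_in:,.0f}")
```

Running it should reproduce the rounded "1 in 5,555" figure; swapping in a different probability shows how sensitive the "1 in X" number is to the uptime requirement.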


Now, let’s compare this to the actual results of (largely) Tier-III data centers, especially in the colocation space, of “1 in 5.5.”  


The actual results in the industry are about 1,010 times WORSE than they’re supposed to be for the given Tier rating (5,555 ÷ 5.5).

For “severe” outages, the actual results in the industry for Tier-III facilities are about 75 times WORSE than expected (5,555 ÷ 74).
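
The comparison can be checked the same way. The sketch below simply divides the expected "1 in X" figure by the observed ones. The inputs are the rounded figures quoted in this article, and the helper name `times_worse` is hypothetical.

```python
def times_worse(expected_one_in: float, actual_one_in: float) -> float:
    """Return how many times more frequent the actual rate is than the expected rate.

    Both arguments are the X in a '1 in X per year' figure, so a smaller X
    means a more frequent event.
    """
    if expected_one_in <= 0 or actual_one_in <= 0:
        raise ValueError("'1 in X' figures must be positive")
    return expected_one_in / actual_one_in


# Rounded figures quoted in the article.
EXPECTED_ONE_IN = 5555.0            # Tier-III expectation derived earlier
ACTUAL_ANY_OUTAGE_ONE_IN = 5.5      # 18.33% of data centers per year
ACTUAL_SEVERE_OUTAGE_ONE_IN = 74.0  # "severe" outages per year

any_outage_ratio = times_worse(EXPECTED_ONE_IN, ACTUAL_ANY_OUTAGE_ONE_IN)
severe_outage_ratio = times_worse(EXPECTED_ONE_IN, ACTUAL_SEVERE_OUTAGE_ONE_IN)

print(f"Any outage:    about {any_outage_ratio:,.0f} times worse than expected")
print(f"Severe outage: about {severe_outage_ratio:,.0f} times worse than expected")
```

Because the inputs are already rounded, the ratios are order-of-magnitude indicators rather than precise measurements.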



WHAT SYSTEMS ARE CAUSING THE FAILURES?


Now let’s look at where the outages are happening, in these facilities:















74% of outages are due to mission-critical infrastructure failures!


One does not need to be an expert in the field to quickly conclude that there’s a major issue in the critical facilities space.  These numbers are absolutely TERRIBLE.


Going back to a previous article, if the aviation industry performed as wretchedly as the data center industry, we’d see 370 commercial plane crashes per year, or one every day.  With such odds, would YOU fly commercially???


WHY DO THE CRITICAL FACILITIES KEEP FAILING?


I have written before that maintenance mismanagement, NOT human error, is the #1 cause of data center outages.   The latest report reinforces that conclusion.  The above image shows that 74% of outages are due to failures of critical facilities' infrastructure, but the breakdown of which systems fail really tells the tale:

















UPS FAILURES, or UPS BATTERY FAILURES?


Modern UPS systems are extremely reliable and have a long service life.  


UPS batteries are extremely reliable, but they do NOT have a long service life and require extremely diligent monitoring of condition.  To be blunt, THEY are the #1 cause of data center failures, as a specific system.  (Don’t you find it interesting that they’re NOT listed in the causes of outages?)  


UPS batteries have a relatively short life (theoretically 5 years, on average), and must be carefully monitored.   The individual battery cells in a string degrade unevenly; some fail within a year or two.  This matters because most batteries today are the VRLA type, and if a single cell fails, the entire battery string fails, similar to a weak link in a chain.  “Bad” cells MUST be swapped out as SOON as they are detected.  
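
The weak-link effect can be illustrated with a small Python sketch. It assumes that cells fail independently with the same probability over some period and that any single cell failure takes down the whole series string, which is a simplification of the behavior described above. The 2% per-cell figure and the 40-cell string are hypothetical values chosen only for illustration; they do not come from the Uptime report.

```python
def string_failure_probability(cell_failure_prob: float, cells_in_series: int) -> float:
    """Probability that a series battery string fails in a given period.

    Simplifying assumptions: cells fail independently, every cell has the
    same failure probability, and one failed cell fails the whole string.
    """
    if not 0.0 <= cell_failure_prob <= 1.0:
        raise ValueError("cell_failure_prob must be between 0 and 1")
    if cells_in_series < 1:
        raise ValueError("cells_in_series must be at least 1")
    # The string survives only if every single cell survives.
    string_survival_prob = (1.0 - cell_failure_prob) ** cells_in_series
    return 1.0 - string_survival_prob


# Hypothetical inputs, for illustration only.
HYPOTHETICAL_CELL_FAILURE_PROB = 0.02
HYPOTHETICAL_CELLS_IN_STRING = 40

example_string_risk = string_failure_probability(
    HYPOTHETICAL_CELL_FAILURE_PROB, HYPOTHETICAL_CELLS_IN_STRING
)
print(f"Chance the string fails in the period: {example_string_risk:.1%}")
```

Even a small per-cell failure chance compounds quickly across a long string, which is why individual cell monitoring and prompt replacement matter so much.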


THE OTHER CAUSES


The next four causes of data center outages are all systems that run extremely reliably for decades, provided they are properly maintained, periodically exercised, and promptly repaired when ANY defect is found.


Those of us who’ve been in the business a long time easily recognize how damning the above graph truly is…  these results can ONLY come from gross maintenance mismanagement, period.


WHAT INDUSTRIES ARE BEING HIT THE MOST?















The above graph seems simple on the surface but is probably the most important data out of the report.  Cloud and “internet giants,” digital services, and telecoms suffer the most, by FAR, accounting for 75% of “high profile outages” in the data center world.


Now here’s the question:  What do the cloud/internet giants, digital services, and telecom companies all have in common?


Take a minute, and let it run through your mind…  consider the overall digital world in a holistic way.


While you’re thinking about the answer, let’s bring up a few other points that most people don’t think about, regarding the colocation data center market.


  • Colocation data center failures significantly amplify business impacts within the digital world, primarily because a single colocation facility often hosts the critical infrastructure of numerous companies.  So when a colocation data center experiences an outage, it doesn't just disrupt operations for a single entity; instead, it cascades across all the businesses reliant on its services, magnifying the overall effect on the digital economy. 

  • Cloud companies and “internet giants” have their own networks of interlinked data centers, where loads are easily and seamlessly transferred from one building to another.  Their integrated systems are optimized for computing work among dozens of facilities, are very robust, and are extremely well-maintained.  BUT, these companies must go to colocation facilities for their pipelines back out to the internet, and the rest of the world.  Cloud companies are not in the business of organizing points of presence to the world or getting ISPs lined up (and all that entails); that’s the business of colo companies.

  • Telecom companies are in the business of…  Well, telecommunications.  So again, they rely on colocation companies for their business operations, peripheral services, internet presence, etc.  Telecom companies put most of their capital expenditures into infrastructure that supports their core business, which doesn’t include data center operations; it’s cheaper to outsource, same as with cloud companies. 

  • “Digital services” include streaming services, digital communications, business automation, social media, and online banking, for example.  As with the above examples, these are companies whose core business is digital products, not data centers.  Thus, they rely heavily on colocation providers, as an option that is perceived as both cost-effective and agile.


WHAT IS THE COMMON LINK?


It’s simple:


All three business sectors that suffer the most from outages are the ones that rely most heavily on colocation companies for their digital pipelines to the world, or for much of their enterprise IT service support.


To be blunt, the reason data center outages are occurring 1000X more often than they should is maintenance mismanagement by colocation companies.



Unfortunately, enterprise IT customers of colocation companies suffer a loss of agency when they go to colocation companies, surrendering all control of the management of critical facilities. 

So enterprise IT companies must trust SOC-2 certificates and SLAs offered by the colocation companies.  



And as I have clearly explained, colocation SLAs are worthless, because you will not receive compensation for damages resulting from colo companies failing to meet the terms of the SLAs.  This point is now starting to gain visibility; in the lower right corner of the image directly above it says “the data underlines the importance of third party agreements/the need to focus on SLAs.”  Uptime and other groups are now starting to recognize this as a real problem.


CONCLUSIONS:


Drawing insights from the comprehensive Uptime Institute Annual Outage Analysis for 2024, data center resilience continues to be challenged by a myriad of factors, notably including maintenance mismanagement and the rising tide of cyber incidents. 


The report provides a stark reminder that while incremental progress is being made, the industry is far from achieving the promised reliability and security.   


The Urgency for Enhanced Resilience: The data unequivocally points to a critical need for heightened focus on maintenance and operational integrity within data centers, especially colocation facilities.  Despite advancements, the gap between expected and actual outage rates (three orders of magnitude) underscores a pressing need for a paradigm shift in how we approach data center resilience.


The Role of Proactive Auditing: With 74% of outages attributable to mission-critical infrastructure failures, the significance of rigorous, independent audits cannot be overstated. These findings reiterate that a proactive stance on maintenance, underpinned by deep technical expertise, is indispensable in preempting and mitigating potential outages.


A Call for Accountability and Transparency: The Uptime analysis sheds light on the opaque nature of outage reporting and the limitations of current industry standards, such as SOC-2 certifications and SLAs, in providing tangible assurances of data center reliability. This underscores the need for a new standard of accountability, one that transcends traditional certifications and genuinely reflects the operational effectiveness of data centers.


Call to Action for the Amerruss Resilience Program:


In response to these insights, the Amerruss Resilience Program stands as a beacon of innovation and reliability, offering a solution that directly addresses the infrastructure vulnerabilities and gaps identified in the Uptime Institute's analysis. Our program is designed to elevate data center resilience by employing an unmatched combination of independent, thorough audits, and a comprehensive approach to maintenance management.


We invite C-suite executives and senior IT managers to engage with the Amerruss Resilience Program, to not only bridge the gap between expectation and reality but to set a new benchmark for operational excellence in data centers. Our commitment is to safeguard your IT infrastructure against the unforeseen, ensuring your operations remain resilient, efficient, and uninterrupted.


Embark on a Journey to Unparalleled Resilience:


Reach out to Amerruss today to explore how our tailored solutions can fortify your data center against the myriad challenges of the digital age. Let us transform your approach to data center resilience, ensuring your infrastructure is not just compliant, but exemplary. Don't leave your data center's reliability to chance. Partner with Amerruss and witness the transformation to operational excellence and unwavering reliability.





