Infrastructure Topology Is Just ONE Side of the Coin
Words mean things: they express concepts and ideas, and describe the physical realities around us. In the mission-critical facilities discipline, we use industry-accepted conventions for infrastructure architecture, so that while the subtleties of a particular site may be confidential, its overall performance expectations are easily understood. To that end, the Uptime Institute Tier Standard: Topology is a superb means of defining data center infrastructure.
A close examination of that standard makes it abundantly clear that a Tier-III or Tier-IV data center should never suffer an unplanned outage due to power or cooling, because every single component has a fully redundant backup. Thus, a single failure anywhere in the facility will never compromise the operation of the facility as a whole. Put another way, there are no single points of failure that can cause the complete loss of power or cooling.
As in a jetliner, multiple levels of redundancy mean that the failure of a single component in any system, no matter how critical, doesn't compromise the entire system. The failure of an engine on a jetliner doesn't doom the plane; it can still land safely.
It's the same for Tier-III and Tier-IV data centers. They're designed with the knowledge that anything man-made will fail sooner or later, and the design accounts for that. Hence the Uptime Tier standards, and the hundreds of construction companies that are fantastic at meeting those Uptime criteria. This area of the critical facilities arena is well covered, and the results of their efforts have been nothing short of amazing.
And yet, according to the 2023 Uptime Institute outage analysis, outages keep happening! Roughly 60% of data center operators say they've had an outage in the past three years, with 44% of outages caused by power system failures and 13% by cooling system failures.
If power and cooling system failures account for 57% of data center outages, despite those systems being fully redundant, how in the world is this happening?
There Are TWO Mandatory Requirements for Data Center Operational Excellence
As mentioned above, the first key aspect of having a critical facility that will support completely reliable IT operations for decades is a fully redundant architecture. Tier-III facilities have this as their standard; Tier-IV expands upon it with “fault tolerance,” the ability to adapt to changing conditions and preclude cascade failures. But Tier-III is a far more economical approach, and should deliver perfect performance.
The fact that they're obviously not delivering perfect performance is due to the second key aspect of reliability in data centers: meticulous maintenance.
IF a data center is properly maintained 100% of the time, then a Tier-III facility can suffer any single failure and ride through until the failure has been mitigated. However, if the data center is not being properly maintained, then every failure of any given component carries the very real risk of becoming a cascade failure.
Let me say this again: Superior infrastructure topology negates data center outages due to a single failure. Superior MAINTENANCE prevents single failures from turning into cascade failures and outages. They are two sides of the SAME coin.
I wrote an article in 2016 titled “What IS ‘The Secret’ To Running Data Centers?”, in which I poked holes in the claims of various technology vendors that their offerings were THE KEY to attaining perfect availability. I then pointed out that, as in the aviation industry, everything has to work, and described, point by point, what it takes to have an effective mission-critical facility that delivers >99.999% reliability for power and cooling:
Manpower is the first element to consider; do you have enough people to perform the work that needs to be accomplished on a daily basis? Do they escort vendors for service calls, which will take them away from their normal duties?
Training: Have the staff received proper training on site-specific gear, and has it been documented? Are shift personnel qualified for specific shift operations and maintenance functions?
Financial management: are OPEX and CAPEX budgets of sufficient magnitude to fund critical-facilities projects and normal operations? Are they separate from other budgets? [As one friend pointed out, “Data Center managers are squeezed to save every penny that they get so what suffers, the infrastructure.” And I would add, the people as well…]
Reference library: do you have one, with as-built drawings, operations and maintenance documents, warranty, commissioning, and automation sequences of operation? How about studies for infrastructure systems, soil, water, and structural elements? Are the master copies being kept in a safe, centralized location? Is there a system for managing floor space, power, and cooling? Are these aspects being monitored, and is a process in place to forecast future growth?
Organization: Does the staff have an established reporting chain, including a call-out list during emergencies? Do the staff have job descriptions available, which state their duties AND expectations? Are roles and responsibilities clearly delineated? Are key roles clearly established with their own duties and responsibilities? Is there a succession plan in place?
Maintenance: Is there an effective preventive maintenance (PM) program in place, including detailed procedures that are clearly written for technicians to follow? Is there a QA system to validate that PMs are performed as desired? Is there an effective maintenance management system in place, where hours and equipment history are tracked? Is there a detailed inventory of spare parts on hand, and a list of suppliers and vendors for when un-stocked spares are needed rapidly? Is there a life-cycle planning system in place? How about a failure analysis program, and tracking of deferred maintenance? (A simple illustration of tracking deferred maintenance follows this list.)
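To make that last point concrete, here is a minimal, hypothetical sketch of the idea: given an export of scheduled PM tasks from a CMMS, flag anything past due with no completion recorded, so deferred maintenance stays visible instead of silently accumulating. The field names, asset IDs, and dates are illustrative assumptions, not the format of any particular CMMS.

```python
# Hypothetical sketch only: flag overdue preventive-maintenance (PM) tasks from a
# CMMS export so deferred maintenance is visible. Field names and dates are
# illustrative assumptions, not any specific CMMS format.
from datetime import date

pm_schedule = [
    {"asset": "UPS-2A",  "task": "Annual PM",      "due": date(2024, 3, 1),  "completed": date(2024, 3, 5)},
    {"asset": "GEN-1",   "task": "Load bank test", "due": date(2024, 4, 15), "completed": None},
    {"asset": "CRAH-07", "task": "Quarterly PM",   "due": date(2024, 6, 30), "completed": None},
]

def overdue(tasks, as_of):
    """Return tasks past their due date with no completion recorded."""
    return [t for t in tasks if t["completed"] is None and t["due"] < as_of]

for t in overdue(pm_schedule, as_of=date(2024, 7, 1)):
    print(f'DEFERRED: {t["asset"]} {t["task"]} (due {t["due"]})')
```

In a real program, a check like this would run against the live CMMS and feed the deferred-maintenance tracking and failure-analysis reviews mentioned above.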
What’s REALLY Going On…?
Returning to the 2023 Uptime Institute outage analysis, let's look more closely at the failure rates, by component, within the power system:
The report contains an interesting aspect, single-corded IT device failures, which industry professionals will instantly recognize as a ‘smoking gun’ of human error. Pretty much every serious enterprise IT organization prohibits the installation of single-corded devices, precisely to preclude IT operational failures when a single asset fails.
But unless you’re an expert in the critical facilities management arena, you won’t notice the other ‘smoking gun’ that is hidden in plain sight:
Each of these systems (except for the single-corded IT device failures) is serviced by extremely well-trained technicians, almost always external vendors trained and licensed by the manufacturers of the equipment.
For example, Caterpillar generators have local vendors licensed and trained by Caterpillar. They are the very best at what they do; that's all they do. So the likelihood of their making mistakes is essentially zero.
It must also be pointed out that those vendors, if the equipment they service fails because of a maintenance oversight, will be liable for losses incurred due to negligence on their part. Thus, they're tremendously thorough in their maintenance operations. Their meticulousness assures their place as THE preferred vendors for mission-critical service in their region, both with customers and with the manufacturers they represent.
The notion that they make mistakes often enough for power systems to account for 44% of outages simply isn't plausible.
So what IS happening?
The vendor reports that I've examined over the past six years, which I conservatively estimate at more than 20,000 documents, are consistently detailed: the job plan, steps taken, parameters noted, testing results, observations on end-of-life component replacements (such as batteries or UPS capacitors), and, at the end, recommendations for mitigating conditions that compromise future reliability or operational readiness.
Of those >20,000 reports I have personally read, I can count on one hand the number of times a vendor report dutifully recorded parameters showing that an unusual condition was developing, yet failed to note in the summary that the condition needed to be addressed. Five or fewer reports out of more than 20,000 means the vendor failed to flag an unusual condition less than 0.025% of the time.
So if the vendors are this accurate in their reports, WHY do power systems account for 44% of outages?
The answer is the failure of critical facilities management to take timely corrective action whenever defects are discovered.
The Many Facets of Critical Facilities MIS-Management
There are a variety of ways that data center operators fail to heed vendor reports:
The vast majority of the time, the responsible manager who received the vendor report:
Doesn’t bother to read it, because they don’t understand that this IS a key part of the job. If you don’t read the reports, you’re not doing your job, simple as that.
Has the full technical experience to understand it, but simply fails to read it. While this may sound ridiculous, it happens a LOT. I can't tell you how many times I've been in an audit and discovered extremely serious defects the responsible manager simply didn't know about, because they didn't bother to read the report, or breezed through it and missed important technical clues that problems were present.
Doesn't read it, because they lack the technical knowledge to understand what it says. I've seen many examples where a battery vendor flatly stated that the UPS batteries were in such bad shape that they were at risk of “thermal runaway.” When I questioned one particular manager as to why no action was taken, he asked, “What's thermal runaway?” I answered, “Your battery could spontaneously burst into flames and burn down the building.” His answer? “Oh…”
Even when the manager received and read the report, AND understood the ramifications of the defects detected, the most common failure mode of maintenance management is financial: repairs are deferred due to an inadequate budget.
This last element, I have seen dozens of times.
UPS Batteries Are the Primary Reason for Data Center Outages
Again, if you're not in the business, you would look at the graph above and say that UPS failure is a big part of downtime, but this is actually very deceptive. You see, failure of a UPS system itself is exceedingly rare. These systems have 50 years of design history behind them, and the manufacturers have gotten very good at what they do. So blaming the UPS units themselves is not accurate. You'll also notice that UPS BATTERIES are not listed here; they're invisible.
But UPS BATTERIES are the primary cause of power failures in data centers, for several reasons:
They have a limited lifespan, usually shorter than most data center operators expect,
As they reach end of life, replacing the first failing cells in piecemeal fashion becomes counterproductive; it degrades the new cells quickly, yet doesn't slow the degradation of the rest of the battery string.
Full replacement of battery strings is expensive, requires a high level of coordination, and carries inherent operational risk. It must be done very carefully.
So data center operators prefer to avoid replacing failing batteries until they absolutely have to, often leaving IT clients exposed to the risk of unplanned outages due to degraded UPS batteries. (A simple end-of-life trend check is sketched below.)
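For readers who want to see what "degrading batteries" looks like in the data, here is a minimal sketch of one common check: flagging VRLA jars whose internal resistance has drifted well above the string baseline, a typical sign of approaching end of life. The 25% threshold, the baseline value, and the readings are illustrative assumptions, not manufacturer limits; real programs follow the battery maker's criteria and standards such as IEEE 1188.

```python
# Illustrative sketch, not a vendor procedure: flag VRLA jars whose internal
# resistance has risen well above the string baseline, one common indicator of
# end of life. Threshold and readings are assumptions for illustration only.

def flag_aging_jars(readings_uohm, baseline_uohm, threshold=0.25):
    """Return jar IDs whose internal resistance exceeds the baseline by `threshold`."""
    return [jar for jar, r in readings_uohm.items()
            if (r - baseline_uohm) / baseline_uohm > threshold]

string_a = {"A-01": 3150, "A-02": 3180, "A-03": 4120, "A-04": 3240}  # micro-ohms
print(flag_aging_jars(string_a, baseline_uohm=3100))  # -> ['A-03']
```

Trending these readings quarter over quarter, rather than swapping individual jars as they fail, is what lets an operator plan a full string replacement before the string can no longer carry the load.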
Example situation:
A colocation company had VRLA batteries in one of its buildings, where the vendor reports and corresponding battery readings indicated imminent failure. The assumed life of the batteries was five years, but right at the four-year mark it became clear that the batteries were degrading to such a degree that they could NOT be relied upon to carry IT loads for more than a few minutes in the event of an unplanned utility outage (the expectation was 10 minutes of battery capacity).
The VP of the colocation company claimed the failure of a single VRLA battery would not take the entire battery string offline. (That’s EXACTLY how they fail, the vast majority of the time.)
The colocation company then stated that the batteries were not scheduled to be replaced for another year, and that no budget had been allocated for the replacement, so… can we leave you at risk until next year?
This happens ALL THE TIME… I kid you not.
UPS Capacitor Example
Another recent example came from reading a vendor's annual UPS PM report, which stated that all of the UPS filter capacitors on both the input and output sides needed to be replaced immediately, as they were at end of life. The local manager had spotted the recommendations and tried to get the replacements done, but was overruled by upper management more concerned about saving a few bucks than keeping their equipment in excellent condition. It should be noted that when UPS capacitors fail, there are no warning signs of imminent failure, and when failure does occur, the results are usually catastrophic: explosions, fire, smoke…? Yeah. And it'll happen when the UPSs are at their maximum loading, i.e., when you need them MOST.
In that example, the local manager specifically asked me to mark this down (which meant the site automatically failed the audit) so as to put pressure on his management to get the repairs done. He was trying to do the right thing, but was being overridden by upper management who were willing to put the IT clients at risk.
Other Power System Failures: Causes
As I explained with the UPS batteries, all of the other power systems are serviced by outside vendors, due to the deep expertise required to service the individual systems, the variables introduced by manufacturers, and the need to maintain factory warranties (which are voided if maintenance is not performed by a manufacturer-approved vendor).
So the other failures listed (generators, transfer switches, ABTs, STSs, etc.) all involve vendor-maintained systems, where the maintenance is almost always perfect. As with the “UPS failures,” the root cause ends up being not the equipment and not the vendor, but management failing (for whatever reason) to act, causing the outages.
How To Solve This?
The Uptime report clearly demonstrates that the efforts by enterprise IT organizations to move to the cloud have not delivered relief from outages; in fact, outages are creeping up. The reasons are exactly what I described above.
While moving to colocation facilities has financial benefits, there are other costs associated with such a move, as I describe in Colocation's Hidden Flaw: Lack of Agency. And the SOC-2 certification from any colocation provider, with regard to availability, is prima facie fraudulent; CPAs can no more audit data center engineering operations than they can audit brain surgery.
Outages will get worse, I assure you: the increasing strain on the national electric grid and soaring power demands are setting the stage for more frequent and more severe interruptions. The mismanagement of maintenance of the safety nets that support your equipment is going to become more obvious, and more expensive.
So the logical question now arises: HOW can the IT client of a colocation company make sure the infrastructure supporting its IT assets is properly maintained?
HOW do you make sure that the Tier-III or Tier-IV facility you've leased is actually maintaining its infrastructure, so that single-failure events don't turn into cascade failures?
The Amerruss LLC Audit Program IS THE ANSWER
We offer the only proven availability assurance audit program in the world, bar NONE.
We achieve incredible results by performing the following steps:
Establish open lines of communication between the colocation vendor and the IT client, so that critical facilities maintenance can be discussed openly and transparently.
Perform a deep-dive analysis of all technical assets supporting the IT client, whether the client occupies a small cage or leases an entire building, to build a road map of which assets support the client's equipment. If the facility is new, this also includes reviewing all commissioning documentation to be absolutely certain that there are no hidden defects from construction.
Document the findings of step #2 in basic diagrams, to be included with the audit results, so clients can easily understand how any defects detected fit into the holistic picture.
Verify that the maintenance calendars match industry standards, and then monitor the colocation provider's adherence to those calendars. Any significant deviation requires investigation, and trends of deferment are treated as defects to be investigated.
Perform a deep dive into enough maintenance records for the critical facilities assets to achieve at least 95% confidence (as a practical matter, it's FAR better than 95%); a sample-size sketch follows this list. This includes reading CMMS reports and all vendor reports, and examining the recorded data to derive underlying trends of equipment health.
Investigate and document any defects noted. All defects that introduce real risk to the client are tracked to resolution.
Follow up on any defects from past audits that needed correction.
Report the results to the client, and to the colocation vendor, to maintain full transparency.
Repeat on a quarterly basis. Subsequent audits are much faster, and the costs are therefore nominal, since all core data is developed during the first audit cycle. This gives clients as close to real-time feedback as possible, keeping a pulse on ongoing maintenance operations and exposing any new risks VERY quickly.
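The sampling method behind the 95% figure isn't spelled out above, so here is a minimal sketch of one standard way to size such a record review: the textbook sample-size formula for estimating a proportion at roughly 95% confidence, with a finite-population correction. The population of 2,000 records and the 5% margin of error are assumptions for illustration only.

```python
# Hypothetical sketch: how many maintenance records to review for ~95% confidence.
# Standard proportion sample-size formula with a finite-population correction;
# the record count and margin of error below are illustrative assumptions.
import math

def sample_size(population, z=1.96, margin_of_error=0.05, p=0.5):
    """Records to review to estimate a proportion at ~95% confidence (z = 1.96)."""
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)  # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))    # finite-population correction

print(sample_size(2000))  # a site with ~2,000 PM and vendor records -> 323 to review
```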
It is important to understand that any defects found are contractually the responsibility of the colocation vendor to mitigate, as colocation contracts always stipulate that the equipment will be maintained in accordance with industry practices and manufacturer recommendations. Thus, all repair costs are the responsibility of the colo vendor, NOT the IT client.
The audit program I developed and operated over the past 6 1/2 years, and now offer to you, produced perfect uptime across a portfolio of more than 60 sites worldwide. Compared to the probabilities implied by the Uptime Institute 2023 outage report, in which 60% of respondents had suffered an outage in the previous three years, the likelihood of a portfolio like that going without a single outage was 2.04x10^-126. In other words, even measured against other critical facilities management professionals, my audit program delivered results that should be statistically impossible, yet this was the result.
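For readers curious about the shape of that calculation, here is a minimal sketch: if each site independently matches the survey baseline, the chance that every site avoids an outage is the per-site survival probability raised to the number of site-periods. The 60% figure comes from the survey cited above; the site count and number of three-year windows are illustrative assumptions, since the exact inputs behind the 2.04x10^-126 figure aren't spelled out here.

```python
# Illustrative sketch only: probability that an entire portfolio avoids outages,
# assuming each site independently matches the survey baseline (60% chance of an
# outage per three-year window). Site count and window count are assumptions.

def p_portfolio_no_outage(per_window_outage_prob, sites, windows):
    """Chance that every site survives every window with zero outages."""
    return (1.0 - per_window_outage_prob) ** (sites * windows)

# e.g. 60 sites observed for roughly 6.5 years, i.e. about 2.17 three-year windows
print(p_portfolio_no_outage(0.60, sites=60, windows=6.5 / 3))
```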
With our audit program, the need to continually expand your IT portfolio to more and more colocation facilities, a redundancy “arms race,” disappears, and you can save millions by having our audit program resolutely monitoring the equipment that keeps your company safe and secure.
You won’t lose sleep at night, wondering if your company is exposed to hidden availability risks; we keep a sharp eye on things for you.
With increasing risks to the electric grid, you can no longer rely on an ersatz certificate like the SOC-2; you need to know that the critical facilities components supporting your business are being properly maintained, so that WHEN utility interruptions occur, you'll sail through them without issue.
Reach out to Amerruss LLC today to initiate a tailored audit program for your data centers.
With our expertise and proven track record, we can transform your approach to data center management, ensuring resilience, efficiency, and, most importantly, uninterrupted service. The future of your IT infrastructure demands nothing less.
