Data Center Operations In 2024

Avoid outages by identifying deficiencies and mitigating risks.

By Ron Davis
From the June 2024 Issue

 

In today’s digital age, data centers serve as the backbone of modern technology infrastructure, supporting an ever-increasing volume of data generated and consumed by businesses and individuals worldwide. As organizations continue to expand their digital footprint and leverage data-driven technologies, the industry is facing immense opportunities as well as daunting challenges. How we address these challenges will significantly impact our industry’s future.

Growing Opportunities

As we slowly transition into a post-pandemic period, the word that best describes the data center industry is “growth.” Growth in demand drives growth in capacity, which in turn drives growth in budgets. Whether colocation or enterprise, most respondents to our recent Uptime Institute Capacity Trends and Cloud Survey 2023 report capacity increases, a metric that has been trending up year over year (Fig 1).

Data center size is growing as well, with massive “gigawatt” campuses being proposed to meet skyrocketing demand. Uptime has identified plans for 26 mega data centers, which, if built and operated at half of their projected capacity, would account for approximately 45 terawatt-hours (TWh) of energy use a year.¹
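To put the 45 TWh figure in perspective, the arithmetic can be run backwards into an average power draw. This is only a back-of-the-envelope sketch: the energy, site count, and half-capacity figures come from the paragraph above, while the constant year-round load and the 8,760-hour year are simplifying assumptions, and the function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year, load held constant all year


def implied_capacity(annual_energy_twh, utilization, site_count):
    """Convert annual energy into average draw and implied capacity."""
    # 1 TWh = 1,000 GWh, so dividing by hours gives average draw in GW.
    average_draw_gw = annual_energy_twh * 1000 / HOURS_PER_YEAR
    # The energy figure is quoted at partial utilization, so scale back up.
    total_capacity_gw = average_draw_gw / utilization
    per_site_mw = total_capacity_gw * 1000 / site_count
    return average_draw_gw, total_capacity_gw, per_site_mw


average_draw_gw, total_capacity_gw, per_site_mw = implied_capacity(
    annual_energy_twh=45.0, utilization=0.5, site_count=26
)
print(f"Average draw at half capacity: {average_draw_gw:.2f} GW")
print(f"Implied total capacity:        {total_capacity_gw:.2f} GW")
print(f"Implied average per site:      {per_site_mw:.0f} MW")
```

Under these assumptions, the 26 sites would draw a little over 5 GW on average and imply roughly 10 GW of total capacity. The per-site average is a rough indicator only, since the source does not say how capacity is split across campuses.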

Figure 1: Data Center Growth Rate (Graphics courtesy of Uptime Institute)

Growing Challenges

It is not all sunshine and rainbows, however, as “growth” also describes some less positive aspects of our industry. Costs are also growing. While recent data shows supply chain delays are beginning to decrease, most operators say equipment prices are rising, and those rising prices top the list of supply chain disruption factors with the most significant impact on organizations (Fig 2).

Figure 2: Supply Chain Disruption Factors

 

It is not just the cost of equipment that is on the rise. The recently released Uptime Institute Annual Outage Analysis 2024 reports that while overall outage frequency and severity are declining, the expanding global data center footprint is expected to increase the overall number of data center-related outages. Outages are expensive, as reported in Uptime Institute Annual Outage Analysis 2024: “More than half of the respondents (54%) to Uptime’s 2023 annual survey say their most recent significant, serious, or severe outage cost more than $100,000, with 16% saying that their most recent material outage cost more than $1 million” (Fig 3).

Figure 3

 

When we look at data center outages, we also need to review the source of the incident or outage. Outage incidents are multi-faceted, however, and often hard to attribute to a single cause or factor. The Uptime Institute Annual Survey of IT and Data Center Managers consistently and overwhelmingly points to power as the cause of outages and has done so for many years (Fig 4).

Figure 4: Primary Cause of Data Center Outage or Incident

Digging further, the human factor figures prominently in any root cause analysis. Uptime Institute’s position, as stated in the Uptime Institute Annual Outage Analysis 2024, is as follows: “Uptime Institute tends to analyze human error as a contributing factor rather than the sole or primary cause of outages. Drawing on over 30 years of data, Uptime estimates that human error, whether directly or indirectly, contributes to a significant majority, ranging from two-thirds to four-fifths, of all downtime incidents.” The Uptime Institute Data Center Resiliency Survey 2024 reveals the most common human factor culprits. As seen in Fig 5, the leading reported causes of human error relate to operating procedures, with “failing to follow procedure” and “incorrect staff processes / procedures” combining to make up 93% of all human error related incidents.

Figure 5: Most Common Causes of Major Human Error Related To Outages

The topic of human error makes for a nice segue, because another example of troubling “growth” in the industry is the number of operators struggling with staffing and skills shortages. Operators are reporting that they have trouble filling open positions, particularly those for junior/mid-level operations staff. As is always the case when the talent pool is low and demand is high, competition for those resources within the industry becomes intense, and staff retention becomes an issue (Fig 6).

Figure 6

 

A Programmatic Solution

As outlined above, our industry is growing in all manner of ways, good and bad. This kind of growth, combined with insufficiently addressed challenges, can be disastrous. In our industry, we are witnessing sites become bigger sites, bigger sites become campuses, campuses become gigawatt campuses, and gigawatt campuses become huge organizational portfolios. If left unaddressed, even relatively small site-level challenges can become portfolio-level nightmares that waste energy, squander resources, and cause outages that damage individuals, businesses, financial institutions, and governments. It is not overstating the situation to imagine a widespread outage in a large site or campus that could have societal, even global, impact.

An incident and/or outage is an effect, the cause of which is typically an ill-developed, poorly implemented, or improperly applied component of an organization’s Data Center Operations Program:

You cannot find a technician to fill that open night shift position?

  • How diverse is your recruitment pool?
  • Do you have a professional development program?

Site Chiller failure?

  • A good maintenance program may have prevented it, or possibly predicted it, and provided you with the mean time between failures.

A missed step during a routine Generator run?

  • Is there a fully scripted, carefully reviewed / approved procedure?
  • Was there an operational risk assessment that identified the critical action steps and possible failure modes of the procedure?
  • Are all operational personnel evaluated for the risk profile that is common according to their experience, background, and other factors? Have they been trained to understand those risks and how to avoid them?

Mistake during operational emergency response?

  • A well-developed training program would have provided the knowledge necessary to avoid it.
  • AND an available, professionally written, and rigorously drilled emergency operating procedure would have provided guidance, focus, and preparation.

Overdue maintenance for site UPS modules?

  • How is your vendor management program?
  • Have you performed staffing workload calculations? Have you applied the results?

Supply chain issues delaying a project?

  • Does your supplier program proactively seek secondary sources?
  • Have you looked into hiring independent logistics contractors to manage?

These are all examples of individual scenarios and/or incidents with proven programmatic solutions.
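The chiller scenario above mentions mean time between failures (MTBF), which a maintenance program can derive from its own repair records. The following is a minimal sketch, assuming a simple log of failure and restoration timestamps for one asset; the function name, log layout, and dates are hypothetical and stand in for whatever a site's maintenance management system actually records.

```python
from datetime import datetime


def mean_time_between_failures(window_start, window_end, repairs):
    """Return MTBF in hours: operating time divided by number of failures.

    repairs is a list of (failed_at, restored_at) pairs inside the window.
    Returns None when no failures were recorded, since MTBF is undefined.
    """
    if not repairs:
        return None
    window_hours = (window_end - window_start).total_seconds() / 3600
    downtime_hours = sum(
        (restored - failed).total_seconds() / 3600 for failed, restored in repairs
    )
    return (window_hours - downtime_hours) / len(repairs)


# Hypothetical one-year repair log for a single chiller.
chiller_repairs = [
    (datetime(2023, 3, 10, 8), datetime(2023, 3, 10, 20)),  # 12 hours down
    (datetime(2023, 9, 2, 0), datetime(2023, 9, 3, 0)),     # 24 hours down
]
mtbf_hours = mean_time_between_failures(
    datetime(2023, 1, 1), datetime(2024, 1, 1), chiller_repairs
)
print(f"Chiller MTBF: {mtbf_hours:.0f} hours")
```

Tracked per asset and reviewed over time, a falling MTBF is one of the simpler signals that a unit is heading toward failure before it actually fails.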

Let’s consider how the potential impact of deficiencies magnifies as an operating program’s scope of applicability increases. For example, think of a small site with informal and inadequate equipment maintenance management. For that site, the potential impact radius of a failed generator or UPS unit is a problem, but still relatively small. Now imagine a gigawatt campus, built out and commissioned over a period of just a few years, that suffers the same programmatic deficiency. That is a huge amount of installed infrastructure, all of which is roughly the same age and, due to poor maintenance management, is operating on borrowed time. What if the weakness is poor lifecycle management? Now that campus has a huge amount of installed infrastructure for which there is no plan for refurbishment, replacement, or improvement. What if it is just one of many campuses in a global organizational portfolio? The implications are immense and extremely negative.
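The same-age problem described above can be made concrete by counting how many assets come due for replacement in each year. This is an illustrative sketch only: the service life, install years, and asset counts are hypothetical values chosen for the example, not industry figures or data from any real site.

```python
from collections import Counter

# Hypothetical expected service life in years, per asset type.
SERVICE_LIFE_YEARS = {"ups": 15, "generator": 25}


def replacements_due(assets):
    """Map each calendar year to the number of assets reaching end of life.

    assets is a list of (asset_type, install_year) pairs.
    """
    return Counter(
        install_year + SERVICE_LIFE_YEARS[asset_type]
        for asset_type, install_year in assets
    )


# Small site: a few units installed years apart.
small_site = [("ups", 2010), ("ups", 2016), ("generator", 2012)]

# Large campus: 40 UPS units commissioned within three years.
campus = [("ups", 2022)] * 10 + [("ups", 2023)] * 20 + [("ups", 2024)] * 10

small_site_peak = max(replacements_due(small_site).values())
campus_peak = max(replacements_due(campus).values())
print(f"Small site, most replacements in one year: {small_site_peak}")
print(f"Campus, most replacements in one year:     {campus_peak}")
```

The small site never faces more than one replacement in a year, while the campus faces its entire UPS fleet within a three-year span. Without a lifecycle plan, that wave arrives unbudgeted and unstaffed.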

So, what is the answer? How do we address these challenges before they become organizational disasters? It is the same answer that it has always been, and it works just as well for the small data center as it does for the gigawatt campus or global organization: a carefully conceived, well-developed, effectively communicated, and strictly applied Data Center Operations Program, which is regularly measured for efficacy against an established standard. If the program is applied across a portfolio of data centers, each site should also be measured for consistency against the other sites in that portfolio.
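The two measurements described above, efficacy against a standard and consistency across a portfolio, can be sketched with a simple scorecard. Everything here is an assumption made for illustration: the component names, the 0-to-1 scores, and the 0.80 pass threshold are hypothetical and do not represent any published assessment standard.

```python
from statistics import pstdev

PASS_THRESHOLD = 0.80  # hypothetical minimum score for each program component

# Hypothetical assessment scores (0 to 1) per program component, per site.
portfolio = {
    "site_a": {"staffing": 0.90, "maintenance": 0.85, "training": 0.80, "procedures": 0.95},
    "site_b": {"staffing": 0.70, "maintenance": 0.90, "training": 0.85, "procedures": 0.60},
}


def below_standard(scores, threshold=PASS_THRESHOLD):
    """Efficacy check: list the components scoring under the threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


def least_consistent_component(sites):
    """Consistency check: find the component whose scores vary most across sites."""
    components = next(iter(sites.values())).keys()
    spread = {
        name: pstdev(scores[name] for scores in sites.values())
        for name in components
    }
    return max(spread, key=spread.get)


for site, scores in portfolio.items():
    print(site, "below standard:", below_standard(scores))
print("Least consistent across portfolio:", least_consistent_component(portfolio))
```

A site is judged by its lowest-scoring components rather than its average, so a strong maintenance score cannot hide a weak procedures score. The spread across sites shows where the portfolio applies the program unevenly.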

Today, it is extremely unlikely that any data center, much less a global portfolio of data centers, is operating without a formalized data center operations program. But in my experience, there are a surprising number of organizations that are extremely deficient in one or more components of their program, and it is definitely a “weakest link breaks the chain” situation.

This is an exciting time to be in data center operations.

Innovative solutions like Direct Liquid Cooling are beginning to make their way into mainstream sites. The promise, and demands, of artificial intelligence are increasing at an unimaginable rate. Our industry is impacting every aspect of everyday life. It is up to us to apply our hard-earned, foundational, and proven operational behaviors to these new and exciting applications. Are you ready?

Notes

¹ “Hyperscale colocation”: the emergence of gigawatt campuses by John O’Brien

Ron Davis, Vice President of Digital Infrastructure Operations for the Uptime Institute, has over 40 years of experience in facility construction; preventive, corrective, and predictive maintenance; and associated staff and program management. His last 15 years have been solely focused on data center operations, during which he has developed, implemented, and managed global data center operations programs for Schneider Electric and NTT.
