By Dennis Bouley
Published in the February 2008 issue of Today’s Facility Manager
Data center physical infrastructure preventive maintenance (PM) is often neglected as a vital tool in reducing total cost of ownership (TCO). One way facilities managers (fms) can contribute to the goal of controlling TCO is by improving data center support systems uptime through a better understanding of PM best practices.
At the basic level, PM can be deployed as a strategy to improve the availability performance of a particular data center component (e.g. the UPS). At a more advanced level, PM can be leveraged as the primary approach to ensuring the availability of the entire data center power train (generators, transfer switches, transformers, breakers and switches, PDUs, UPSs) and cooling train (CRACs, CRAHs, humidifiers, condensers, chillers). A monitoring system that ties together all of these critical components and communicates data back to an individual who understands the integrated power and cooling train (and who can properly interpret the system messages) represents great value.
Most data center sites require one or two PM visits per year. However, more PM visits may be necessary if the physical infrastructure equipment resides in a hostile environment (high heat, dust, contaminants, vibration). The system design of the component may also impact the frequency of PM visits.
The PM professional should observe the data center environment (circuit breakers, installation practices, cabling techniques, mechanical connections, load types) and alert the fm to possible premature wear and tear of components. The fm should be aware of all factors that may have a negative impact on system availability (possible human error in handling equipment, higher than normal temperatures, high acidity levels, corrosion, and fluctuations in the power being supplied to servers).
Evidence Of PM Progress
Today’s data center physical infrastructure is more reliable and maintenance friendly than it once was. Manufacturers compete to design components that are as error free as possible. Examples of improved hardware design include:
- computer room air conditioners (CRACs) with side and front access to internal components (in addition to traditional rear access);
- variable frequency drives (VFDs) in cooling devices to control the speed of internal cooling fans (VFDs eliminate the need to service moving belts, which are traditionally high maintenance items);
- wraparound bypass functionality in UPS that can eliminate IT downtime during PM; and
- redundant cooling or power designs that allow for concurrent maintenance; the critical IT load is protected even while maintenance is being performed.
Software Design As A Critical Success Factor
Efficient physical infrastructure management software design is being pushed to the forefront as the critical success factor for maintaining high availability. Best in class data centers leverage physical infrastructure management software.
By using a predictive failure approach that is enabled through software, data center components (capacitors, for example) are replaced only when continuous onboard diagnostics recommend replacement. This is a stark contrast to the traditional approach of: “It has been six months, and it’s time to replace them.” Adhering to predictive failure practices avoids the unnecessary execution of invasive procedures, which inject the risk of human error leading to downtime.
Through self diagnosis, infrastructure components can communicate usage hours, broadcast warnings when individual components are straying from normal operating temperatures, and indicate when sensors are picking up abnormal readings. Although PM support personnel are still required to process the communications output of the maintenance management system, the future direction is moving towards physical infrastructure systems that are completely self healing.
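The self diagnosis idea can be illustrated with a minimal Python sketch. It assumes a component reports its usage hours and temperature and that rated limits are known; the class names, field names, and limit values are hypothetical and do not come from any particular vendor's management software.

```python
from dataclasses import dataclass


@dataclass
class ComponentReading:
    """One self-reported status sample (hypothetical fields)."""
    name: str
    usage_hours: float
    temperature_c: float


@dataclass
class Limits:
    """Assumed rated limits for a component; values are illustrative."""
    max_usage_hours: float
    min_temp_c: float
    max_temp_c: float


def diagnose(reading: ComponentReading, limits: Limits) -> list[str]:
    """Return warnings only when the reading strays from its limits."""
    warnings = []
    if reading.usage_hours >= limits.max_usage_hours:
        warnings.append(f"{reading.name}: usage hours at or past rated limit")
    if not (limits.min_temp_c <= reading.temperature_c <= limits.max_temp_c):
        warnings.append(f"{reading.name}: temperature outside normal operating range")
    return warnings


if __name__ == "__main__":
    limits = Limits(max_usage_hours=50_000, min_temp_c=10.0, max_temp_c=45.0)
    reading = ComponentReading("UPS capacitor bank", 51_200, 38.5)
    for message in diagnose(reading, limits):
        print(message)
```

A component that stays inside its limits produces no warnings, so no invasive procedure is triggered; this is the contrast with replacing parts on a fixed calendar interval.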
Changing Scheduling Practices
Traditional maintenance scheduling practices were established in the days before system availability became a significant concern for data center owners. Nights, weekends, and three day holiday weekends were, and are still, considered common scheduling times. However, the rise of the global economy and the requirement for 24/7/365 availability have shifted the maintenance scheduling paradigm.
In many cases, the justification for scheduling PM only on nights and weekends no longer exists. In fact, a traditional scheduling approach can add significant cost and additional risk to the PM process.
From a simple hourly wage perspective, after hours maintenance is more expensive. More importantly, services and support personnel are likely to be physically tired and less alert when working overtime or when performing work at odd hours. This increases the possibility of errors or, in some cases, can increase the risk of personal injury.
A PM provider/partner can add value by helping the data center fm to plan properly for scheduling PM windows. And, in situations where new data centers are being built, the PM provider/partner can advise the fm on how to organize the data center floor plan in order to enable easier, less intrusive PM.
Another good practice for data center PM is thermal scanning of racks and breaker panels. Abnormal temperature readings can prompt a required intervention. Infrared readings can be compared over time to identify trends and potential problems. In this way, an electrical connection, for example, can be retightened based on scientific data instead of a guess.
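Comparing infrared readings over time can be sketched as a simple trend calculation. The example below fits a least squares line to a series of scans of one connection and flags it when the temperature is rising faster than a chosen rate. The function names, the sample readings, and the 0.05 degrees C per day threshold are assumptions made for illustration, not values from a standard.

```python
def trend_slope(days: list[float], temps_c: list[float]) -> float:
    """Least squares slope of temperature (deg C) against time (days)."""
    n = len(days)
    if n < 2 or n != len(temps_c):
        raise ValueError("need at least two paired readings")
    mean_x = sum(days) / n
    mean_y = sum(temps_c) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("readings must be taken on different days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, temps_c))
    return sxy / sxx


def needs_inspection(days: list[float], temps_c: list[float],
                     max_rise_per_day: float = 0.05) -> bool:
    """Flag a connection whose temperature trend exceeds the assumed limit."""
    return trend_slope(days, temps_c) > max_rise_per_day


if __name__ == "__main__":
    # Hypothetical quarterly scans of one breaker panel connection.
    scan_days = [0, 30, 60, 90]
    scan_temps = [40.0, 42.0, 44.0, 46.0]
    print(needs_inspection(scan_days, scan_temps))
```

A steadily rising trend on one connection, while neighboring connections stay flat, is the kind of evidence that would justify retightening it.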
The thermal scanning approach can also be applied to switchgear, transformers, disconnects, UPSs, distribution panel boards, power distribution units, and air conditioner unit disconnect switches. Computational Fluid Dynamics (CFD) can also be used to analyze the temperature and airflow patterns within the data center and determine the effect of cooling equipment failure.
The Condition Based Maintenance Approach
Condition based maintenance is a type of PM that estimates and projects equipment condition over time, using probability formulas to assess downtime risks. A condition based maintenance approach will help to identify particular units that are most likely to experience defects requiring repairs. For example, a UPS that is under stress because it often switches to battery power due to poor utility power quality will be flagged as having an increased probability of future failure.
A condition based maintenance method also identifies, through statistics and data, which equipment components most likely will remain in acceptable condition without the need for maintenance. Maintenance can therefore be targeted where it will do the most good and cause the least disruption.
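The article does not say which probability formulas are used. As one possible illustration, the sketch below assumes a constant failure rate (an exponential model), so the probability of failure within a horizon t is 1 - exp(-rate x t), and it scales the rate by a stress multiplier for units that see harsher duty, such as frequent transfers to battery. The model choice, the function names, and all of the numbers are assumptions.

```python
import math


def failure_probability(base_rate_per_year: float,
                        stress_multiplier: float,
                        horizon_years: float) -> float:
    """Probability of at least one failure within the horizon,
    under an assumed constant failure rate (exponential model)."""
    rate = base_rate_per_year * stress_multiplier
    return 1.0 - math.exp(-rate * horizon_years)


def rank_units(units: list[tuple[str, float, float]],
               horizon_years: float = 1.0) -> list[tuple[str, float]]:
    """Sort (name, base_rate, stress_multiplier) tuples so the unit
    most likely to fail comes first."""
    scored = [
        (name, failure_probability(rate, stress, horizon_years))
        for name, rate, stress in units
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fleet: UPS-B often runs on battery, so its stress is higher.
    fleet = [("UPS-A", 0.02, 1.0), ("UPS-B", 0.02, 3.0), ("CRAC-1", 0.01, 1.0)]
    for name, probability in rank_units(fleet):
        print(f"{name}: {probability:.3f}")
```

Units at the top of the ranking are candidates for targeted maintenance, while units at the bottom can reasonably be left alone until their condition changes.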
Consideration Of PM Options
PM services can be purchased either directly from the equipment manufacturer or from third party maintenance providers. In some cases, the third party maintenance provider is “authorized,” and in other cases it is “unauthorized” or unaffiliated with any manufacturer.
The selection of a maintenance organization capable of supporting the PM vision for the data center is an important decision. Such organizations can be global in scope, or can offer regional or local support.
Most unauthorized, third party maintenance companies are local or regional in scope; they tend to work on fewer equipment installations. As a result, their learning curve may be longer regarding technology changes.
Since they have fewer direct links to the manufacturer and manufacturing sites, most of these maintenance providers cannot provide an escalated level of support. They do not have the benefit of leveraging the global continuous improvement PM data from installations all over the world.
On the other hand, an authorized maintenance provider or equipment manufacturer can maintain thousands of pieces of equipment across all geographies. An organization of this type can leverage tens of thousands of hours of field education to improve its maintenance practices and enhance its expertise. Data gathered by the factory trained field personnel is channeled to the R&D organizations, so they can analyze the root causes of breakdowns. The manufacturer’s R&D groups then build the needed hardware and software improvements into product upgrades that form the basis for the next PM.
A global exposure also allows for manufacturer based service personnel to maintain a deeper understanding of integrated power and cooling issues—a knowledge that can be applied to both troubleshooting and predictive analysis.
PM is a key lifeline for a fully functioning data center, and it is important to make sure that the service provider is capable of handling the facility’s needs. Maintenance contracts should include a clause for PM coverage, so the data center fm can rest assured that comprehensive support is available when required. Equipment manufacturers can package maintenance contracts that offer hotlines, support, and guaranteed response times.
Currently, the PM provider in the strongest position to provide such a level of support is the global manufacturer of data center physical infrastructure. An integrated approach to PM allows the data center fm to hold one partner accountable for scheduling, execution, documentation, risk management, and follow up. This simplifies the process, cuts costs, and enhances overall systems availability levels.
Access To Expertise Is Key
The current PM process must expand to incorporate a holistic approach. The value that PM services add to common components today (such as a UPS) should be expanded to the entire data center power train (generators, transfer switches, transformers, breakers and switches, PDUs, UPSs) and cooling train (CRACs, CRAHs, humidifiers, condensers, chillers).
Bouley is a strategic research analyst with APC’s Data Center Science Center in West Kingston, RI. He holds a bachelor’s degree in journalism and French from The University of Rhode Island and a Certificat Annuel from the Sorbonne in Paris, France. He has nine years of experience with APC interviewing global clients about their data center environments. Prior to joining APC, he was employed by IBM for 10 years.