By Steve Guzzardo
Published in the July 2005 issue of Today’s Facility Manager
Much has been written about 24/7, mission critical operations for data centers. While not all data centers are mission critical, they do all need to operate with a high level of reliability. The effect of unexpected downtime in these facilities can range from inconvenience to substantial financial loss.
The tug of war between continuous data center operation and maintenance is ongoing. Even in facilities with concurrent maintainability, repairs are still scheduled for periods when risk is low, because human error remains a possibility.
The best way for data center managers and facility managers to limit human error is to take the time to generate detailed procedures for each type of maintenance event. For example, the procedures for critical electrical switching need to be detailed enough to include a line for both the operator and person verifying the actions to initial each step.
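As a rough illustration of the double sign-off idea, the sketch below models a written procedure as an ordered list of steps, each of which needs initials from both the operator and the verifier before the next step can be signed. The class names, field names, and sample step text are hypothetical and are not taken from any particular facility's procedures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One line of a written procedure, with room for two sets of initials."""
    instruction: str
    operator_initials: str = ""
    verifier_initials: str = ""

    @property
    def signed_off(self) -> bool:
        # A step counts as complete only when both people have initialed it.
        return bool(self.operator_initials and self.verifier_initials)


@dataclass
class Procedure:
    title: str
    steps: List[Step] = field(default_factory=list)

    def next_open_step(self) -> Optional[int]:
        """Index of the first step not yet fully signed off, or None if all are."""
        for i, step in enumerate(self.steps):
            if not step.signed_off:
                return i
        return None

    def sign(self, index: int, operator: str, verifier: str) -> None:
        """Record both sets of initials for a step, in order, by two different people."""
        if index != self.next_open_step():
            raise ValueError("steps must be signed off in order")
        if not operator or not verifier:
            raise ValueError("both operator and verifier must initial the step")
        if operator == verifier:
            raise ValueError("operator and verifier must be different people")
        self.steps[index].operator_initials = operator
        self.steps[index].verifier_initials = verifier


# Hypothetical example: a two-step switching procedure.
procedure = Procedure(
    "Example switching procedure",
    [Step("Confirm alternate source is available"), Step("Close alternate breaker")],
)
procedure.sign(0, operator="AB", verifier="CD")
```

Whether on paper or in software, the design choice is the same: a step is not finished until a second person has confirmed it, and steps cannot be skipped.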
Facilities that lack clear procedures invite mistakes. A good test of a procedure is to give it to someone unfamiliar with the data center and see whether he or she can find the equipment and perform the steps properly.
One of the tools that facility managers can use to help increase reliability (and reduce downtime) in their data centers is a program of routine, ongoing testing and commissioning of systems, components, and networks.
The commissioning process, generally performed before the data center goes online, is an effective tool that, when properly implemented, provides reliable operation from the first day of service. Put simply, commissioning is a quality assurance program that starts at project inception and never ends. The formal ASHRAE definition of commissioning, according to its Guideline 0-2005, states:
“[Commissioning is] a quality focused process for enhancing the delivery of a project. The process focuses upon verifying and documenting that the facility and all of its systems and assemblies are planned, designed, installed, tested, operated, and maintained to meet the owner’s project requirements.”
The commissioning team is formed during the pre-design phase and is responsible for leading the commissioning effort. Essential team members are the facility owner’s representatives (including facility manager, project manager, occupants or users, and operations and maintenance personnel), pre-design and programming professionals, and design professionals. Construction managers should also be included, if known.
Part of the team’s objective is to provide for team building among the facility manager, design professionals, construction manager, general contractor, vendors, and trades. As part of the effort, the commissioning team should track any changes to the owner’s project requirements (OPR) and seek the owner’s approval for all changes.
This process benefits facilities management in several ways. First, the facility manager acquires a high quality data center that is functional and reliable from the first day. Change orders are reduced in quantity and scope. Also, the facility operations and maintenance staff will have documentation regarding the installation, operation, and performance of the equipment and systems in the data center.
Generally, commissioning has several phases. The sooner the team is brought into the project, the greater the benefit. Phases include: pre-design; design; construction; acceptance; and staff training.
Pre-Design Phase: The commissioning team assists in the preparation of the project requirements and basis of design.
Design Phase: The team performs targeted design reviews for general quality, coordination between disciplines, discipline specific reviews, and consistency with the OPR and basis of design. The team also develops commissioning specifications, construction checklists, acceptance test procedures, and integrated system testing.
Integrated system testing checks that all the systems operate together without unwanted electrical or mechanical interactions. Simply stated, this testing is designed to verify that all building sub-systems (UPS, automatic static transfer switch, and emergency power) will function together to support and protect the business units.
Construction Phase: The commissioning team reviews the factory test procedures, attends factory acceptance testing, verifies that submittals meet OPR, and verifies that installed systems and assemblies comply with the OPR (through periodic construction inspections and verification of construction checklists).
Acceptance Phase: The commissioning team verifies execution of acceptance test procedures and verifies training of operations and maintenance personnel and occupants. In the case of mission critical facilities, selected critical equipment, such as UPS systems and other complex equipment, should be tested by the commissioning team or another independent testing firm to verify compliance with the OPR.
Staff Training: Proper staff training is imperative. This should start with the equipment training provided by the vendor and be supplemented with training on complete systems and interactions with other building systems. Staff training should continue with integrated technical instruction that teaches ways to identify and recover from faults. Safety training should be included and reviewed monthly.
Reliability Assurance Testing
Once the facility manager has a functional and reliable data center, the issue becomes how to assure the center continues to meet the OPR throughout its life. Continual commissioning, or reliability assurance (RA) testing, is one solution. If selected, it should begin immediately after completion of integrated system testing.
The frequency of RA testing depends on a number of factors, such as how critical the facility is and how capable management is of supporting concurrent maintenance. In general, testing is performed annually.
RA testing was born of necessity in mission critical facilities in the late 1980s and can be an integral part of continuous commissioning and predictive maintenance programs. This testing was developed to address the loss of critical systems that had been properly maintained by the vendor in accordance with preventive maintenance procedures.
The failures could not be traced to poor maintenance, and a review of maintenance records identified no problems; all measurements appeared to be normal. As a result, many facility managers sought engineering assistance to find the cause. On initial evaluation, the engineers found no problems with the installation, cooling, or other environmental conditions. It was then decided that testing the UPS system would determine its health, and RA testing was conceived.
An RA testing program improves reliability by targeting and replacing components as problems develop, and it can also identify problems introduced during vendor maintenance events, such as errors in calibration and assembly. To detect these post-maintenance problems without putting the critical load at risk, RA testing uses load banks (“dummy” loads for the electrical equipment) to simulate real loads. This can identify electrical problems without damaging sensitive equipment.
Another aspect of the RA testing program is to verify operation under a full rated load. This is somewhat equivalent to a stress test for a heart patient in that it helps to identify weak components under controlled conditions.
The information gathered over time with this testing helps to identify the useful lifespan of certain components and facilitates replacement of components a vendor does not have in its maintenance schedule. A structured RA testing routine and proactive maintenance approach can extend reliable UPS operation to 20 years or more in some cases.
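One way to picture how test data gathered over time can point to a component's remaining useful life is a simple trend projection. The sketch below fits a straight line to periodic readings and estimates how long until the trend reaches a limit. Everything here is an assumption for illustration: the function name, the idea that a higher reading means a more worn component, the linear trend, and the sample numbers, which are invented and not measurements from any real equipment.

```python
from typing import Optional, Sequence


def years_until_limit(
    years: Sequence[float], readings: Sequence[float], limit: float
) -> Optional[float]:
    """Estimate years from the last test until a rising reading reaches `limit`.

    Uses an ordinary least-squares straight-line fit. Returns None when the
    readings are not trending upward, and 0.0 when the fitted line has
    already reached the limit at the last test.
    """
    n = len(years)
    if n < 2 or n != len(readings):
        raise ValueError("need at least two (year, reading) pairs of equal length")

    mean_x = sum(years) / n
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("test dates must not all be identical")

    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, readings)) / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        # Assumption: only a rising reading indicates wear.
        return None

    crossing_year = (limit - intercept) / slope
    return max(0.0, crossing_year - years[-1])


# Invented annual readings for one hypothetical component.
remaining = years_until_limit([0, 1, 2, 3], [10.0, 12.0, 14.0, 16.0], limit=20.0)
```

Real components rarely age in a straight line, so a projection like this is best treated as a prompt for closer inspection, not as a replacement schedule.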
A safe and efficient layout is important to a properly run data center. When reviewing plans for new equipment rooms or upgrades to existing spaces, facility managers should verify that equipment controls are located in sight and within easy reach of operators. They should also verify that interlocks intended to prevent unsafe operations use keys or other mechanisms that cannot be moved from one system to another and thereby allow an unwanted operation.
When dealing with climate control, the cooling of high density blade servers is critical. It is important to have redundant cooling online, verified, and operational before starting maintenance work. One data center experienced a loss of chilled water for a high density area for six minutes and suffered the loss of several servers. The heat generated in high density areas is so intense that it takes only minutes for failures to begin. Those operating the data center should back up, back up, back up.
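The rule that redundant cooling must be online, verified, and operational before maintenance begins can be written down as a simple go/no-go check. The sketch below is only an illustration: the class, the function, the unit names, and the exact meaning given to each of the three flags are assumptions, not part of any monitoring product.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CoolingUnit:
    name: str
    online: bool       # the unit is running
    verified: bool     # someone has confirmed it can carry the load (assumed meaning)
    operational: bool  # no active alarms or faults (assumed meaning)


def maintenance_may_start(unit_under_maintenance: str, units: Iterable[CoolingUnit]) -> bool:
    """Return True only if some other unit is online, verified, and operational."""
    return any(
        u.online and u.verified and u.operational
        for u in units
        if u.name != unit_under_maintenance
    )


# Hypothetical two-unit room: unit B is running but has not been verified.
units = [
    CoolingUnit("Unit A", online=True, verified=True, operational=True),
    CoolingUnit("Unit B", online=True, verified=False, operational=True),
]
can_service_a = maintenance_may_start("Unit A", units)  # depends on Unit B
can_service_b = maintenance_may_start("Unit B", units)  # depends on Unit A
```

Requiring all three conditions, not just "online," reflects the point of the anecdote: with only minutes of margin in a high density area, backup cooling that has not been confirmed is not a backup.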
The need for increased reliability in data centers has risen in the past 20 years. It is obvious that those charged with maintaining these facilities need to take additional steps to help reduce downtime. If calls from work in the middle of the night are not the facility manager’s favorite event, the use of good commissioning practices and RA testing should help to limit those events and help facility managers get an uninterrupted night of sleep.
Guzzardo is managing principal at EYP Mission Critical Facilities, Inc., based in New York, NY. He can be reached at (845) 346-3900.