Understanding The Cost Of Data Center Downtime

Transcription

UNDERSTANDING THE COSTOF DATA CENTER DOWNTIME:An Analysis of the Financial Impacton Infrastructure Vulnerability

Executive SummaryOver the course of the past decade, enterprise business has fundamentally changed. Among the many changes experienced,none has been more profound than the increase in reliance on information technology (IT) systems to support business-criticalapplications. For many of today’s enterprises – including banks, telecommunications companies, internet service providers andcloud/co-location facilities – data center throughput has evolved into monetized commodity. No longer simply supporting theinternal needs of the organization, data center availability has become essential to many companies whose customers pay apremium for access to a variety of IT applications.This unprecedented reliance on IT systems has forged an even stronger connection between data center availability and totalcost of ownership (TCO). A single downtime event now has the potential to significantly impact the profitability (and, in extremecases, the viability) of an enterprise. Unfortunately, a severe disconnect exists between IT personnel and their C-suite counterpartswith regard to understanding the frequency and the cost of data center downtime.Recognizing the need to address these misconceptions, Vertiv partnered with the Ponemon Institute to conduct two in-depthstudies on the perceptions, causes and true monetary costs of data center downtime – totaling thousands of dollars per minuteon average – as well as which infrastructure vulnerabilities have the most significant and costly impact on the availability of criticalIT systems (see “National Survey on Unplanned Data Center Outages” and “The Cost of Data Center Outages”).In addition to examining the differing perceptions between the C-Suite and IT staff, this white paper takes a detailed look at thepotential “bottom line” costs of data center downtime and examines how power, cooling, monitoring and service inadequacies cancontribute to a facility’s risk of downtime. It explores specific data center infrastructure vulnerabilities and associated downtimecosts, as well as recommendations for fortifying these infrastructures to minimize downtime and achieve the highest possiblereturn on investment (ROI). Finally, it offers a long-term business case for addressing these critical vulnerabilities as well as factorsCIOs and IT personnel should consider when prioritizing their actions and investments.2

Introduction:Downtime Perceptions vs. RealitiesSince the “dot com” boom (and subsequent bust) of the late90s and early 2000s, IT networks and data center systemshave experienced a resurgence in the central role they playin revenue generation and business growth. Fromstreamlining customer service and networking to facilitatinga variety of e-commerce and enterprise IT services, datacenters have evolved into business foundations forcompanies in a wide range of industries. Furthermore, as ITservices become increasingly commoditized (via co-location,disaster recovery and cloud computing services), theeconomic impact of data center operations will continue togrow at an unprecedented rate.However, even though more enterprises depend on theirdata centers to support business-critical applications thanever before, significant infrastructure vulnerabilities andmisperceptions about the frequency and cost of IT failureshave put many companies at increased risk for costlydowntime events.According to a September 2010 Ponemon Institute studycommissioned by Vertiv , misconceptions about thefrequency and impact of data center downtime havebecome commonplace in businesses across the UnitedStates. The survey of more than 400 data center and IToperations professionals revealed a widening disconnect inperceptions being perpetuated between the C-suite and“rank-and-file” IT staff:yySeventy-one percent of senior-level respondents believetheir company’s business model is dependent on itsdata center to generate revenue and/or conducte-commerce. Only 58 percent of rank-and-filerespondents shared this belief.yyThough respondents experienced an average of twodowntime events over the two-year period studied(lasting up to 120 minutes apiece, on average),62 percent of senior-level respondents agreed thatunplanned outages did not happen frequently.Forty-one percent of rank-and-file respondents alsoagreed with this statement.yySeventy-five percent of senior-level respondents feeltheir companies’ senior management fully supportsefforts to prevent and manage unplanned outages, whilejust 31 percent of supervisor-level employees and belowagreed with this statement.yyLess than 32 percent of all respondents agreed theircompany utilizes all best practices to maximizeavailability of critical IT equipment (40 percent at theexecutive level; 29 percent at the rank-and-file level).Based on these findings, it is clear that executive-levelrespondents are extremely cognizant of the economicimportance of their company’s data center operations.This is not surprising, as the core responsibility for seniormanagement and C-level executives (including ChiefInformation Officers) is to understand how all facetsof the business contribute to a company’s growthand performance.Survey responses also indicated that most of theseexecutives are not as in-tune to the day-to-day data centeroperations as rank-and-file employees specifically chargedwith maintaining the company’s IT infrastructure. As such,many of the executives surveyed are not as aware of thefrequency of downtime events and the vulnerabilities intheir data center infrastructures that are contributingto these events.Conversely, rank-and-file IT staff are more aware of thefrequency of system failures and specific vulnerabilities intheir companies’ data center infrastructures than theirexecutive-level counterparts. However, fewer rank-and-filerespondents actively acknowledge the role of theircompanies’ data center operations in generating revenueand/or facilitating e-commerce activity.On the surface, these findings may appear to be benignexamples of how “siloed” work groups can promotedisconnects in how common issues are perceived. However,for companies whose profitability is directly tied to theavailability of enterprise IT operations, they can lead todramatic increases in adverse risk for the profitability, andpotentially the viability, of a business.By bridging the perception gap between C-suite executivesand rank-and-file IT staff, companies will be betterpositioned to maximize the availability of critical ITapplications without overly inflating a data center’s total costof ownership. In addition to ensuring the entire organizationhas an accurate perception of the state of its data centerinfrastructure, it is critical employees at all levels of theorganization have a thorough understanding of the truefinancial implications of downtime.These alarming misperceptions about the frequency andimpact of data center downtime events triggered thecommission of a second study to determine andbenchmark the average cost of data center downtimein the United States.3

TWO STAGE POWER DISTRIBUTIONMethodology:Benchmarking the Cost of DowntimeData Center Professionals from 41 independent facilitiesacross the country – spanning a variety of organizationalresponsibilities – were asked to participate in the study.Participating data centers represented a wide variety ofindustry segments, including financial services,telecommunications, retail (conventional and e-commerce),health care, government and third-party IT services. Toensure that costs were representative of an averageenterprise data center, participating data centers wererequired to have a minimum square-footage of 2,500 ft2.2%2%2% lic sectorIndustrialServices7%5%10%Conventional retailTechnology & softwareEducationE-commerce retailCollocation servicesFinancial servicesHealthcareFigure 1: Distribution of participating organizations by industry segment.Representatives from all levels of the IT staff were asked toparticipate in the study, including:yyFacility ManagersyyChief Information OfficersyyData Center Management PersonnelyyChief Information Security OfficersyyIT Compliance Leaders4Respondents provided direct, indirect and opportunity costestimates (separately) for a single recent outage based onprovided range variables. To ensure reported losses includedin the study are as comprehensive as possible, follow upinterviews also were conducted to obtain additionalinformation about further revenue losses resulting from datacenter outages.Quantifying the Cost of Downtime12%7%To calculate the comprehensive cost of data centerdowntime, researchers used an activity-based costing modelwhich took into consideration direct, indirect andopportunity costs. As shown in Figure 2, costs werecategorized according to internal activity centers andexternal cost consequences.The study, completed in 2011, uncovered a number of keyfindings related to the cost of downtime. Based on costestimates provided by survey respondents, the averagecost of data center downtime was approximately 5,600 per minute.Based on an average reported incident length of 90 minutes,the average cost of a single downtime event wasapproximately 505,500. These costs are based on avariety of factors, including but not limited to data loss orcorruption, productivity losses, equipment damage, rootcause detection and recovery actions, legal and regulatoryrepercussions, revenue loss and long-term repercussions onreputation and trust among key stakeholders.Though direct costs accounted for nearly one third of allcosts reported, indirect and opportunity costs – significantlymore difficult to perceive for rank-and-file staff – proved tobe significantly more costly, accounting for more than 62percent of all costs resulting from data center downtime.While business disruption and lost revenue were cited as themost significant cost consequences of downtime, other lessobvious costs such as losses in end-user and IT productivityalso had a significant impact on the cost of an averagedowntime event (Figure 3).Surprisingly, equipment costs were among the lowest costsreported for a downtime event, averaging approximately 9,000 per incident. This means that the residual,downstream effects of a data center outage often are farmore costly than the costs to detect and remedy theroot cause of an outage after it has already occurred.

Activity CentersCost ConsequencesDetectionEquipmentContainmentIT ProductivityActivity-basedcosting modelRecoveryUser ProductivityThird PartiesEx-post responseLost RevenueDirect costsIndirect costsBusiness DisruptionOpportunity costsFigure 2: Activity-based cost framework.Business disruptionLost revenueEnd-user productivityIT productivityDetectionRecoveryEx-post activitiesEquipments costsThird parties 179,827 118,080 96,226 42,530 22,347 20,884 9,537 9,063 7,008 - 40,000 80,000 120,000 160,000 200,000Figure 3: Average cost of unplanned data center outages for nine categories.When considering that the typical data center in the UnitedStates experiences an average of two downtime events1over the course of two years, the costs of downtime for anaverage data center easily can surpass 1 million in less thantwo years’ time.Other key findings from the study included:For enterprises with revenue models that depend solely onthe data centers’ ability to deliver IT and networkingservices to customers – such as telecommunications serviceproviders and e-commerce companies – downtime can beparticularly costly, with the highest cost of a single eventtopping 1 million (more than 11,000 per minute).yyThe average recovery time from a total outage was morethan twice that of a partial outage (134 and 59 minutes,respectively).In total, the cost of the most recent downtime events forthe 41 participating data centers totaled 20,735,602.1Downtime events are not limited to total data center outages.Rack- and row-leveloutages also are factored-in to this aggregateas well as associated downtime costs.yyTotal cost of both partial and total unplanned outagescan be a significant expense for organizations(approximately 258,000 and 680,000 per event onaverage, respectively).yyTotal cost of outages is systematically related to theduration of the outage and the size of the data center.yyThe leading (and most costly) root causes of downtimereported by respondents were directly related tovulnerabilities in the data center’s power andcooling infrastructures.5

The Cost of Infrastructure VulnerabilityIn addition to revenue costs associated with downtimeevents, a variety of costs are directly associated with theresponse activities necessary for restoring service andidentifying and addressing the root-cause(s) of the outage.As such, respondents were asked to cite the specific rootcause(s) of the most recent outage at their organization aswell as all costs associated with identifying and remedyingthe root cause to restore data center operations.As evidenced by Figure 4, while a variety of root causeswere cited by survey respondents – including UPS systemfailure (battery), water incursion and IT equipment failures –the majority of root causes can be attributed tovulnerabilities in the data center’s power and coolinginfrastructure. These root causes closely mirrorthose identified by respondents to the initial PonemonInstitute study.As explored in the Vertiv white paper “Addressing theLeading Root Causes of Downtime,” many of the leadingroot-causes of downtime can be attributed to a variety offactors – chief among them being the need to “get morefrom less.” As demands to increase performance andefficiency increased amidst the recent national economicOtherIT S systemfailure(battery)5%10%29%12%15%Water, heator CRAC failure24%Accidental/Human errorFigure 4: Primary root causes of reported unplanned outages.recession, data center managers began implementing designstrategies that achieved these gains at the cost of exposingcritical vulnerabilities in their infrastructures.Fortunately, the risk of many of the leading root causes ofdowntime can be minimized by observing best practices ininfrastructure design and system redundancy, as well asimplementing a comprehensive preventive service andmaintenance regimen.In the following sections, this paper will further examine thecosts incurred by vulnerabilities in respondents’ power andcooling infrastructures as well as actions and best practicesthat can be implemented to minimize recovery costs as wellas the overall risk of downtime2.Power-Related OutagesAccording to survey respondents, more than 39 percent ofdata center outages reported were attributed directly tovulnerabilities in the data center’s power. Among the generalroot causes of downtime related to power, UPS relatedfailures (including batteries) proved to be the mostcostly ( 687,700) followed by generator failures ( 463,890).One of the primary reasons power vulnerabilities are socostly for data centers is that a failure in the powerinfrastructure will likely result in a catastrophic, totalunplanned outage. This means that in addition to any directcosts incurred to remedy the cause of the outage, indirectand opportunity costs also will be significant due to the factthat all stakeholders will be affected by the outage.By definition, Tier I and II data center facilities are notequipped with the technologies needed to isolate a powersystem failure, such as redundancy, dual power paths andstatic switches. As a result, the availability of these datacenters’ power infrastructures is wholly dependent on theintegrity of the facility’s single backup system.Because Tier I and II data centers can do relatively little toprevent the indirect and opportunity costs incurred by atotal data center outage caused by a power failure, makinginvestments that minimize the impact of a power systemfailure on data center operations is strongly recommended.One of the best ways to do this is to ensure that all powersystems are backed by an adequate level of redundancy.Implementing redundancy allows facility managers toeliminate single points of failure in their power2NOTE: For detailed recommendations for fortifying data center infrastructuresagainst the most common root-causes of downtime, please refer to thecompanion white paper “Addressing the Leading Root Causes of Downtime:Technology Investments and Best Practices for Assuring Data Center Availability.”6

infrastructures. Because there is always a possibility ofequipment failure over time, redundancy ensures that abackup is always in place. While direct costs would still beincurred to repair or replace the failed module, theequipment failure would not have a catastrophic impact ondata center availability, and thus the organization would notincur the substantial indirect and opportunity costsassociated with a total unplanned outage.When adding a UPS for redundancy or replacing an existingor failed module, the long-term reliability of the solutionshould be the highest priority. Some UPS systems, includingthe Liebert NXL, also are capable of achieving superiorperformance and availability through redundantcomponents, reduced number of components, faulttolerances for input currents and integrated batterymonitoring capabilities.In addition to establishing redundancy in the powerinfrastructure, adequate service and maintenance forcritical power systems can play a significant role minimizingthe risk of power equipment failure. In fact, even a singleannual preventive maintenance visit can increase the“mean time between failure” (MTBF) of a UPS unit bymore than ten-fold.Finally, the implementation of comprehensive infrastructuremonitoring and management tools such as Liebert Nform,Liebert SiteScan and Alber Battery Monitoring also canminimize the activity costs intrinsic to detecting andrecovering from power system failures. Integrating acomprehensive monitoring solution – including battery andbranch circuit monitoring – allows IT staff to quickly identify,isolate and address power equipment issues.IT equipment failure 750,326 687,700UPS system failure (battery) 612,993Other root causes 489,100Water, heat or CRAC failureGenerator failure 463,890Weather realated 395,065 298,099Accidental/human error 0 200,000 400,000 600,000 800,000Figure 5: Average total cost by root causes of the unplanned outage.7

Environmental-Related OutagesLeading Root Causes of Downtime,” including:Along with vulnerabilities in the power infrastructure,environmental vulnerabilities also accounted for anoteworthy portion of the root-causes cited by surveyrespondents. Fifteen percent of all root causes were directlyattributed to thermal issues, including water incursion and ITequipment failures related to heat density and coolingcapacity. The costs associated with detecting andrecovering from these failures also was significant, atmore than 489,000 per incident.yyMinimizing the risk of water incursion through theuse of refrigerant-based cooling instead ofwater-based solutions.Environmental issues also are a leading cause of ITequipment failures. In fact, though IT equipment failures onlyaccounted for five percent of root causes cited by surveyrespondents, these failures incurred the highest overallcost – more than 750,000.In many cases, a single failure can cause a chain reaction ofIT equipment failures – requiring extensive detection andrecovery efforts to identify the root-cause in addition to thereplacement of affected IT equipment. For example, a chilledwater leak in the data center’s in-row cooling system cancause the failure of sensitive IT equipment. In addition toidentifying and remedying the cooling issue that caused theoutage, servers and other damaged IT equipment will needto be replaced.Also, it is critical to point out that cooling equipment doesnot need to fail to cause an IT equipment failure. Conversely,these failures – typically caused by high heat densities and“hot spots” within the rack – frequently occur as a result ofan inadequate cooling infrastructure rather than acooling equipment failure. This further reinforces theimportance of an optimized cooling infrastructure.yyEliminating hot spots and high heat densities by bringingprecision cooling closer to the load via row-basedprecision cooling solutions.yyInstalling robust monitoring and managementsolutions with remote monitoring functionality.yyFortifying cooling and IT equipment investments withregular preventive maintenance and service visits.While these recommendations embody many of the bestpractices for maximizing the availability, effectiveness andefficiency of the data center’s cooling infrastructure, somevendors, including Vertiv, now offer facility managers theability to implement an integrated solution optimized forefficient, high-availability power and cooling performance.These solutions offer all of the aforementioned design bestpractices, some with the additional benefit of rapiddeployment for data center expansion or disaster recovery.These integrated solutions also offer the added benefit ofefficient precision cooling through cold-aisle containment(See Figure 6), maximizing the effectiveness of theintegrated cooling solution. These characteristics play acritical role in focusing cooling based on the real-time needsof the equipment housed within the racks, minimizing therisk of hot spots and other faults common in high densitycomputing environments while operating at a highlevel of efficiency.While some outages relating to the data center’s coolinginfrastructure may be more isolated than power-relatedfailures – contributing to both total and partial data centeroutages – a comprehensive cooling infrastructure remainscritical to minimizing downtime events and their associatedcosts. This is particularly true considering the manyconnections between a data center’s cooling infrastructureand the viability of critical IT equipment – where coolingsystems do not need to fail to cause catastrophic failuresand damage sensitive and costly equipment.Fortunately, there are a number of best practices andinvestments that can be made to a data center’s coolinginfrastructure to minimize the risk of catastrophic equipmentfailures and associated downtime events. Many of these bestpractices are explored in the white paper “Addressing the8Figure 6: Data center solutions to optimize precision cooling, like SmartAisle fromVertiv, address specific needs with rapidly deployable solutions that costeffectivelyadd data center capacity, improve IT control and increase efficiency.

Making the Business Case for InfrastructureOptimization 3As detailed in the preceding sections, vulnerabilities in adata center’s infrastructure can have a dramatic impact on afacility’s susceptibility to costly downtime events totalinghundreds of thousands of dollars. However, as this paper hasdemonstrated, only 29 percent of rank-and-file IT staffmembers believe that their companies have implementedthe technologies and best practices required to minimize theoccurrence and impact of data center downtime.This disconnect begs the obvious question: If executivesunderstand the role of their data centers in generatingrevenue and sustaining their respective business models,why have many hesitated to make the necessaryinvestments required to fortify their infrastructures againstdowntime? The likely answer is that, prior to quantifying thecost of data center downtime, most executives could notrecognize how downtime prevention speeds the ROI of theirinfrastructure investments.As evidenced by the findings of the Ponemon Institute,downtime can result in a variety of long-term reccurringcosts, which include direct costs associated with identifyingand addressing root causes, as well as indirect costsassociated with disrupting business-critical operations. Whileminimizing the risk of downtime events and their overallfinancial impact may necessitate a significant up-frontCAPEX investment, when considering the gains in direct andindirect downtime costs as well as savings gleaned fromincreases in efficiency that reduce OPEX, select investmentscan actually speed a business’ time-to-ROI while reducing adata center’s total cost of ownership over time.To emphasize this point, one needs only to compare thecost of infrastructure optimization to the average cost andoccurrence of downtime over time. It is important to firstunderstand how the cost of downtime impacts the speed toROI for data center infrastructure investments.3NOTE: Though based on real-world scenarios, the costs detailed in this analysisare approximations of market costs for a reference model data center (presentedin Appendix A). To obtain a detailed estimate for optimizing your specific datacenter infrastructure in accordance with the below recommendations, pleasecontact your Vertiv Representative.Power Infrastructure OptimizationFirst, consider that a typical unoptimized enterprise datacenter experiences an average of ten downtime events overa period of ten years, spanning a variety of root causes. Atan average per-event cost of just over 500,000 (includingdirect costs, indirect costs and opportunity costs), a typicalenterprise data center can incur more than 5 million indowntime costs during this time.UPS system failure costs accounted for 29 percent of datacenter outages reported by survey respondents.Extrapolated over ten years, these data centers can expectto incur at least three downtime events related to UPSsystem failure, at an average total cost in excess of 2 millionin total downtime costs.Compare this figure to the approximate costs associatedwith adding UPS redundancy to a 2,500-square-foot datacenter with 105 high-density racks (1,000 servers) and afacility power draw of approximately 1,200 kW. Adding UPSredundancy to a data center of this size would likely requirean initial capital investment of approximately 250,000 andan annual investment of up to 15,000 for two annualpreventive service visits (increasing the MTBF for UPSsystems by up to 23 times).Based on these numbers, when extrapolating theseinvestments over ten years, the total investment instrengthening this data center’s UPS systems infrastructurewould be approximately 400,000. Compared to the averagetotal cost of downtime events caused by a UPS systemsfailure as reported by respondents ( 687,000), ROI is easilyachieved through the prevention of a single UPS-relateddowntime event. Furthermore, over a period of ten years,ROI can be achieved three-fold in potential downtime costsalone, not considering gains in efficiency and OPEXassociated with reactive service visits.Cooling Infrastructure OptimizationA similar analysis can be conducted with regard to theoptimization of a data center’s cooling infrastructure. Datacenter outages related to failures or inadequacies of criticalcooling systems accounted for approximately 20 percent ofreported outages, including IT equipment failures.Collectively, the average cost of these root causes wasapproximately 554,000. This means that if an average datacenter experiences ten downtime events over a period of tenyears, an average of two events (with an average total costof more than 1.1 million in downtime costs) will be related tovulnerabilities in the data center’s cooling infrastructure.9

To contrast these costs with the cost of infrastructureoptimization, one can revisit the aforementioned “model”data center. In this case, the model data center is assumedto rely on eight chilled-water based cooling solutionsservicing load from the data center’s IT equipment, UPS andPDU systems, as well as building egress and human load.solution), data center managers and dramatically enhancethe effectiveness of their cooling equipment with the addedbenefit of significant gains in energy savings. The addition ofintelligent controls (Liebert iCOM ) and remote monitoringto a contained infrastructure (approximately 80,000 for thebaseline data center presented in Appendix A) can furtherenhance cooling efficiency by at least 12 percent and ensurethat all IT equipment is being adequately and preciselycooled based on real-time heat densities (see Figure 7).Finally, investing in ongoing preventive maintenance andservice for the equipment (an approximate annualinvestment of 2,000) and installation of a comprehensiveleak detection solution for all cooling units (approximately 5,000) is recommended.Based on these parameters, it is strongly recommended thatdata center managers invest in an assessment of their datacenter space. These services can range from a data centeraudit performed by trained service representative (often freeas part of an existing service agreement) or a morecomprehensive thermal assessment complete with CFDmodeling (approximately 12,000 for the baseline datacenter in Appendix A) which unveils a clear picture ofvulnerabilities in a data center’s cooling infrastructure andareas where significant efficiency gains can be achievedthrough cooling optimization. Often, such assessmentsconclude that additional equipment investments can bepostponed by optimizing the configuration of coolingsystems, racks and IT equipment.Over ten years, the total investment in strengthening thisdata center’s cooling infrastructure would be approximately 135,000 ( 115,000 in year one). Compared to the averagetotal cost of a single downtime event caused by IT systemsfailure or thermal-related outages as reported byrespondents ( 554,000), these investments can easily bejustified if they prevent even a single thermal-relateddowntime event.54 oFCompressorCondenserEvaporator FanTotalSavings62 oFIT RacksIT Racks94oFPrecisionCooling62 oFIT RacksIT RacksPrecisionCoolingIT RacksIT Racks54 oF89 oF92 oF85oF75 oF97 oFPrecisionCoolingBy optimizing a data center’s existing cooling infrastructurevia a co

center outages. Quantifying the Cost of Downtime The study, completed in 2011, uncovered a number of key findings related to the cost of downtime. Based on cost estimates provided by survey respondents, the average cost of data center downtime was approximately 5,600 per minute. Based on an average reported incident length of 90 minutes,