HP Service Health Analyzer: Decoding the DNA of IT Performance Problems


Technical white paper

Table of contents
Introduction
HP's unique approach—HP SHA driven by the HP Run-time Service Model
HP SHA—runtime predictive analytics
Product capabilities
Get started with zero configuration and zero maintenance
Return on investment
Conclusion

Introduction

Making sure you have complete visibility into the health of your business services, and that you can adapt, and even survive, in today's cloud and virtualized IT environment isn't just a "nice-to-have." It is mandatory. Managing a dynamic infrastructure and applications will take more than just reacting to business service problems when they occur, or manually updating static thresholds that are difficult to set accurately and problematic to maintain.

In today's world, you need advance notification of problems so you can solve those issues before the business is impacted. You need better visibility into how your applications and business services are correlated with your dynamic infrastructure, so you can track anomalies across the complete IT stack, including the network, servers, middleware, applications, and business processes. You need an easier way of determining acceptable thresholds as a basis for identifying events that might impact the business. You need automation that leverages the knowledge from past events, so it can be applied to address new events more efficiently and can also be used to suppress extraneous events, allowing IT to focus on just the business-impacting events.

While IT organizations have the methods to collect massive amounts of data, what has been lacking is the analytic tool set and automated intelligence to correlate these disparate metrics from both an application and a topology perspective to help these organizations anticipate or forecast potential problems on the horizon.
IT managers are looking into the world of predictive analytics, one of the notable business intelligence trends of 2011, to help them improve service uptime and performance, thereby increasing business-generated revenue and decreasing maintenance and support costs.

HP Service Health Analyzer (SHA) is a predictive analytics tool built on top of a real-time, dynamic service model, so you can understand the relationship of metric abnormalities with the application and its underlying infrastructure.

HP's unique approach—HP SHA driven by the HP Run-time Service Model

Monitoring systems provide measurements and events from all layers of the IT stack—hardware, network, OS, middleware, application, business services, and processes. Configuration management databases (CMDBs) provide the model that links all of the different components. But, given the ever changing nature of IT systems, CMDBs need to be constantly updated, as is the case with the HP Run-time Service Model (RtSM). The combination of the monitors and the real-time CMDB provides all the necessary data to meet the challenges above. However, all of the data needs to be transformed to provide actionable information. HP SHA uses advanced algorithms that combine multiple disciplines (topology, data analytics, graph theory, and statistics) in the Run-time Anomaly Detection (RAD) Engine.

HP's solution to the outdated service model is our RtSM. The RtSM synchronizes with the HP UCMDB to leverage the service modeling in the "external" Universal Configuration Management Database (UCMDB). The RtSM then leverages the data collectors of the HP Business Service Management (BSM) portfolio that are monitoring for performance, availability, fault, and topology to share real-time topology, so that the RtSM has the most current understanding of topology and relationships. The RtSM is a core foundation of SHA.

For more information on how the RtSM works with the UCMDB, see the "RtSM best practices guide."

Figure 1. Solution template

Figure 1 outlines the components of SHA that we determined are required for an accurate solution to decode IT performance problems. We now outline the components and their requirements.

Baselining is the first component; it takes every metric collected by monitoring systems and learns its normal behavior. Deviations from normal behavior by a metric serve as the first step for detecting, predicting, and decoding performance problems. However, learning the normal behavior of metrics accurately is a challenging task. Factors such as seasonal behavior, trends, and changes due to an ever evolving IT system require the learning algorithm estimating the baseline to be adaptive and aware of these factors. Figure 2 shows the distribution of the season for over 17,000 performance metrics collected from a real IT system. These are a combination of system, application, and user-level monitors. As can be seen, over two thirds of the metrics display some seasonal behavior, and these represent a range of various seasons, not just the typically assumed daily or weekly seasonality. A baseline algorithm must first estimate the season to be accurate. For example, if a metric has a seasonal behavior of five hours, and a baseline algorithm ignores the season or uses a predetermined season that is incorrect (for example, 24 hours), it will produce a poor baseline. The baseline will either be too sensitive, producing many false deviations from normal that are in fact normal, or be too indiscriminate and fail to detect deviations from normal behavior when they exist.
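To illustrate why season estimation matters, here is a minimal, hypothetical sketch (not SHA's actual algorithm) that picks a season length by lagged autocorrelation and then builds a per-phase baseline "sleeve" from it. The function names, the candidate season list, and the sample data are all invented for the example, and hourly sampling is assumed:

```python
import statistics

def estimate_season(samples, candidates=(24, 12, 8, 6)):
    """Pick the candidate season length with the highest lagged
    autocorrelation -- a toy stand-in for automatic season learning."""
    mean = statistics.fmean(samples)
    var = sum((x - mean) ** 2 for x in samples)
    best, best_score = None, float("-inf")
    for lag in candidates:
        if lag >= len(samples):
            continue
        score = sum((samples[i] - mean) * (samples[i - lag] - mean)
                    for i in range(lag, len(samples))) / var
        if score > best_score:
            best, best_score = lag, score
    return best

def build_baseline(samples, season, k=3.0):
    """Per-phase mean +/- k*stdev forms the sleeve of normal behavior."""
    sleeve = []
    for phase in range(season):
        bucket = samples[phase::season]
        mu = statistics.fmean(bucket)
        sigma = statistics.pstdev(bucket)
        sleeve.append((mu - k * sigma, mu + k * sigma))
    return sleeve

# A metric with a 6-sample season (e.g., a 6-hour cycle sampled hourly).
pattern = [10, 50, 80, 60, 30, 15]
history = pattern * 20
season = estimate_season(history)
print(season)  # 6
```

A fixed 24-hour assumption would have matched here too (24 is a multiple of 6), but with a weaker score; for a metric whose true season is, say, five hours, the 24-hour guess produces the poor sleeve the text describes.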

Figure 2. Distribution of seasonal behavior for over 17,000 metrics collected from an IT environment

Similarly, estimating trend and being adaptive to changes are important for estimating a good baseline.

While understanding the normal behavior of individual metrics is important, it is not sufficient to detect and predict real problems. By definition, some of the deviations from the baseline will not be related to any problem (a small fraction); in a large IT environment with millions of metrics, even this small fraction can lead to too many false alerts if each deviation is treated individually as a problem. In addition, problems typically do not manifest themselves on a single metric in the environment.

Temporal analysis: This is one of the widespread approaches to combining metrics into a single anomaly. Temporal analysis methods include metric-to-metric correlations, where metrics are grouped together based on the similarity of their time-series measurements, or multivariate temporal analysis/prediction, which combines multiple metrics together through a (typically linear) multivariate mathematical model, such as multivariate regression, neural, and Bayesian models.

These methods are powerful but have their limitations. First, they scale poorly with the number of metrics. Second, given their statistical nature, they can find misleading correlations if they are provided a very large number of metrics that have no real relationship between them; the chance of finding such wrong correlations increases with the number of metrics.

Topology analysis: What helps temporal methods overcome their limitations is domain-related context. In particular, in IT environments, the set of metrics being analyzed should be limited to a logical set of related metrics. If the CPUs of two completely unrelated servers become high at the same time, they should not be considered correlated even if statistically they appear to be. Such context is provided in the topology of IT systems, through CMDBs.
A CMDB is essentially a graph, modeling the relationships between all components making up IT systems—the physical, middleware, software, application, business services, and processes layers. Topology analysis, in the form of advanced graph algorithms, is therefore required to extract the contextual information within the CMDB, and to help detect real problems and correlations between metrics while filtering noise.
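To make the interplay of temporal and topology analysis concrete, here is a toy sketch (a tiny invented topology and metric set, not SHA's real algorithms) that keeps a metric-to-metric correlation only when the underlying CIs are also related in the CMDB graph:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equally sampled metrics."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def related(a, b, graph, max_hops=2):
    """Breadth-first search: are two CIs within max_hops of each other
    in the CMDB topology graph?"""
    frontier, seen = {a}, {a}
    for _ in range(max_hops):
        frontier = {n for ci in frontier for n in graph[ci]} - seen
        if b in frontier:
            return True
        seen |= frontier
    return False

# Toy CMDB: adjacency sets between the CIs hosting the metrics below.
topology = {"web-1": {"db-1"}, "db-1": {"web-1"}, "mail-1": set()}
metrics = {
    ("web-1", "response_time"): [1, 2, 3, 4, 5, 6],
    ("db-1", "query_time"):     [2, 4, 6, 8, 10, 12],  # moves with web-1
    ("mail-1", "queue_depth"):  [3, 6, 9, 12, 15, 18], # coincidentally similar
}

# Keep a correlation only if the CIs are also related in the topology;
# this filters statistically strong but coincidental pairs.
pairs = [(m1, m2) for m1 in metrics for m2 in metrics if m1 < m2]
kept = [(m1, m2) for m1, m2 in pairs
        if abs(pearson(metrics[m1], metrics[m2])) > 0.9
        and related(m1[0], m2[0], topology)]
print(kept)  # only the topologically related pair survives
```

All three series correlate perfectly in time, but the mail server's queue depth is discarded because nothing in the topology connects it to the web or database tier, which is exactly the filtering role the text assigns to the CMDB.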

Therefore, detecting a real problem requires detection of patterns of deviations from normalcy across multiple metrics that span time and are filtered by topology. This leads to statistical learning methods that analyze temporal and topological data.

Historical analysis: Beyond detection and prediction of a problem, the topology provides the ability to scope the problem and separate root cause from symptoms; both are important for quickly resolving problems. With a problem detected and analyzed, its DNA pattern is finally decoded, and it can be stored in a knowledgebase. To leverage the knowledgebase, algorithms that perform historical analysis are required. These include algorithms for matching and comparing different problem DNA patterns, clustering them, and classification techniques. With the knowledgebase and algorithms in place, past problems can be leveraged quickly and automatically to help find root causes and resolutions for new problems.

RAD Engine: The RAD Engine is defined by this complete set of algorithms, which are the subject of 10 separate patent applications. The output of the RAD Engine is a critical key performance indicator (KPI) in the HP BSM dashboard, and it sends an event into the BSM event subsystem, HP Operations Manager i (OMi). The event from SHA contains a wealth of contextual information gathered by the RAD Engine, including the lead suspects, location information, business impact information, a list of the configuration items (CIs) involved in the anomaly, and any similar anomaly information.
This information helps customers isolate and resolve the event quickly, before the business is impacted.

HP SHA—runtime predictive analytics

In SHA we developed statistical learning algorithms coupled with graph algorithms for analyzing the full spectrum of data collected by BSM systems:
• Monitoring data (synthetic and real user)
• Events
• Changes
• Topology from the RtSM

These algorithms accurately detect anomalies, decode their DNA structure and their business impact, and match them to previously decoded anomalies collected in our Anomaly DNA Knowledgebase.

SHA can be described in the following steps:

• Metric behavior learning
Learning the normal behavior, also known as baselining, of the metrics collected from all levels of the service (system, middleware, application, and others) is a necessary first step. It removes the need to set static thresholds and enables early detection of deviations from normalcy. Key strengths of our algorithms are:
– Automatic learning of the metric's seasonal behavior and its trend
– Adaptive to behavioral changes over time―a must in virtualized environments
– Configuration free—no administrative effort to set or maintain thresholds is required

• Anomaly DNA Technology—detection
As a holistic problem evolves in an IT service, numerous metrics and components related to that service begin to experience deviations from normal behavior. However, there are constant momentary deviations from normalcy by various components that do not represent any meaningful problem. Selecting the meaningful problems and discovering the DNA of true problems is the challenge of any anomaly detection system.
Our anomaly DNA detection algorithm accomplishes this using a unique statistical algorithm, which combines three types of information necessary for achieving accurate detection:
– Topological information: logical links between monitors and the components they monitor
– Temporal information: the duration and temporal correlation of the monitors being in an abnormal state
– Statistical confidence information: the probability of the monitor truly being in an abnormal state, as learned by the baseline over time
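As a rough, hypothetical illustration of how these three ingredients could be combined into a single significance score (the weighting scheme below is invented for the example and is not SHA's patented algorithm):

```python
def anomaly_significance(breaches, service_size):
    """
    Toy score combining the three ingredients named in the text:
    - topological: fraction of the service's CIs that are abnormal
    - temporal: how long each metric has stayed abnormal (capped weight)
    - statistical: the baseline's confidence that the state is abnormal
    `breaches` maps a metric name to (abnormal_minutes, confidence 0..1).
    """
    if not breaches:
        return 0.0
    coverage = len(breaches) / service_size
    duration = sum(min(mins, 30) / 30 for mins, _ in breaches.values()) / len(breaches)
    confidence = sum(conf for _, conf in breaches.values()) / len(breaches)
    return coverage * duration * confidence

# One brief blip on a single CI scores low; sustained breaches across
# several CIs of the same service score much higher.
blip = {"cpu@web-1": (2, 0.6)}
outage = {"cpu@web-1": (30, 0.99), "heap@app-1": (25, 0.95), "io@db-1": (30, 0.9)}
print(round(anomaly_significance(blip, 10), 3))
print(round(anomaly_significance(outage, 10), 3))
```

The point of the multiplication is that a momentary, low-confidence deviation on one CI cannot score highly no matter which single factor spikes, which mirrors why individual baseline breaches are not treated as problems on their own.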

Key strengths of our anomaly detection algorithm are:
– Clutter reduction: Provides an automatic method to group metrics that breached their baseline, using both temporal and topological information. This in turn reduces the number of baseline-breach events that an operator has to look at, without having to set any rules.
– Event reduction: The algorithms of SHA combine multiple abnormal metrics into a single event, reducing the total number of events presented to an operator. The entry point for this type of event is multiple metrics breaching their dynamic thresholds. SHA then correlates these metrics by time and topology to generate a single event, allowing the operator to focus on the real issue.
– False alarm reduction: Reduces the number of false alerts by computing the significance of an anomaly in the system using a statistical algorithm. Also, known anomalies that were marked as noise in the past are used to match current anomalies and suppress the anomaly event.

• Anomaly DNA Technology—decoding
The next step following the detection of the anomaly and its structure is decoding its DNA. Decoding the anomaly DNA is done by analyzing and classifying it based on the topology (CIs and their topological structure), the metrics, and additional information. In particular, the decoding achieves:
– Separation of suspects, thus providing actionable information
– Identification of business impact using business-related information: user volume, service-level agreements (SLAs), and affected geographical areas, thus allowing prioritization of the anomaly according to its impact
– Identification of related changes that may have affected the system behavior

• Anomaly DNA Technology—matching
With the anomaly DNA structure decoded, matching of the current anomaly with past anomalies is performed.
The matching is performed with a unique graph similarity algorithm, which compares abstract anomaly structures, thus allowing matching between anomalies that were detected on different services that have a similar architecture. The advantages of our matching are:
– Enables reuse of discovered solutions to past events
– Matches to anomalies of known issues that are yet to be resolved, reducing the need to reinvestigate
– Reduces false alarms when a past similar anomaly was classified as a noisy DNA structure, for example an anomaly caused by normal maintenance actions on the service

• Anomaly DNA Knowledgebase
As the knowledge base of past anomalies and their resolutions is collected, advanced data mining methods analyze it and generate the relationships between all anomalies, thus creating a map of the entire Anomaly DNA Knowledgebase. Our anomaly DNA matching algorithm defines the required metric space for data mining methods such as clustering and classification. These are applied to provide the following benefits:
– Proactive problem resolution―identification of recurrent problems through classification of anomaly DNA into problem and resolution types, reducing the time to diagnose and resolve these types in the future
– Leveraging knowledge collected from various services that display similar behavior
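As a loose illustration of abstract-structure matching (not SHA's actual graph similarity algorithm), one could reduce each anomaly to type-level node and edge sets and compare them with Jaccard similarity, so that anomalies from different but similarly built services still match. All the data shapes and names here are invented:

```python
def dna_signature(anomaly):
    """Abstract an anomaly into type-level features, so anomalies from
    different services with a similar architecture can still match."""
    nodes = {(ci_type, metric) for ci_type, metric in anomaly["breaches"]}
    edges = {tuple(sorted(e)) for e in anomaly["relations"]}
    return nodes, edges

def dna_similarity(a, b):
    """Average Jaccard similarity over abstract nodes and edges --
    a simple stand-in for graph matching."""
    na, ea = dna_signature(a)
    nb, eb = dna_signature(b)
    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 1.0
    return 0.5 * jaccard(na, nb) + 0.5 * jaccard(ea, eb)

past = {
    "breaches": [("app_server", "heap"), ("db_server", "query_time")],
    "relations": [("app_server", "db_server")],
}
current = {  # same abstract shape, detected on a different service
    "breaches": [("app_server", "heap"), ("db_server", "query_time")],
    "relations": [("db_server", "app_server")],
}
unrelated = {
    "breaches": [("web_server", "cpu")],
    "relations": [],
}
print(dna_similarity(past, current))    # 1.0 -- same abstract structure
print(dna_similarity(past, unrelated))  # 0.0
```

Because the signature keeps CI types rather than CI names, a heap-plus-slow-queries pattern seen on one shop service matches the same pattern on another, which is the reuse property the text highlights.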

Product capabilities

Built upon the HP RtSM, HP SHA analyzes historical norms and trends of both applications and infrastructure, and compares that data against real-time performance metrics. Leveraging a run-time service model is crucial for your dynamic environment so you can:
• Correlate anomalies to topology changes and past issues
• Understand the business impact of each issue and prioritize resolution
• Identify the suspects of the issue and use that knowledge to prevent similar issues in the future

SHA automatically learns the dynamic thresholds in your environment so you don't have to invest the labor to set and maintain static thresholds. SHA works on metrics from the following BSM data sources:
• HP Business Process Monitor
• HP Diagnostics
• HP Network Node Manager i
• HP Operations Manager, Performance Agent
• HP Real User Monitor
• HP SiteScope

SHA identifies anomalies based on abnormal metric behavior related to the RtSM, sets a KPI, and generates an event with context to help identify the business priority of the issue. Additionally, SHA uses Anomaly DNA Technology to analyze the structural makeup of an anomaly, and it compares that with the known DNA of other anomalies. Matches provide known remediation actions without further investigation, and matches marked as noise are suppressed. If you have abnormalities related to a specific service, you can see the SLAs and know the impact that anomaly could cause. Finally, SHA incorporates remediation capabilities from the HP Closed Loop Incident Process (CLIP) solution and provides direct integration with HP Operations Orchestration. For instance, you can fuse analytics and automation together to remediate issues quickly. When SHA sends an event into OMi, an operator can take action before service is impaired with the CLIP process.
This quick remediation solution simplifies the complexities of virtualization and cloud computing environments.

Get started with zero configuration and zero maintenance

After you install the product, you select the applications that you want to monitor, and SHA starts collecting data and learning your system behavior. SHA gathers data from the application, infrastructure, database, network, and middleware, as well as topology information from the RtSM, and learns the baseline. The baseline defines the normal behavior of an individual metric over time, including its seasonal characteristics. For example, normal behavior for a metric may include a very busy Monday morning and a very quiet Friday afternoon.

Figure 3. Example of a dynamic baseline sleeve (gray band) with actual metric data (purple)

After the dynamic baselines are established for all the application metrics, the SHA RAD Engine starts looking for anomalies in application behavior. The entry point into the RAD Engine is a baseline breach indicating that a metric is exhibiting abnormal behavior. To define an anomaly, the RAD Engine takes the abnormal metric information gathered from all monitored metrics and couples that with topology information from the RtSM to determine whether there are multiple breaches, from different metrics, affecting the same service. If an anomaly is detected, an event is generated and sent to the event subsystem. Additionally, when an anomaly is detected, SHA automatically captures the current topology of the CIs involved with the event. The value of this is understanding the topology as it was at the time of the anomaly, which is especially valuable when reviewing anomalies that occurred overnight or when there are no on-call operators to address the issues. SHA also collects and presents discovered changes for the relevant CIs so this information can be used as part of the root cause analysis. This correlation means faster troubleshooting and reduced mean time to repair (MTTR).

When SHA discovers an anomaly in the application behavior, it changes the status of the Predictive Health KPI and triggers an event that is sent to the BSM event browser. From this point you can start to drill down, isolate the problem, and understand its impact on the business.

SHA provides a page with anomaly highlights that contains all you need to know about the problem and its impact on your business, as well as advanced isolation capabilities if you need to drill down and investigate further.
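The RAD entry point described above can be sketched roughly as follows, with invented metric names, a trivial sleeve check, and a hard-coded breach count standing in for SHA's statistical grouping:

```python
from collections import defaultdict

def breached(value, sleeve):
    """A sample breaches the dynamic baseline if it leaves the sleeve."""
    low, high = sleeve
    return value < low or value > high

def detect_anomalies(samples, sleeves, ci_to_service, min_breaches=2):
    """Group baseline breaches by the service their CI belongs to; an
    anomaly is raised only when several metrics on the same service
    breach together (a toy version of the RAD Engine's entry point)."""
    by_service = defaultdict(list)
    for metric, value in samples.items():
        if breached(value, sleeves[metric]):
            ci = metric.split("@")[1]            # e.g. "cpu@web-1" -> "web-1"
            by_service[ci_to_service[ci]].append(metric)
    return {svc: m for svc, m in by_service.items() if len(m) >= min_breaches}

sleeves = {"cpu@web-1": (10, 60), "heap@app-1": (100, 400), "cpu@mail-1": (5, 50)}
ci_to_service = {"web-1": "shop", "app-1": "shop", "mail-1": "mail"}
samples = {"cpu@web-1": 95, "heap@app-1": 480, "cpu@mail-1": 70}
print(detect_anomalies(samples, sleeves, ci_to_service))
```

Here the shop service raises an anomaly because two of its metrics breach together, while the lone breach on the mail server is treated as a momentary deviation and suppressed.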

Figure 4. An anomaly highlights page

At the top of figure 4, "An anomaly highlights page," you can find the "suspects list." The suspects are CIs (applications, transactions, infrastructure elements) that were identified by SHA as the possible cause of the anomaly. Suspects can be CIs whose metrics breached their baseline, anomaly patterns that were previously identified by the user as abnormal, and CIs that failed verifications with a user-provided verification tool.

The highlights page also provides the anomaly's business impact by presenting which SLAs were breached because of the anomaly, the services and applications that were affected, and a breakdown of the locations that were impacted.

SHA also offers to run relevant reports to drill down and get a better view of the problem. The similar anomalies section is generated using Anomaly DNA Technology, and it provides more confidence in the occurrence of the problem by showing a list of similar patterns, along with additional information about how they were handled.

SHA provides a problem investigation and isolation tool to drill down into the anomaly and isolate a possible root cause of the problem with the Subject Matter Expert User Interface (SME UI). The investigation tool allows you to "travel in time" through the anomaly and get a detailed view of the turn of events that led to the problem, as reflected in the application topology.

The figure below shows an example of an anomaly and its turn of events over time.

Figure 5. SME UI showing topology of anomaly

The lower part of the screen shows the events in the system as they occurred and were captured by SHA before and during the anomaly:
• At 06:15 a.m., SHA recorded a discovered change in the system.
• At 06:30 a.m., SHA triggered an anomaly. This means it detected some abnormal metrics that breached their baselines, before SiteScope and OM, which were monitoring the system, discovered the problem. At this point in time, SHA had already triggered an event that was sent to the operations personnel.
• At 08:00–08:20 a.m., SiteScope and OM triggered events on high CPU usage. The reason SiteScope and OM discovered the problem later than SHA is that their thresholds were set higher than SHA's dynamic baseline, in order to reduce noise and false positive alerts.
• At 8:30 a.m.,
the first real user experienced the performance problem and opened an incident.

As you can see, SHA discovered the problem and alerted on it two hours ahead of time, before any of the users complained about it, giving the operations personnel advance notice to handle and resolve it.

SHA provides you with a powerful tool to correlate metrics and find out which of them may be the root cause of the problem in your system.
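A minimal sketch of such correlation-based suspect ranking, with invented metric names and plain Pearson correlation standing in for SHA's more sophisticated statistical algorithms:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally sampled time series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_suspects(target, candidates):
    """Rank candidate metrics by how strongly they correlate with the
    user-facing metric during the anomaly window."""
    scores = {name: abs(pearson(target, series))
              for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rum_response_time = [0.2, 0.3, 0.9, 1.4, 1.8, 2.1]      # what users saw
candidates = {
    "paging_file_usage": [10, 15, 52, 75, 94, 99],       # tracks the slowdown
    "network_in_mbps":   [30, 28, 31, 29, 30, 32],       # roughly flat
}
ranking = rank_suspects(rum_response_time, candidates)
print(ranking[0][0])  # paging_file_usage correlates most strongly
```

This mirrors the metric-view workflow: the user-facing RUM response time is the target, and the infrastructure metric with the highest correlation (here, paging file usage, echoing the 81 percent example in the text) becomes the leading root-cause suspect.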

In the figure below you can see the SHA metric view, which is part of the SME UI.

Figure 6. SME UI in metric view

The metric view allows you to preview your application's metrics as they were captured during the anomaly time frame, within the "envelope" of their baseline. It also allows you to find out which of the metrics was the root cause of the problem by correlating it to the other metrics related to the same service, using sophisticated statistical algorithms.

In this example, the user decided to correlate the Real User Monitor (RUM) metric with all the others. This metric was picked because it best represents the real response time that actual users experience while using the application. The rest of the metrics are from infrastructure and middleware components, and the metric view provides a point-and-click mechanism to present their correlation to the poor response time. The metric with the highest correlation value (81 percent) was "SiteScope Paging File Usage", which indicates that the root cause is most likely insufficient memory allocation.

Return on investment

SHA calculates a return on investment (ROI) using information gathered from the deployment environment. The metric management section looks at the ROI from reducing the administrative labor of setting and maintaining thresholds with the self-learned dynamic thresholds that SHA delivers. The events and anomaly section looks at ROI from an event reduction perspective, comparing the current OMi event stream to the anomaly events generated by SHA. This information is rolled up into an overall efficiency figure.

Figure 7. SHA ROI view

Conclusion

SHA is HP's next-generation run-time predictive analytics solution that can anticipate IT problems before they occur by analyzing abnormal service behavior and alerting IT managers to real service degradation before the issue impacts their business. SHA delivers tight integration with the HP BSM solutions for event remediation to reduce MTTR. Additionally, SHA is simple to use, requires minimal configuration and settings, and has a small learning curve.

With SHA you no longer have to maintain your monitoring thresholds, as it constantly learns the behavior of your applications and adjusts them accordingly. It reduces your application MTTR because you get fewer events in your system, each of which represents a real problem and is focused on the root cause. And because it is powered by the dynamic HP RtSM, SHA can help IT operations identify potential issues across both the topology and the services, and solve them before the problem is even experienced by end users.

HP SHA is the new era of analytics in IT. For more information, visit www.hp.com/go/sha.

Copyright 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein.

4AA3-8672ENW, Created December 2011
