The Essential Guide To AIOps - Splunk


The Essential Guide to AIOps
Overcome data chaos and get continuous insight into your IT Operations

Table of Contents
• What Is AIOps?
• AIOps Today
• Key AIOps Use Cases
• AIOps and the Shift to Proactive IT
• How to Get Started With AIOps
• Why Splunk for AIOps Is Different
• The Bottom Line: Now Is the Time for AIOps

What Is AIOps?
AIOps is the practice of applying analytics and machine learning to big data to automate and improve IT operations. These new learning systems can analyze massive amounts of network and machine data to find patterns not always identified by human operators. These patterns can both identify the cause of existing problems and predict future impacts. The ultimate goal of AIOps is to automate routine practices in order to increase accuracy and speed of issue recognition, enabling IT staff to more effectively meet increasing demands.

History and Beginnings
The term AIOps was coined by Gartner in 2016. In the Market Guide for AIOps Platforms, Gartner describes AIOps platforms as “software systems that combine big data and artificial intelligence (AI) or machine learning functionality to enhance and partially replace a broad range of IT operations processes and tasks, including availability and performance monitoring, event correlation and analysis, IT service management and automation.”

AIOps Today
Ops teams are being asked to do more than ever before. In a common practice that can sometimes even feel laughable, old tools and systems never seem to die. Yet the same ops teams are under constant pressure to support more new projects and technologies, very often with flat or declining staffing. To top it off, increased change frequencies and higher throughput in systems often mean the data these monitoring tools produce is almost impossible to digest.

To combat these challenges, AIOps:
• Brings together data from multiple sources: Conventional IT operations methods, tools and solutions aggregate and average data in simplistic ways that compromise data fidelity (as an example, consider the aggregation technique known as “averages of averages”). They weren’t designed for the volume, variety and velocity of data generated by today’s complex and connected IT environments. A fundamental tenet of an AIOps platform is its ability to capture large data sets of any type while maintaining full data fidelity for comprehensive analysis. An analyst should always be able to drill down to the source data that feeds any aggregated conclusions.
• Simplifies data analysis: One of the big differentiators for AIOps platforms is their ability to correlate these massive, diverse data sets. The best analysis is only possible with all of the best data. The platform then applies automated analysis on that data to identify the cause(s) of existing issues and predict future issues by examining intersections between seemingly disparate streams from many sources.
• Automates response: Identifying and predicting issues is important, but AIOps platforms have the most impact when they also notify the correct personnel, automatically remediate the issue once identified or, ideally, execute commands to prevent the issue altogether. Common remedies such as restarting a component or cleaning up a full disk can be handled automatically so that staff are only involved once typical solutions have been exhausted.

Key Business Benefits of AIOps
By automating IT operations functions to enhance and improve system performance, AIOps can provide significant business benefits to an organization. For example:
• Avoiding downtime improves both customer and employee satisfaction and confidence.
• Bringing together data sources that had previously been siloed allows more complete analysis and insight.
• Accelerating root-cause analysis and remediation saves time, money and resources.
• Increasing the speed and consistency of incident response improves service delivery.
• Finding and fixing complicated issues more quickly improves IT’s capacity to support growth.
• Proactively identifying and preventing errors empowers IT teams to focus on higher-value analysis and optimization.
• Proactive response improves forecasting for system and application growth to meet future demand.
• Adding “slack” to an overwhelmed system by handling mundane work, allowing humans to focus on higher-order problems, yields higher productivity and better morale.
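To make the “averages of averages” problem mentioned in this section concrete, here is a minimal sketch in Python. The host names and response-time samples are invented for illustration; the point is only that averaging pre-averaged groups ignores how many raw samples each group held, which is why full-fidelity source data matters.

```python
from statistics import mean

# Hypothetical per-host response times in milliseconds (illustrative data only).
samples_by_host = {
    "web-1": [100, 100, 100, 100, 100, 100, 100, 100],  # busy host: 8 requests
    "web-2": [900, 900],                                 # slow host: 2 requests
}

def average_of_averages(groups):
    """Average each group first, then average those averages (weighting is lost)."""
    return mean(mean(values) for values in groups.values())

def true_average(groups):
    """Average over every raw sample, which requires keeping the source data."""
    all_values = [v for values in groups.values() for v in values]
    return mean(all_values)

print(average_of_averages(samples_by_host))  # 500
print(true_average(samples_by_host))         # 260
```

The rolled-up figure suggests a typical request takes 500 ms, while the raw samples show 260 ms; only the second number reflects what users experienced.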

Data Is Vital for AIOps
Data is the foundation for any successful automated solution. You need both historical and real-time data to understand the past and predict what’s most likely to happen in the future. To achieve a broad picture of events, organizations must access a range of historical and streaming data types of both human- and machine-generated data.

Better data from more sources will yield analytics algorithms better able to find correlations too difficult for humans to isolate, allowing the resulting automation tasks to be better curated. For example, it’s not hard in most semi-modern monitoring systems to automate some sort of response. However, if response times slow down an application, AIOps would help ensure the correct automated response and not just the “knee-jerk” response that’s statically connected. Adding more capacity to a service may in fact make a slowdown worse if the bottleneck isn’t related to capacity. And it certainly can result in unintended and unnecessary costs in cloud environments. Thus, having the right data to make more-complete decisions results in better outcomes.

For total visibility, it’s necessary to access data in one place across all of your IT silos. It’s important to understand the underlying data supporting your services and applications, defining KPIs that determine health and performance status. As you move beyond data aggregation, search and visualizations to monitor and troubleshoot your IT, machine learning becomes key to achieving predictive analysis and automation.

Key AIOps Use Cases
According to Gartner, there are five primary use cases for AIOps:
1. Performance analysis
2. Anomaly detection
3. Event correlation and analysis
4. IT service management
5. Automation

1. Performance analysis: It has become increasingly difficult for IT professionals to analyze their data using traditional IT methods, even as those methods have incorporated machine learning technology. The volume and variety of data is just too large. AIOps helps address the problem of increasing volume and complexity of data by applying more sophisticated techniques to analyze bigger data sets to identify accurate service levels, often preventing performance problems before they happen.

2. Anomaly detection: Machine learning is especially efficient at identifying data outliers, that is, events and activities in a data set that stand out enough from historical data to suggest a potential problem. These outliers are called anomalous events. Anomaly detection can identify problems even when they haven’t been seen before, and without explicit alert configuration for every condition.

Anomaly detection relies on algorithms. A trending algorithm monitors a single key performance indicator (KPI) by comparing its current behavior to its past. If the score grows anomalously large, the algorithm raises an alert. A cohesive algorithm looks at a group of KPIs expected to behave similarly and raises alerts if the behavior of one or more changes. This approach provides more insight than simply monitoring raw metrics and can act as a bellwether for the health of components and services.

AIOps makes anomaly detection faster and more effective. Once a behavior has been identified, AIOps can monitor and detect significant deviations between the actual value of the KPI of interest versus what the machine learning model predicts. Accurate anomaly detection is vital in complex systems as failures often exist in ways that are not always immediately clear to the IT professionals supporting them.

3. Event correlation and analysis: The ability to see through an “event storm” of multiple, related warnings to identify the underlying cause of events. The reality of most complex systems is that something is always “red” or alerting. It’s inevitable. The problem with traditional IT tools, however, is that they don’t provide insights into the problem, just a storm of warnings. This creates a phenomenon known as “alert fatigue”; teams see a particular alert that turns out to be trivial so often that they ignore the alert even on the occasions when it’s important.

AIOps automatically groups notable events based on their similarity. Think of this as drawing a circle around events that belong together, regardless of their source or format. This grouping of similar events reduces the burden on IT teams and reduces unnecessary event traffic and noise. AIOps focuses on key event groups and performs rule-based actions such as consolidating duplicate events, suppressing alerts or closing notable events. This enables teams to compare information more effectively to identify the cause of the issue.
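Returning to the anomaly detection use case, the trending and cohesive algorithms described there can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor’s implementation: the trending score is assumed to be a simple z-score against the KPI’s own history, the cohesive check is assumed to compare each KPI with the group median, and the function names, thresholds and sample values are all hypothetical.

```python
from statistics import mean, median, stdev

def trending_score(history, current):
    """Z-score of the current KPI value against its own past (needs >= 2 points)."""
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is maximally surprising.
        return 0.0 if current == mean(history) else float("inf")
    return abs(current - mean(history)) / spread

def is_trending_anomaly(history, current, threshold=3.0):
    """Alert when the score grows anomalously large (threshold is an assumption)."""
    return trending_score(history, current) > threshold

def cohesive_outliers(current_by_kpi, tolerance=0.5):
    """Return KPIs that strayed from a group expected to behave alike.

    A KPI is flagged when it differs from the group median by more than
    `tolerance` expressed as a fraction of that median.
    """
    center = median(current_by_kpi.values())
    return sorted(
        name for name, value in current_by_kpi.items()
        if abs(value - center) > tolerance * abs(center)
    )

# Hypothetical CPU utilisation history (percent) for one server.
cpu_history = [50, 52, 49, 51, 50, 48, 52, 50]
print(is_trending_anomaly(cpu_history, 51))  # within normal variation
print(is_trending_anomaly(cpu_history, 80))  # far outside past behaviour

# Hypothetical latency (ms) for three servers behind the same load balancer.
print(cohesive_outliers({"web-1": 40, "web-2": 42, "web-3": 95}))
```

The median is used for the cohesive check because, in a small group, a single misbehaving member would otherwise drag the mean toward itself and hide its own deviation.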

4. IT service management (ITSM): A general term for everything involved in designing, building, delivering, supporting and managing IT services within an organization. ITSM encompasses the policies, processes and procedures of delivering IT services to end users within an organization.

AIOps provides benefits to ITSM by letting IT professionals manage their services as a whole rather than as individual components. They can then use those whole units to define the system thresholds and automated responses to align with their ITSM framework, helping IT departments run more efficiently.

AIOps for ITSM can help IT departments to manage the whole service from a business perspective rather than managing components individually. For example, if one server in a pool of three machines encounters problems during a normal-load period, the risk to the overall service may be considered low, and the server can be taken offline without any user-facing impact. Conversely, if the same thing were to happen during a high-load period, an automated decision could be taken to add new capacity before taking any poor-performing systems offline.

In addition, AIOps for ITSM can help:
• Manage infrastructure performance in a multicloud environment more consistently
• Make more accurate predictions for capacity planning
• Maximize storage resource availability by automatically adjusting capacity based on forecasting need
• Improve resource utilization based on historical data and predictions
• Manage connected devices across a complex network

5. Automation: Legacy tools often require manually cobbling information together from multiple sources before it’s possible to understand, troubleshoot and resolve incidents. AIOps provides a significant advantage: automatically collecting and correlating data from multiple sources into complete services, increasing the speed and accuracy of identifying necessary relationships. Once an organization has a good handle on correlating and analyzing data streams, the next step is to automate responses to abnormal conditions.

An AIOps approach automates these functions across an organization’s IT operations, taking simple actions that responders would otherwise be forced to take themselves. Take for example a server that tends to run out of disk space every few weeks during high-volume periods due to known-issue logging. In a typical situation, a responder would be tasked with logging in, checking for normal behavior, cleaning up the excessive logs, freeing up disk space and confirming nominal performance has resumed. These steps could be automated so that an incident is created and responders are notified only if normal responses have already been tried and have not remedied the situation. These actions can range from the simple, like restarting a server or taking a server out of load-balancer pools, to more sophisticated, like backing out a recent change or rebuilding a server (container or otherwise).
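The disk-space example in the automation use case (try the known fixes first, notify a responder only if they fail) can be sketched as a small escalation loop. This is a hedged illustration: `Remediation`, `Outcome`, `auto_remediate` and the simulated disk are hypothetical names invented here, and a real system would call actual cleanup and paging tools instead of the stand-ins below.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Remediation:
    name: str
    action: Callable[[], None]  # e.g. clean up logs, restart a service

@dataclass
class Outcome:
    resolved: bool
    attempted: List[str] = field(default_factory=list)
    escalated: bool = False

def auto_remediate(is_healthy, remediations, notify):
    """Try each known fix in order; notify a human only if none restores health."""
    outcome = Outcome(resolved=is_healthy())
    for step in remediations:
        if outcome.resolved:
            break
        step.action()
        outcome.attempted.append(step.name)
        outcome.resolved = is_healthy()  # confirm nominal behaviour resumed
    if not outcome.resolved:
        notify(f"Automated fixes failed: {', '.join(outcome.attempted)}")
        outcome.escalated = True
    return outcome

# Simulated server whose disk fills up because of known-issue logging.
disk = {"used_pct": 97}

def disk_ok():
    return disk["used_pct"] < 90

def clean_logs():
    disk["used_pct"] -= 30  # stand-in for deleting excessive log files

pages = []  # stand-in for a paging or incident system
result = auto_remediate(disk_ok, [Remediation("clean_logs", clean_logs)], pages.append)
print(result.resolved, result.attempted, pages)
```

Here the cleanup succeeds, so no page is sent; if `clean_logs` had not freed enough space, the responder would be notified along with the list of fixes already attempted.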

AIOps automation can also be applied to:
• Servers, OS and networks: Collect all logs, metrics, configurations and messages to search, correlate, alert and report across multiple servers.
• Containers: Collect, search and correlate container data with other infrastructure data for better service context, monitoring and reporting.
• Cloud monitoring: Monitor performance, usage and availability of cloud infrastructure.
• Virtualization monitoring: Gain visibility across the virtual stack, make faster event correlations, and search transactions spanning virtual and physical components.
• Storage monitoring: Understand storage systems in context with corresponding app performance, server response times and virtualization overhead.
• Application monitoring: Identify application service levels and suggest or automate response to maintain defined service level objectives.

AIOps and the Shift to Proactive IT
One of the primary benefits of AIOps is its ability to help IT departments predict and prevent incidents before they happen, rather than waiting to fix them after they do. AIOps, specifically the application of machine learning to all of the data monitored by an IT organization, is designed to help you make that shift today.

By reducing the manual tasks associated with detecting, troubleshooting and resolving incidents, your team not only saves time but adds critical “slack” to the system. This slack allows you to spend time on higher-value tasks focused on increasing the quality of customer service. Your customer experience is maintained and improved by consistently maintaining uptime.

AIOps can have a significant impact in improving key IT KPIs, including:
• Increasing mean time between failures (MTBF)
• Decreasing mean time to detect (MTTD)
• Decreasing mean time to investigate (MTTI)
• Decreasing mean time to resolution (MTTR)

IT organizations who have implemented a proactive monitoring approach with AIOps have seen significant improvement in a variety of IT metrics, including:
• 15 - 45%: High Priority Incidents
• 70 - 90%: Incident Investigation Time
• 10 - 15%: Time to Market for New Applications
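For teams that want to track the KPIs named in this section, the sketch below computes MTTD, MTTR and MTBF from a list of incident records. The `Incident` fields and the sample timestamps are hypothetical, and definitions vary between organizations: this sketch assumes MTTR is measured from the start of the failure to resolution, and MTBF as the uptime between one resolution and the next failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Incident:
    started: datetime   # when the failure actually began
    detected: datetime  # when monitoring raised it
    resolved: datetime  # when service was restored

def mean_minutes(deltas):
    """Average a collection of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60

def mttd(incidents):
    """Mean time to detect: failure start to detection."""
    return mean_minutes(i.detected - i.started for i in incidents)

def mttr(incidents):
    """Mean time to resolution, assumed here to run from failure start."""
    return mean_minutes(i.resolved - i.started for i in incidents)

def mtbf(incidents):
    """Mean uptime between failures (needs at least two incidents)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return mean_minutes(gaps)

# Two illustrative incidents on the same day.
t0 = datetime(2020, 1, 1, 0, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=60)),
    Incident(t0 + timedelta(hours=5),
             t0 + timedelta(hours=5, minutes=20),
             t0 + timedelta(hours=5, minutes=40)),
]
print(mttd(incidents), mttr(incidents), mtbf(incidents))  # 15.0 50.0 240.0
```

Computing these figures before and after introducing automation gives a simple, evidence-based way to check whether the shift to proactive IT is paying off.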

How to Get Started With AIOps
The best way to get started with AIOps is an incremental approach. As with most new technology initiatives, a plan is key. Here are some important considerations to get you started.

Choose Inspiring Examples
If you’re evaluating AIOps solutions, platforms and vendors for your organization, you’ve got a big task ahead of you. The most challenging aspect may not be the evaluation process itself, but gaining the support and executive buy-in you need to conduct the evaluation. If you choose inspiring examples of other, similar organizations that have benefited from AIOps (and have metrics to prove it), you’ll have a much easier time getting the go-ahead. A good partner can help you do that. (See Select the Right Partner below.)

Consider People and Process
It’s obvious that technology plays an important role in AIOps, but it’s just as important to make a plan to address people and process. For example, if an AIOps solution identifies a problem that’s about to happen and pages a support team to intervene, a responder might ignore the warning because nothing has actually happened yet. This can undermine trust in the AIOps solution before it has a chance to be proven in operation.

It’s also important to give IT teams the time to work on building, maintaining and improving systems. This vital work can’t be assigned as a side project or entry-level job if you expect meaningful change. Put your best people on it. Make it a high priority so other work can’t infringe on it. AIOps practices are iterative and must be refined over time; this can only be done with mature and consistent focus on improvement.

You’ll also need to re-examine and adjust previously manual processes that had multiple levels of manager approval, like restarting a server. This requires trust in both technology and team practices. Building trust takes time. Start with simple wins to build cultural acceptance of automation. For example, be prepared to build historical reports that show previous incidents were correctly handled by a consistent, simple activity (such as a restart or disk cleanup) and offer to automate those tasks on similar future issues. Choose a solution that allows for “automation compromise” by inserting approval gates for certain activities. Over time, those gates should be removed to improve speed as analytics proves its value in selecting correct automation tasks.

Finally, include in your plans a campaign to reassure staff that AIOps is not intended to replace people with robots. Show them how AIOps can free up key resources to work on higher-value activities, limiting the unplanned work your teams have to endure each day.
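The “automation compromise” idea of approval gates that are removed as trust grows might look like the following in code. This is only a sketch under assumptions: `GatedAutomation`, the action names and the approval callback are hypothetical, and a real deployment would route approvals through whatever change-management tooling the organization already uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

@dataclass
class GatedAutomation:
    # Actions that still need human sign-off; shrink this set as trust grows.
    gated_actions: Set[str] = field(default_factory=set)
    log: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, action: str, execute: Callable[[], None],
            approve: Callable[[str], bool]) -> bool:
        """Run `execute` unless the action is gated and approval is refused."""
        if action in self.gated_actions and not approve(action):
            self.log.append((action, "blocked"))
            return False
        execute()
        self.log.append((action, "executed"))
        return True

    def remove_gate(self, action: str) -> None:
        """Drop the approval requirement once the automation has proven itself."""
        self.gated_actions.discard(action)

automation = GatedAutomation(gated_actions={"restart_server"})
restarts = []

def restart():
    restarts.append("restarted")  # stand-in for a real restart command

def deny(action):
    return False  # stand-in for a manager declining the request

first = automation.run("restart_server", restart, deny)   # gate blocks it
automation.remove_gate("restart_server")
second = automation.run("restart_server", restart, deny)  # gate removed, runs
print(first, second, automation.log)
```

Keeping a log of blocked and executed actions also supplies the historical reports the guide recommends for building confidence in automation.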

Unlock Your Data
Enabling AIOps requires access to all types of data: unstructured machine data and structured metrics, as well as relational data for enrichment. Consider data not only by its type but also by its position in “the stack,” starting with infrastructure and moving upward to the application and finally the business application. You want data from each layer.

[Figure: data sources by layer of the stack, contrasting traditional IT ops monitoring with business-value monitoring. Labels include Call Center, APM, Infrastructure, Customer Retention, Mobile, Web, Synthetics, API, Network, App Logs, Cloud, Storage, Middleware, Hardware, Syslogs, DB, VM and OS (Win/Linux).]

Historical data is extremely valuable as you get started with AIOps. If you start by analyzing and understanding past states of your systems, you will be able to correlate what you learn with the present to develop meaningful service level thresholds.

To achieve this, organizations must ingest and provide access to a vast range of historical and streaming data types. The data type that you select, which could be anything from logs, metrics and text to wire and social media data, depends on the problem you’re solving. For example, you can use metric data from your infrastructure to monitor capacity, or application logs to ensure that you are providing an outstanding experience to your customers.

Many AIOps platforms have historically focused on a single data source. Restriction to a single data type limits your insights into system behavior, regardless of whether those insights come from an IT admin or an algorithm. Hence, enterprises should select platforms that are capable of ingesting and analyzing data from multiple sources.

Ingesting and analyzing all of the data effectively and quickly can be daunting. Instead, start by accessing and analyzing raw historical machine and metric data to establish a base understanding, and use clustering algorithms and analytics to identify trends and patterns. Raw data is best for real-time detection. Then, you can begin to analyze streaming data to see how it fits those patterns, applying artificial intelligence that’s powered by machine learning to introduce automation and, eventually, predictive analytics.

These different data types allow you to construct a holistic perspective across all silos and take actions meaningful to the situation and data type. Your goal is to identify data sources at each layer of the service, beginning with infrastructure (cloud or traditional) and moving up to application performance, finally tying in identifiable business outcomes (such as customer satisfaction, revenue, number of orders, wait times and so on). Pick a very small number of sources (one or two) at each level and begin by correlating those.
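As one way to turn historical data into the “meaningful service level thresholds” this section describes, the sketch below learns a separate upper threshold for each hour of the day from past metric values. It is an assumption-laden illustration: the mean-plus-k-standard-deviations rule, the function names and the sample request counts are all hypothetical choices, not a prescribed method.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hourly_thresholds(history, k=3.0):
    """Learn an upper threshold per hour of day from (hour, value) history.

    The threshold is mean + k * population standard deviation of the values
    seen at that hour, so busy and quiet hours get different limits.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: mean(values) + k * pstdev(values)
            for hour, values in by_hour.items()}

def breaches(thresholds, hour, value):
    """True when a value exceeds the learned threshold for that hour.

    Hours with no history never breach; a real system would need a fallback.
    """
    return hour in thresholds and value > thresholds[hour]

# Hypothetical requests-per-second history: busy at 09:00, quiet at 03:00.
history = [(9, 100), (9, 110), (9, 90), (3, 10), (3, 12), (3, 8)]
thresholds = hourly_thresholds(history)

print(breaches(thresholds, 9, 50))  # False: 50 req/s is ordinary at 09:00
print(breaches(thresholds, 3, 50))  # True: 50 req/s is unusual at 03:00
```

The same reading (50 requests per second) is normal during business hours and suspicious overnight, which a single static threshold cannot express.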

Select the Right Partner
As interest in AIOps has grown, some vendors are packaging traditional IT operations tools together, adding basic AI features and calling the result an AIOps “platform.” A true AIOps platform isn’t just a collection of tools. This is important to understand as you get started, because the platform you choose will determine your success.

Gartner recommends that enterprises “prioritize those vendors that allow for the deployment of data ingestion, storage and access, independent from the remaining AIOps components.” You need a platform that can gather all the necessary data at full fidelity, not just aggregations or rollups. You need a platform that can then enrich, analyze and crunch that data to meaningful conclusions and insights (and without requiring heavy amounts of custom work to configure or maintain). And you need a platform that integrates proper automation to take the right action at the right time as a tight ecosystem.

Look at feature sets, and also review customer case studies and AIOps use cases. The easiest way to know if an AIOps platform could meet your needs is to find customer case studies that show how a company similar to yours addressed their business challenges with AIOps. Look for vendors who showcase their customers online and ask for customer references. If an AIOps tool or platform promises great results but the company can’t provide evidence, that should be a clue to look elsewhere.

Why Splunk for AIOps Is Different
Splunk makes it easy to ingest almost any kind of data from almost any source, real-time or historical, and then apply advanced analytics: predictive analytics, prediction and forecasting, event management and analytics, clustering, adaptive and statistical thresholding, anomaly detection, root cause determination and more. This unique approach helps enhance a broad range of IT operations and tasks and allows companies to get value not possible with human analysis alone.

A Differentiated Approach to Data: Data-to-Everything
Splunk’s AIOps platform is the only one built with the power of Splunk, the Data-to-Everything Platform, empowering customers to use the data explosion as an opportunity to drive effectiveness, productivity, insights and automation, and to turn data into action anywhere in the organization.

Even the best machine learning capabilities become powerless without the right data to support them. The rise in complexity, caused by the rapid growth in data volumes generated by IT infrastructure and applications, the increasing variety of data types, and the increasing velocity at which data is generated, is met with opposing forces of cost reduction, making it challenging for IT operations to adequately get their jobs done, let alone leverage the best analytics for transformation.

A differentiated approach to data can make all the difference between dabbling with features and achieving true success and transformation. As the Data-to-Everything Platform, on-prem or in the cloud, Splunk can ingest nearly any kind of data, like logs, metrics, text, wire, API, and even social-media derived, from nearly any tool and system. Splunk can ingest this data as structured, semi-structured or unstructured, and do all this either historically or in real time.

Imagine a single platform that unifies all your disparate data across all of your silos, and then imagine what AI and ML could do. Imagine teams no longer burdened with too many alerts, complex tools or siloed views, and imagine teams that get ahead of problems before they happen.

The Data-to-Everything Platform gives you the ability to supply your AIOps platform with all the data it needs to solve an enormous variety of IT challenges. Any other AIOps offering can only provide a partial solution.

Differentiators
• Flexible and scalable solution with AI and ML at its core
  Result: predict service degradation up to 30 minutes in advance
• Simplified event management and incident response with AIOps capabilities like dynamic thresholding and anomaly detection
  Result: decrease event noise by 95%
• Monitoring and insights across infrastructure, apps and services
  Result: monitoring and service performance health views for IT and business services

Key Capabilities
• Event Management and Analysis: Instantly group and correlate events to quiet the noise
• Thresholding: Account for and adapt to regular patterns in business activity and data
• Root Cause Determination: Mirror IT and business environments for faster investigation and identify top contributing KPIs
• Anomaly Detection: Pinpoint deviations from past behaviors to identify unusual events
• Predictive Analytics: Predict health scores and forecast trends to prevent incidents
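The idea of grouping and correlating similar events to quiet the noise, which this guide describes as drawing a circle around events that belong together regardless of source or format, can be illustrated with a short Python sketch. It is not Splunk’s implementation: the similarity rule (same service, same message once numbers are masked, within a five-minute window) and the event fields are assumptions made purely for illustration.

```python
import re

def signature(event):
    """Similarity key: the service plus the message with numbers masked out."""
    masked = re.sub(r"\d+", "#", event["message"]).lower()
    return (event["service"], masked)

def group_events(events, window_seconds=300):
    """Group events sharing a signature within `window_seconds` of the first one."""
    groups = []
    latest_group = {}  # signature -> index of the most recent group for it
    for event in sorted(events, key=lambda e: e["time"]):
        key = signature(event)
        idx = latest_group.get(key)
        if idx is not None and event["time"] - groups[idx]["first_seen"] <= window_seconds:
            # Duplicate of an open group: consolidate instead of alerting again.
            groups[idx]["count"] += 1
            groups[idx]["sources"].add(event["source"])
        else:
            latest_group[key] = len(groups)
            groups.append({
                "signature": key,
                "first_seen": event["time"],
                "count": 1,
                "sources": {event["source"]},
            })
    return groups

# Hypothetical events from two monitoring tools (time in seconds).
events = [
    {"time": 0, "source": "nagios", "service": "checkout", "message": "Disk 91% full on host 12"},
    {"time": 30, "source": "syslog", "service": "checkout", "message": "disk 95% full on host 12"},
    {"time": 60, "source": "nagios", "service": "search", "message": "Latency 800 ms"},
    {"time": 1000, "source": "nagios", "service": "checkout", "message": "Disk 97% full on host 12"},
]
groups = group_events(events)
print(len(groups), groups[0]["count"], sorted(groups[0]["sources"]))
```

Four raw events collapse into three groups: the two early disk warnings from different tools are consolidated, while the later one opens a new group because it falls outside the time window.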

Benefits of Splunk
• Reduce noise and complexity
  - Simplify incident detection with automated alerts and mobilization
  - Apply artificial intelligence and machine learning capabilities across all ITOps functions, for flexible and scalable solutions that grow with your organization
• Predict outages before they impact customers
  - Use predictive cause analysis on data across services, apps and infrastructure
  - Predict service degradation 30-40 minutes in advance through adaptive thresholds, anomaly detection and service health prediction algorithms
• 360-degree visibility
  - Complete visibility across app, system and infrastructure health
  - Bring together any type of data and performance metrics into one consumable place

The Bottom Line: Now Is the Time for AIOps
If you’re an IT and networking professional, you’ve been told over and over that data is your company’s most important asset, and that big data will transform your world forever. Machine learning and artificial intelligence will be transformative, and AIOps provides a concrete way to leverage their potential for IT. From improving responsiveness to streamlining complex operations to increasing productivity of your entire IT staff, AIOps is a practical, readily available way to help you grow and scale your IT operations to meet future challenges. Perhaps most important, AIOps can solidify IT’s role as a strategic enabler of business growth.

Learn More
For more information on AIOps:
• Artificial Intelligence for IT Operations (AIOps)
• Market Guide for AIOps Platforms

Splunk, Splunk>, Data-to-Everything, D2E and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names or trademarks belong to their respective owners. © 2020 Splunk Inc. All rights reserved.

2020-Splunk-AIOps-Essential Guide to AIOps-117-EB
