Predictive Models for Support of Incident Management Process in IT Service Management


Acta Electrotechnica et Informatica, Vol. 18, No. 1, 2018, 57–62, DOI: 10.15546/aeei-2018-0009
ISSN 1335-8243 (print), ISSN 1338-3957 (online), www.aei.tuke.sk, 2018 FEI TUKE

PREDICTIVE MODELS FOR SUPPORT OF INCIDENT MANAGEMENT PROCESS IN IT SERVICE MANAGEMENT

Martin SARNOVSKY, Juraj SURMA
Department of Cybernetics and Artificial Intelligence, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic
E-mail: martin.sarnovsky@tuke.sk, juraj.surma@student.tuke.sk

ABSTRACT
The work presented in this paper focuses on creating predictive models that help in the process of incident resolution and the implementation of IT infrastructure changes, in order to increase the overall support of IT management. Our main objective was to build the predictive models using machine learning algorithms and the CRISP-DM methodology. We used the incident and related changes database obtained from the IT environment of the Rabobank Group, which contained information about the processing of incidents during the incident management process. We decided to investigate the dependencies between the incident observation on a particular infrastructure component and the actual source of the incident, as well as the dependency between the incidents and related changes in the infrastructure. We used Random Forests and Gradient Boosting Machine classifiers in the process of identification of the incident source as well as in the prediction of the possible impact of the observed incident. Both types of models were tested on a testing set and evaluated using defined metrics.

Keywords: IT service management, incident management, classification, data analysis

1. IT SERVICE MANAGEMENT AND INCIDENT MANAGEMENT PROCESS

To precisely define what service management is, it is first necessary to specify what a service is, or more concretely, what an IT service is. According to [1], a service is a means of delivering value to customers by facilitating outcomes customers want to achieve without the ownership of specific risks and costs.
This definition of service is rather general. Speaking of IT services, we will consider services that in some way make ICT technologies available for use. An IT service can then be considered as one or more IT systems and mechanisms which enable the business processes of the organization. To ensure that IT services satisfy the customer's needs and that the corresponding ICT technologies are used effectively, they must be put under specialized management processes. This discipline is called IT Service Management (ITSM) [2], and it is defined as a set of specialized organizational capabilities for providing value to customers in the form of services. The main goal of ITSM is to ensure the delivery of quality IT services that support the business objectives of the organization while using resources cost-effectively. ITSM evolved over time into highly standardized frameworks based on best practices. These best practices evolved into industry standards for the management of ICT (ISO/IEC 20000) [3] and also into public-domain frameworks such as ITIL or COBIT [4]. ITIL (IT Infrastructure Library) is nowadays a de-facto standard when implementing ITSM in businesses. It provides a comprehensive set of best practices for ITSM. It is based on the experiences and mistakes made in the UK and Europe during the implementation of IT projects and provides a collection of the best practices observed in the IT service industry. Because ITIL included practices that really worked, it started to be adopted outside of the British government sector for which it was originally intended, and around the turn of the century ITIL came to be considered the internationally accepted standard for managing information services technology. Currently, ITIL consists of five parts, each corresponding to a particular phase in the IT service life cycle.
Service Strategy [5] provides a practical framework to design, develop and implement service management, not only from an organizational point of view but also as a source of strategic advantage. The strategy of the service provider must be based on the fact that the customer does not buy products, but tries to satisfy specific needs. The provider must understand the broader context of the current and potential markets where it operates or intends to provide such services. The Service Design [6] phase aims to design the services to meet agreed outcomes. A service is designed including its components and complemented with additional data such as functional and operational requirements, acceptance criteria and plans for the deployment of the service into operation. Service Transition [7] describes the life-cycle phase of transitioning the service into the live environment. It combines procedures including Release Management, Program Management and Risk Management. In addition, the publication describes the processes associated with change management. An equally important part of this phase is the concept of the Configuration Management Database (CMDB), a database that documents the attributes of each component of the IT infrastructure (known as a Configuration Item, CI) and provides a model of their dependencies. Service Operation [8] provides procedures for managing live and operating services in a production environment, achieving efficiency and effectiveness in service delivery, and supporting the services so that the produced value benefits the customer as well as the service provider. The processes described in this publication serve for monitoring, maintenance, and service improvement. This includes managing incidents and service requests, problem management, and operations management. Continual Service Improvement [9] contains the means for creating and maintaining value-added customer service by increasing service quality and the efficiency of operations. It combines principles, practices, and methods of quality management, change management and capability improvement, working to improve each stage of the lifecycle as well as the current services and processes.

The work presented in this paper mostly deals with the Service Operation phase and the handling of incidents. An incident in this context can be described as an event that leads to a service interruption or causes a decrease in the service quality level. Incident management is a process that specifies how to handle incidents in a unified way. The main objective of the process is to restore the service as soon as possible. It specifies the steps to be performed within the process, such as prioritization and categorization, and gives recommendations on how they are done. The process describes which information has to be recorded to provide an accurate representation of the incident, as well as the steps to be performed before the actual solution. Two different types of escalation are also introduced here. Functional and hierarchical escalations differ in how the escalation itself is performed. A functional escalation passes the incident directly to a specialized group (designed to solve incidents of that type), while a hierarchical escalation assigns the incident to a higher level in the hierarchical structure of the IT department or organization. The process then specifies the steps needed to close and review the incident.

2. INCIDENT MANAGEMENT DATA ANALYSIS

Our main objective in this work was to perform data analysis on top of the ITSM incident management data. We explored two different tasks.
The first one was to explore the dependency between the CIs which were primarily assigned to the incident by the Service Desk and the CIs which were actually responsible for the service breakdown (and were therefore the primary source of the incident). Often, the CIs reported with incidents are the CIs where the incident is observed, but they are not directly responsible for the service breakdown, as the incident could have been triggered elsewhere (on another CI). The second task was focused on exploring the dependency between incident and change management. Often, incidents (after their investigation) can lead to changes in the infrastructure (e.g. replacement of a CI with a newer one, etc.). For incident managers, the information whether an incident can lead to a change could be interesting. Our goal in this task is to build a model which will be able to predict the need for a change for a particular incident. We used the CRISP-DM (Cross-Industry Standard Process for Data Mining) [10] methodology, which is nowadays a standard for solving data analytical tasks. CRISP-DM consists of six major phases. Business/problem understanding focuses on understanding the project objectives and requirements and converting the problem into a data mining problem definition. Data understanding covers the data collection, getting familiar with the data, identifying data problems and gaining first insights. The data preparation phase covers the activities needed to obtain the final dataset from the raw data. It usually includes multiple methods of data transformation, attribute selection, and cleaning. The modeling phase involves the application of the modeling techniques and the calibration of their parameters to optimal values. The evaluation examines the constructed models and matches the results to the objectives set during the initial phases.
The deployment phase represents the implementation of the models into production. The following sub-sections present the particular phases of the methodology applied to our problem.

2.1. Problem understanding

Incident management is a process whose main objective is to restore the operation of an IT service affected by a corrupted CI as fast as possible. This process is often implemented in a non-ideal fashion, and several activities performed by human operators can cause delays. Therefore, there is a need for tools assisting in particular process activities in order to establish a more fluent and effective process execution, and in certain situations also to enable the automation of particular process segments. The main idea is to leverage the existing data about the incidents, their processing, and the related changes, and to use the knowledge extracted from these records to build predictive models designed to assist the operators during the incident management process. As mentioned above, we decided to focus on two selected tasks. From the data analysis perspective, we will build predictive models which could be used during the Incident Management process to assist the operators and people involved in the process with certain activities. The first model will be used to predict whether the CI associated with the incident is actually the one really responsible for the incident occurrence. We will use a suitable classification model, trained on the database of historical incidents, to predict whether the reported CI triggered the incident. The second model will investigate the dependency between the incidents and the changes in the infrastructure triggered by those incidents. In this case, too, we will use classification methods trained on the historical data; the target attribute will describe whether the incident will result in a change or not. Both models will be tested and evaluated using pre-defined criteria – we will focus on a selected set of metrics used to evaluate the models.
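Such evaluation metrics can all be computed from a binary confusion matrix. The following minimal sketch illustrates the definitions; it is written in Python purely for illustration (the paper's own tooling is R with H2O), and the counts are invented:

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, overall error rate and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    error_rate = (fp + fn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return {"precision": precision, "recall": recall,
            "error_rate": error_rate, "f1": f1}

# Invented counts: 80 true positives, 10 false positives,
# 10 false negatives, 100 true negatives.
m = metrics(tp=80, fp=10, fn=10, tn=100)
```

The ROC AUC, by contrast, is computed from ranked prediction scores rather than a single confusion matrix, and in this work it is provided by the machine learning library itself.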
At first, we will measure the classifier precision and error rate. A more detailed investigation of the model results will be described using the confusion matrix and the ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve) [11] metrics. Among combined metrics, we also used the F1 metric, which combines precision and recall.

2.2. Data understanding and data preparation

We used the data provided by the ICT division of the Rabobank Group (a Dutch bank) [12]. The dataset consisted of several files containing specific records. Change records contained information extracted from the Service Management tool from the process of Change Management and the implementation of changes. Incident records described the processing of the incidents. Interaction records also contain related records as well as resolution descriptions with knowledge-management-related fields. The last one was the Incident activity records dataset, which tracked specific activities related to the

solution of the particular incident. For our purposes, we worked mostly with the Incident and Change records datasets. Both contained detailed descriptions of the occurred incidents and changes, the associated Configuration Items, times of opening and closure, related incidents, related changes, etc. The dataset was used in several studies, mostly related to process mining [13][14] and the prediction of the impact of changes [15]. In [16], the authors used predictive models (based on trees, SVM and ensemble models) to predict the duration of a change and its overall impact; the overall goal was to predict the Service Desk workload based on interactions with the affected CI. Statistical methods were used in [17] to analyse the incident ticket attributes to identify trends and unusual patterns in operation. In general, research in this area often aims towards the automation of certain activities within the Service Operation processes to make the Service Desk more effective [18]. In [19], a decision-making model is introduced which is able (using a knowledge base) to achieve overall process automation and improve the efficiency of the provided incident responses. On the other hand, incident relations can also be investigated in order to find re-occurring or co-occurring incidents [20]. In some cases, certain predictive tools are integrated into frequently used ITSM tools; e.g. SAP HANA supports real-time predictions using SAP Predictive Analytics, and ServiceNow can be extended with a Predict Incidents module with such capabilities. Our task was similar to research performed in the area of investigation of incident relations.
We focused on investigating whether the reported CI was actually the CI that generated the incident, and on the relation between the incident and the resulting changes. The following paragraphs introduce the main attributes of the raw incident and change records data as present in the dataset.

Attributes Description – Incident records

CI name (Aff) – CI where a disruption of the service was noticed
CI Type (Aff) – type of the CI
CI Subtype (Aff) – sub-type of the CI
Service comp WBS (Aff) – every CI in the CMDB is connected to one Service Component, to identify who is responsible for the CI
Incident ID – unique ID of the incident
Status – status of the incident
Impact – impact of the service downtime on the customer
Urgency – how urgently the incident has to be solved
Priority – combines Impact and Urgency
Category – used to categorize the incidents into groups according to their similarity
KM number – Knowledge Document number – refers to the Knowledge Base
Open time – the time of the record opening in the Service management tool
Reopen time – if the incident was closed and re-opened
Resolved time – date and time when the incident was resolved
Closed time – date and time when the record was closed
Handle time – time needed to resolve the incident
Closure code – code that describes the type of the closure
Alert status – alert status based on the SLA (whether it was or was not breached)
#Reassignments – number of reassignments of the incident during the resolving
#Related Interactions – number of related interactions
Related Interactions – list of related interactions
#Related Incidents – number of related incidents
#Related Changes – number of related changes
Related Change – if a Change is related, it is recorded here (multivalue field if more Changes are related)
CI Name (CBy) – CI which caused the disruption of the service
CI Type (CBy) – CI type
CI Subtype (CBy) – CI sub-type

Attributes Description – Change records

CI name (Aff) – CIs affected by the Change
CI Type (Aff) – CI type
CI Subtype (Aff) – CI sub-type
Change ID – Change identifier
Change Type – Change category
Risk Assessment – specifies the impact on the business
Emergency Change – indicates if a Change is an Emergency one
CAB-approval needed – indicates if Change Advisory Board approval is needed
Planned Start – date and time of the Change implementation start
Planned End – date and time of the Change implementation finish
Scheduled Downtime Start – date and time of the scheduled downtime during the Change implementation
Scheduled Downtime End – date and time of the scheduled restore after the Change implementation
Actual Start – actual date and time of the Change implementation
Actual End – actual date and time of the service restore
Requested End Date – date and time of the requested service restore after the Change implementation
Change record Open Time – date and time of the Change record initiation
Change record Close Time – date and time of the Change record closure
Originated from – specifies the origin of the Change request
#Related Interactions – number of interactions during the Change implementation
#Related Incidents – number of incidents related to the Change

The very first step of the data pre-processing was the identification and removal of missing values. Nine of the Incident dataset attributes contained missing values. After data inspection, we removed several records with missing values and selected a missing-value placeholder which marked the missing value occurrences. When applicable, we replaced the missing value with 0 (in cases where the missing value meant that the event did not occur, for example for Reassignments); for several numeric attributes (where it made sense, e.g. the number of related interactions) we used replacement with the mean value. The next step was to filter the records in both datasets, as the incident dataset also contained records representing service requests, informative

records, etc. The main idea was to keep only the incident records in the incident dataset and only the changes in the change dataset. The Open.time attribute was transformed and new attributes were created; those attributes specify the month, day of the week and hour of the incident opening. After the data cleaning and pre-processing, we integrated the data into a common consistent table.

Then we had to define and create the target attributes for both models. The target variables in this case were not specified explicitly in the dataset, but could be derived from certain attributes in the tables. For the first predictive task, we created an attribute CI.Name.equality, which specifies whether the CI where the incident was noticed was really responsible for the incident occurrence. We compared the CI.Name.aff and CI.Name.CBy attribute values; in case those values were equal, the CI.Name.equality value was set to 1, and if they were different, we set the newly created attribute to 0. We used a similar approach to create the target attribute for the second predictive task. In this case, we created an attribute Change.ID.equality, whose value was derived from the attribute values of Change.ID and Related.Change. We also explored the distribution of the target attribute values in the dataset and decided to use one of the techniques for the imbalanced class problem; those will be described in the modeling and evaluation sections. Then we computed the descriptive characteristics of the dataset attributes and their respective correlations, and applied feature extraction methods. We decided to remove several attributes that did not have a significant impact on classification and obtained a final set of predictors (e.g. we used only the Priority attribute and left out the Impact and Urgency attributes, as the Priority value is directly computed from both of them).
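The target-attribute construction and the Open.time decomposition described above can be sketched as follows. This is an illustrative Python fragment (the paper performs these steps in R); the sample record and the timestamp format are invented:

```python
from datetime import datetime

def prepare(rec: dict) -> dict:
    """Derive the first task's target attribute and the time features for one incident record."""
    out = dict(rec)
    # CI.Name.equality = 1 if the CI where the incident was observed
    # (CI.Name.aff) is also the CI that caused it (CI.Name.CBy), else 0.
    out["CI.Name.equality"] = int(rec["CI.Name.aff"] == rec["CI.Name.CBy"])
    # Decompose the record opening time into month, day of week and hour.
    t = datetime.strptime(rec["Open.time"], "%Y-%m-%d %H:%M")
    out["Open.month"] = t.month
    out["Open.weekday"] = t.weekday()   # 0 = Monday
    out["Open.hour"] = t.hour
    return out

# Invented example: incident observed on SRV001 but caused by DB017,
# so the derived target attribute is 0.
row = prepare({"CI.Name.aff": "SRV001", "CI.Name.CBy": "DB017",
               "Open.time": "2014-03-07 14:25"})
```

The Change.ID.equality target for the second task is derived analogously from Change.ID and Related.Change.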
Among the most significant attributes in both tasks were CI Type, CI Subtype and Service comp, as well as the attributes derived from Open.time.

2.3. Modeling

During this phase, we focused on training the predictive models. We used the R environment and selected the H2O framework (http://www.h2o.ai/) as the machine learning tool. H2O is an open-source software for data analysis and machine learning. It provides an API for the Java, Python and R languages [21]. It also enables developers to create an H2O cluster on top of big data analysis platforms and infrastructures and to access the implemented distributed machine learning models from the R environment. The H2O package contains implementations of the currently most popular machine learning algorithms, such as Generalized Linear Models (GLM), Random Forests, Gradient Boosting (GBM), K-Means and Deep Learning, among many others, including utilities and tools for data access, preprocessing, etc.

For model training, we split the dataset into training, validation and testing sets of different sizes. The training set was used to build the predictive models, the validation set was used to optimize the model parameters, and the completely independent testing set was used for evaluation purposes. We also performed several experiments using the cross-validation technique, to check whether it brings any benefit when used instead of dataset splitting. Then,
we used several approaches to balance the distribution of the target attribute. We built predictive models based on the Random Forest and GBM algorithms in both tasks. Those models were selected after preliminary experiments, which showed that these models were (precision- and recall-wise) more suitable for the data. Therefore, we continued with the training and optimization of these models using the validation set. In the first task, we experimented with different parameters of the Random Forest and GBM models. The best results on the validation set were achieved when using the following settings for the Random Forest model:

ntrees – 200
stopping rounds – 3
score each iteration – TRUE

where the ntrees parameter specifies the number of trees built within the forest, and the stopping rounds parameter, which is not enabled by default, is used for early stopping to prevent overfitting. The stopping metric was set to AUC and the stopping tolerance parameter to 0.0005. The stopping parameters specify that the model learning will stop after there have been three scoring intervals in which the AUC has not increased by more than 0.0005. As we used the validation set, the stopping tolerance was computed on the validation AUC, not on the training set itself. For the GBM model, we used these parameter values:

learn rate – 0.3
stopping tolerance – 0.01

In this case, we used an extra parameter: the learn rate parameter controls the learning rate of the model. Smaller values of the parameter cause the model to learn more slowly, with more trees needed to reach the same overall error rate, but they typically result in a better, more general model, especially on the testing data. Therefore, we experimented on the validation set with multiple learn rate values and obtained the best results when lowering the value of the learning rate to 0.001. The stopping tolerance in this model was set to 0.001. For the second task, we used the same approach and selected the same parameter values for both models.

2.4. Evaluation of the models

This section is dedicated to the evaluation of the models for both tasks. We used several approaches to measure the model accuracy on the testing set. As the main metric, we used the Receiver Operating Characteristic Area Under the Curve (ROC AUC), which is commonly used to present results for binary decision problems in machine learning. Table 1 summarizes the results of the models with the different sampling methods used and train/validation/test split sizes for the first task. The best model (Random Forest trained on the 70/10/20 split) achieved the best results.
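The early-stopping rule configured above (stopping rounds and stopping tolerance evaluated on the validation AUC) can be sketched as follows. This is an illustrative simplification in Python, not H2O's exact implementation, and the AUC trace is invented:

```python
def stopping_point(auc_trace, stopping_rounds=3, tolerance=0.0005):
    """Index of the scoring interval at which training stops, or None.

    Training stops once the validation AUC has failed to improve by more
    than `tolerance` for `stopping_rounds` consecutive scoring intervals.
    """
    best = float("-inf")
    stale = 0
    for i, auc in enumerate(auc_trace):
        if auc > best + tolerance:
            best, stale = auc, 0   # meaningful improvement: reset the counter
        else:
            stale += 1
            if stale >= stopping_rounds:
                return i
    return None

# Invented validation-AUC trace: real gains at first, then three
# consecutive improvements below the tolerance, triggering the stop.
stop = stopping_point([0.70, 0.75, 0.78, 0.7801, 0.7802, 0.7803])
```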
Its average error rate was 13.1%, split between both values of the predicted class. The confusion matrix showing the classification into the particular classes and the classification errors is shown in Table 2. The F1 metric (which combines precision and recall) of the model was 0.9247. Class 0, representing that the incident was not caused by the reported CI, was the class with a relatively high error rate. On the other hand,

more important in this task is to confirm whether the incident was caused by the reported CI. Classification of this class was more precise and, from the task perspective, the error rate on this class is more significant than on the other one. We focused mostly on the prediction of class 1, so the best models could have class 0 trained with a relatively higher error rate.

Table 1 Results of the models in the first task

Table 2 Confusion matrix for the best model in the first task

Table 3 summarizes the results of the models built for the second task. Similar to the first experiment, the Random Forest model achieved the best results. In the case of the best model, the average error of the model was 6.8%, with class 0 classified incorrectly more often (10.5% error rate), while class 1 achieved an error rate of 3.2%.

Table 3 Results of the models in the second task

2.5. Deployment

Deployment of the models into the production environment represents the final stage of the CRISP-DM methodology. In this case, we demonstrated the possibility of model deployment and integration by implementing a web user interface which simulates the user interface of the service management tools usually used in businesses for ITSM purposes. The application serves as a web-based interface to the data and models. It enables the model scoring functionality – recording of the incident data (data reported to the service desk, recorded when an incident occurs) and performing predictions (with both models) on that data. The output of the models may serve as a kind of recommendation for an operator working within the Incident Management process with such an application. Other implemented functionalities include several visualizations of the incident data. Such visualizations can give the operator better insight into the incident data and enable them to build a more complete picture of the incidents and related changes. Fig. 1 depicts the user interface of the implemented application. The application was implemented using R Shiny.

Fig. 1 Web-based application for the deployment demonstration purposes

3. CONCLUSIONS

The main objective of the work presented in this paper was to design and develop prediction models to be used in the Incident Management process. We used a dataset of incident records and related changes and specified two main areas we tried to explore. The first one was the relationship between the reported component of the infrastructure and the affected one. The second covered the relationship between the incidents and the related changes. We built predictive models to solve both of the presented problems, using Random Forest and GBM models in both cases. Our main objective was to find the best models possible; all models were evaluated on the testing set using the ROC curve. All pre-processing steps and all models were implemented in the R environment, and we used H2O as the machine learning library. As a possible deployment scenario, we implemented a web-based user interface in R Shiny. Such an application demonstrates how the models could be used if integrated into a real production environment. The entire process was guided by the CRISP-DM methodology.

ACKNOWLEDGMENT

This work was supported by the Slovak Research and Development Agency under contract No. APVV-16-0213 and by the VEGA project under grant No. 1/0493/16.

REFERENCES

[1] YOUNG, C. M.: ITSM Fundamentals: How to Create an IT Service Portfolio, Gartner research note, pp. 1–6, 2011.
[2] SARNOVSKY, M. – FURDIK, K.: IT service management supported by semantic technologies, In: SACI 2011 – 6th IEEE International Symposium on Applied Computational Intelligence and Informatics, Proceedings, 2011.
[3] DISTERER, G.: ISO 20000 for IT, Business & Information Systems Engineering, 1, pp. 463–467, 2009.
[4] ISACA: COBIT 5 Framework, 2012.
[5] CANNON, D.: ITIL Service Strategy, 2011 edition, 2011.
[6] HUNNEBECK, L.: ITIL Service Design, 2011.
[7] CANNON, D.: ITIL Service Transition, 2011.
[15] BUFFETT, S. – EMOND, B. – GOUTTE, C.: Using Sequence Classification to Label Behavior from Sequential Event Logs, In: 2014 Business Process Intelligence (BPI) Challenge, p. 27, 2014.
[16] DEES, M. – VAN DEN END, F.: A Predictive Model for the Impact of Changes on the Workload of Rabobank Group ICT's Service Desk and IT Operations, BPI Challenge 2014.
[17] LI, T. H. – LIU, R. – SUKAVIRIYA, N. – LI, Y. – YANG, J. – SANDIN, M. – LEE, J.: Incident Ticket Analytics for IT Application Management Services, In: 2014 IEEE International Conference on Services Computing, pp. 568–574, IEEE, 2014.
[18] ANDREWS, A. A. – BEAVER, P. – LUCENTE, J.: Towards better help desk planning: Predicting incidents and required effort, Journal of Systems and Software, Vol. 117, pp. 426–449, 2016.
