Incident Management Procedures - Northwestern University

Incident Management Procedures
November 22, 2013
Version 1.4

Table of Contents

Document Control
Summary of Changes
Document Change-Approver
Document Approvals
Document Review Plans
How to Find the Latest Version of this Document
Overview
Description and Scope
Objectives and Performance Metrics
Incident Management Process Flows
Work Instructions
Roles and Responsibilities
Priority Classification
Overview
End User Knowledge of Priority
Assessment Process
End User Escalation Processes
Changing Priority (Impact and Urgency)
Urgency
Impact
Priority
Communication Timelines
Key Terms and Definitions

Last Revised: 01/13/14

Document Control

Summary of Changes

Version | Version Date | Nature of Change | Edited By
1.0 | 2009-October-15 | Initial Document |
1.0.1 | 2009-October-20 | Edited for formatting | Matt Gruhn
1.0.2 | 2009-October-21 | Edited ARCI definitions | Aaron Mansfield
1.1 | | | Don Strickland
1.1.1 | 2009-December-08 | | Michael Satut
1.2 | 2009-December-10 | Edited Don's suggestions; edited for formatting | Lynne Jeffers
1.2.1 | 2009-December-17 | Edited for formatting | Matt Gruhn
1.2.2 | 2009-December-17 | Edited definitions | Aaron Mansfield
1.2.3 | 2009-December-22 | Edited for formatting | Lynne Jeffers
1.3 | 2009-December-23 | Updated process flows and work instructions | Matt Gruhn
1.3.1 | 2009-December-30 | Updated and formatted communication timelines | Matt Gruhn
1.3.2 | 2010-November-09 | Updated formatting | Aaron Mansfield
1.4 | 2011-March-28 | Edited priority matrix | Aaron Mansfield
1.4.1 | 2011-May-06 | Updated priority 1 and 2 resolution goals and associated content | Aaron Mansfield
1.4.2 | 2011-June-24 | Updated timing related to Major Incident Management | Aaron Mansfield
1.4.3 | 2011-August-04 | Updated P2 Communication Guidelines | Aaron Mansfield
1.4.4 | 2013-November-22 | Updated priority 1 to include problem ticket information | Aaron Mansfield

Document Change-Approver

Title | Name | E-mail
TSS Associate Director (Document Owner) | Aaron Mansfield | aaron@northwestern.edu
Senior Technical Services Specialist | Michael Jones | michael@northwestern.edu

Document Approvals

The document owner is responsible for the accuracy and integrity of this document. Document changes are made through the change management process. To initiate a change to this document, e-mail the document owner. Proposed changes will be reviewed by the document change-approvers listed above. After approval from those listed above, the updated document will be presented to the Change Advisory Board (CAB) for final approval.

Document Review Plans

This document is reviewed and updated as defined below:
- As required to correct or enhance information content
- Following an annual review

How to Find the Latest Version of this Document

The latest and official version of this document may be obtained on the process documentation page of the NUIT wiki (Process Documentation). Printed copies are for reference only and are not controlled. It is the responsibility of users of this document to ensure that they are using the most recent version.

Overview

The incident management process includes the coordination of service recovery, notification, escalation, and event review for all services as defined in the Northwestern University Information Technology (NUIT) Service Catalog. This document is intended to provide a high-level overview of the incident management workflow.

This document is to be used as a reference for all NUIT staff to clearly understand the standards and procedures put in place to manage an incident through service restoration and incident review.

Description and Scope

This document describes the process to be followed for assessing an incident and determining the level of priority based on definitions of impact and urgency. Once priority is determined, the appropriate route for managing the incident resolution process is followed.

Objectives and Performance Metrics

All processes must be measured to ensure compliance, effectiveness, and efficiency and to serve as a baseline for improvement.

The objectives and associated metrics of the incident management process are as follows:
- Ensure timely incident resolution
  - Measured by mean time to repair (MTTR) statistics, including performance against associated targets
- Maximize service availability
  - Measured by incident handle time, broken down by support tier
  - Number of major incidents
- Effectively manage customer communications and notification
  - Measured by the number of updates and customer communications distributed via the following channels:
    - ACD emergency messages
    - Emergency bulk-mail messages
    - End user feedback
    - Service status web pages
- Improve communication between groups
  - Measured by status updates in the service manager tool, including performance against SLA or OLA requirements
- Accurately assign incidents
  - Measured by the percent of reassignments by the incident controller
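As an illustration of two of the metrics listed above, the following minimal sketch computes a mean time to repair and a reassignment percentage from a set of ticket records. The `IncidentRecord` type and its field names are hypothetical and are not taken from the NUIT service manager tool. Reading "percent of reassignments" as the share of incidents reassigned at least once is an assumption about how the metric is defined.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    """Hypothetical ticket export; field names are illustrative only."""
    opened: datetime
    resolved: datetime
    reassignments: int  # times the ticket was reassigned to another group


def mean_time_to_repair(records: list[IncidentRecord]) -> timedelta:
    """Average of (resolved - opened) across the given incidents."""
    if not records:
        raise ValueError("no incidents to measure")
    total = sum((r.resolved - r.opened for r in records), timedelta())
    return total / len(records)


def percent_reassigned(records: list[IncidentRecord]) -> float:
    """Share of incidents reassigned at least once, as a percentage."""
    if not records:
        raise ValueError("no incidents to measure")
    reassigned = sum(1 for r in records if r.reassignments > 0)
    return 100.0 * reassigned / len(records)


# Made-up sample data: one 2-hour incident and one 4-hour incident.
sample = [
    IncidentRecord(datetime(2013, 11, 1, 9, 0), datetime(2013, 11, 1, 11, 0), 0),
    IncidentRecord(datetime(2013, 11, 2, 9, 0), datetime(2013, 11, 2, 13, 0), 1),
]
print(mean_time_to_repair(sample))  # 3:00:00
print(percent_reassigned(sample))   # 50.0
```

Comparing these values against the associated targets would need the target figures themselves, which are defined elsewhere in the process documentation.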

Incident Management Process Flows

[Flow diagram: Incident Management: Interaction Management]
Lanes: End User; Tier 1 Analyst
Recoverable elements: Initiate interaction; Contact NUIT Support (1); Request for Change?; Tier 1 Incident Management; Terminate Interaction
Outputs: End user information and interaction description gathered
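The routing decisions in the interaction management flow (steps 3 to 5 of the work instructions) can be sketched as a short function. This is a simplified illustration, not NUIT tooling: the `Route` enum and the boolean parameters are hypothetical names, and the checks are applied in the order the work instructions list them.

```python
from enum import Enum


class Route(Enum):
    """Possible outcomes of screening an interaction (names are illustrative)."""
    TERMINATE = "terminate interaction"
    REQUEST_FULFILLMENT = "request fulfillment"
    CHANGE_MANAGEMENT = "change management"
    TIER1_INCIDENT = "tier 1 incident management"


def route_interaction(is_valid: bool, is_service_request: bool,
                      is_change_request: bool) -> Route:
    """Apply the step 3, 4 and 5 checks in order; the first match wins."""
    if not is_valid:
        return Route.TERMINATE            # step 3: invalid interaction
    if is_service_request:
        return Route.REQUEST_FULFILLMENT  # step 4: service request
    if is_change_request:
        return Route.CHANGE_MANAGEMENT    # step 5: request for change
    return Route.TIER1_INCIDENT           # otherwise handle as an incident


print(route_interaction(True, False, False).value)
```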

[Flow diagram: Incident Management: Tier 1 Incident Management]
Lanes: Tier 1 Analyst
Inputs: Incident received from interaction management or event management; Known error database reviewed; Previously documented workarounds reviewed
Recoverable elements: Assess Urgency and Impact (6, 7); Known Error(s)? (8); Workaround Available? (8); Apply Workaround (9); Workaround Successful? (9); Update Incident and Close
Outputs: Urgency and impact determined; priority calculated; Attempted workaround documented; Ticket closed and end user updated if applicable
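The known-error and workaround decisions in the tier 1 flow (steps 8 and 9 of the work instructions) can be sketched as follows. The function name, its parameters, and the returned strings are hypothetical. The workaround is modeled as a callable that returns whether it succeeded, which is an assumption made only to keep the example self-contained.

```python
from typing import Callable, Optional


def handle_tier1(known_error: bool,
                 workaround: Optional[Callable[[], bool]]) -> str:
    """Return the next action for a tier 1 analyst after priority assessment."""
    # Step 8: no known error, or no workaround on file, goes to incident control.
    if not known_error or workaround is None:
        return "refer to incident control"
    # Step 9: apply the workaround; close on success, otherwise refer onward.
    if workaround():
        return "update incident and close ticket"
    return "refer to incident control"


# A known error whose workaround succeeds is closed at tier 1.
print(handle_tier1(True, lambda: True))
```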


[Flow diagram: Incident Management: Major Incident Management 1]
Lanes: End User; Tier 2 or 3 Analyst; Subject Matter Expert; Incident Controller; Major Incident Manager
Inputs: Possible Priority 1 identified
Timing: Within 5 minutes of initial contact; Within 15 minutes of initial contact; Within 5 minutes of major incident manager's engagement; Within 15 minutes of major incident manager's engagement
Recoverable elements: Acknowledge Receipt of the Incident (19); Open Stakeholder Bridge (20); Priority 1 Declared? (21); Coordinate Downgrade Notification (22); Coordinate Major Incident Management Communication (23, 24); Establish Status Call (25); Troubleshoot and Update the Incident Ticket (26); Update the ACD Message, as Needed (27); Provide Information to Support Analysts, as Necessary; Tier 2 or 3 Diagnosis and Resolution; Incident Control
Outputs: Stakeholder notification coordinated; Incident ticket downgraded; downgrade notification coordinated; Incident and problem ticket created; Status call information distributed to participants
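The "Priority 1 Declared?" decision (step 21) and the actions that follow it (steps 22 to 25 of the work instructions) can be sketched as a function that returns a checklist. The function name and the wording of the checklist entries are hypothetical summaries of those steps, not an official NUIT runbook.

```python
def after_priority_assessment(priority_one_declared: bool) -> list[str]:
    """List the follow-up actions once the priority 1 assessment concludes."""
    if priority_one_declared:
        return [
            "engage major incident manager with incident details",  # step 23
            "coordinate communication and community notification",  # step 24
            "establish technical bridge for support teams",         # step 25
        ]
    # Step 22: the incident is downgraded and handled as a standard incident.
    return [
        "coordinate downgrade notification",
        "close stakeholder bridge",
        "coordinate community notification",
        "continue via standard incident process",
    ]


for action in after_priority_assessment(False):
    print(action)
```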

[Flow diagram: Incident Management: Major Incident Management 2]
Lanes: Incident Controller; End User; Tier 2 or 3 Analyst; Subject Matter Expert; Major Incident Manager
Inputs: End user verification provided
Recoverable elements: Manage the Status Call (28a); Manage Incident Resolution (28b); Participate in Status Call (29c); Identify and Test a Resolution (29); Implement Resolution in Production (30); Resolution Successful? (31); Determine Verification Plan (32); Verify Resolution With End User (33); Update Incident Ticket (34); Update Problem Ticket (34a)
Outputs: Resolution documentation composed; Verification plan developed; Tickets updated
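The resolution loop in this flow (diagram elements 29 to 34a) can be sketched as follows: candidate resolutions are tried until one succeeds, and only then do verification and ticket updates happen. The function, its parameters, and the log messages are hypothetical. The `implement` callable stands in for implementing a fix in production and reporting whether it worked.

```python
from typing import Callable, Iterable, Optional


def resolve_major_incident(
    candidates: Iterable[str],
    implement: Callable[[str], bool],
) -> tuple[Optional[str], list[str]]:
    """Try each candidate resolution in turn; return the one that worked and a log."""
    log: list[str] = []
    for fix in candidates:
        log.append(f"identified and tested: {fix}")        # (29)
        succeeded = implement(fix)                         # (30)
        log.append(f"implemented in production: {fix}")
        if succeeded:                                      # (31) yes
            log.append("verification plan determined")     # (32)
            log.append("resolution verified with end user")  # (33)
            log.append("incident ticket updated")          # (34)
            log.append("problem ticket updated")           # (34a)
            return fix, log
        # (31) no: loop back and identify another resolution
    return None, log


fix, log = resolve_major_incident(
    ["restart service", "fail over to standby"],
    implement=lambda f: f == "fail over to standby",  # made-up outcome
)
print(fix)
```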


Work Instructions

Each step lists its description, its owner, and its ARCI assignments (A, R, C, I). A dash marks an ARCI column that is empty in the source.

Step 1
  Description: End user calls, e-mails, chats, or uses self-service to report an incident to the service desk.
  Owner: End User
  ARCI: A End User; R End User; C Tier 1 Analyst; I -

Step 2
  Description: Gather end user information and interaction description. Verify that end user and service recipient information is available. If not, add relevant content.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C End User; I -

Step 3
  Description: Determine whether the interaction is valid. If it is not, terminate the interaction.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C End User; I -

Step 4
  Description: Determine whether the interaction is a service request. If it is, initiate request fulfillment.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C End User; I -

Step 5
  Description: Determine whether the interaction is a request for change. If it is, initiate change management.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C End User; I -

Step 6
  Description: Assess the impact of the incident using the NUIT policy as described in the definitions section at the end of this document:
    - Campus Wide Impact
    - Departmental Impact
    - Office Impact
    - Single User Impact
  Note: Automated tickets opened by monitoring tools default to priority 3. The tier 1 analyst must evaluate the true impact and update the initial priority as needed.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C -; I -

Step 7
  Description: Assess the urgency of the incident using the NUIT policy as described in the definitions section of this document:
    1. Immediate
    2. Critical
    3. Elevated
    4. Routine
  Note: Automated tickets opened by monitoring tools default to priority 3. The tier 1 analyst must evaluate the true urgency and update the initial priority as needed.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C -; I -

Step 8
  Description: Determine whether the incident is a known error and whether a workaround is available. If the incident is not a known error or there is no known workaround, refer the incident to incident control.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C -; I Incident Controller

Step 9
  Description: Apply workaround if available. If the workaround fails, refer the incident to incident control. If the workaround succeeds, update and close the ticket.
  Owner: Tier 1 Analyst
  ARCI: A Tier 1 Analyst; R Tier 1 Analyst; C End User; I Incident Controller

Step 10
  Description: Evaluate the incident by validating the assessment. If the incident should be abandoned, update and close the ticket.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C -; I -

Step 11
  Description: If the incident is a possible Priority 1, coordinate assessment activities. Go to step 19. For standard incidents, refer the ticket to an appropriate tier 2 or 3 team.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C Tier 2 Analyst/Tier 3 Analyst; I Major Incident Manager

Step 12
  Description: Isolate and diagnose the incident. If diagnosis fails, or if no resolution is available for the diagnosed cause, determine whether the incident should be reassigned. If not, continue to diagnose the incident. If the incident should be reassigned, refer the ticket back to the incident controller for reevaluation.
  Owner: Tier 2 Analyst
  ARCI: A Tier 2 Analyst/Tier 3 Analyst; R Tier 2 Analyst/Tier 3 Analyst; C -; I Incident Controller

Step 13
  Description: Determine whether a change is required to resolve the incident. If so, initiate change management.
  Owner: Tier 2 Analyst
  ARCI: A Tier 2 Analyst/Tier 3 Analyst; R Tier 2 Analyst/Tier 3 Analyst; C -; I -

Step 14
  Description: Apply and verify a resolution. If the resolution was not successful, determine whether the incident should be reassigned. If not, continue to diagnose the incident. If the incident should be reassigned, refer the ticket back to the incident controller for reevaluation.
  Owner: Tier 2 Analyst
  ARCI: A Tier 2 Analyst/Tier 3 Analyst; R Tier 2 Analyst/Tier 3 Analyst; C -; I Incident Controller

Step 15
  Description: If the initial tier 2 or 3 analyst could not diagnose or resolve the incident, reevaluate the ticket and reassign to another tier 2 or 3 group.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C -; I Tier 2 Analyst

Step 16
  Description: Verify the resolution with the end user.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C End User; I Tier 2 Analyst

Step 17
  Description: Determine whether a problem has been generated by the incident. If so, initiate problem management.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C or I (column unclear in source) Tier 2 Analyst

Step 18
  Description: Update the incident and close the ticket. Incident management is complete.
  Owner: Tier 2 Analyst
  ARCI: A Tier 2 Analyst; R Tier 2 Analyst; C -; I End User

Step 19
  Description: Acknowledge receipt of the incident by assigning the incident ticket to the appropriate tier 2 or 3 analyst.
  Note: If the assignment group recognizes the cause to be a known error with a standard resolution that can be quickly implemented, they will notify the incident controller and no status call will be scheduled. In this situation, the tier 2 or 3 analyst assumes the responsibility of coordinating update and resolution notifications with the incident controller as needed, including assisting in documenting the impact.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Tier 2 Analyst/Tier 3 Analyst; C Major Incident Manager; I Incident Controller

Step 20
  Description: While the tier 2 or 3 analyst receives and works the incident, the incident controller kicks off the priority 1 assessment process by coordinating the stakeholder bridge for incident assessment. To avoid confusion, the initial notification will contain minimal detail.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C Tier 2 Analyst/Tier 3 Analyst; I Stakeholders

Step 21
  Description: The incident controller is to work with the SME to establish consensus that the priority has been assessed correctly. The incident controller is required to follow the standard definitions of urgency and impact in determining the priority.
  Owner: Incident Controller
  ARCI: A SME; R Incident Controller; C Stakeholders; I -

Step 22
  Description: If the incident controller determines that the incident is not priority 1, the incident controller will work to organize a downgrade notification, and the incident will be worked via the standard incident process. If the incident is downgraded to a priority 2 or less, the incident controller will close the stakeholder bridge and coordinate community notification.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C SME; I Management, Stakeholders

Step 23
  Description: Engage the major incident manager, providing the incident details (including IM ticket number) and, if needed, assist in developing the initial notification.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller, Major Incident Manager; C -; I Management, Stakeholders

Step 24
  Description: Coordinate communication and community notification activities.
  Owner: Major Incident Manager
  ARCI: A Major Incident Manager; R Major Incident Manager; C -; I Management, Stakeholders

Step 25
  Description: Establish a technical bridge to permit support teams to work in parallel on incident resolution.
  Owner: Major Incident Manager
  ARCI: A Major Incident Manager; R Major Incident Manager; C -; I SME, Incident Controller, Tier 2 Analyst, Tier 3 Analyst

Step 26
  Description: Troubleshoot the incident, attempting to determine its root cause.
  Owner: Tier 2 Analyst / Tier 3 Analyst
  ARCI: A Tier 2 Analyst/Tier 3 Analyst; R Tier 2 Analyst/Tier 3 Analyst; C -; I -

Step 27
  Description: Update the ACD (Automated Call Distribution) message as needed. If an incident can be expected to produce a high volume of calls to the service desk, the incident controller can work with the service desk to record a message to be played at the beginning of the ACD menu, before callers hear any menu options.
  Owner: Incident Controller
  ARCI: A Incident Controller; R Incident Controller; C -; I End User

Step 28a
  Description: Manage status call process:
    - Request additional resources to join call as identified by SME
    - Verify the tier 2 or tier 3 team updates incident
    - Escalate within Northwestern University or service partner organizations, when required, to gain additional focus and resources to permit timely resolution within service levels
    - Monitor who is on the call
    - Coordinate distribution of update notifications regularly to communicate incident status
  If no status call is scheduled, monitor ticket for troubleshooting actions/progress and contact SME or tier 2 or 3 analyst for status as needed.
  Owner: Major Incident Manager
  ARCI: A Major Incident Manager; R Major Incident Manager; C -; I -

Step 28b
  Description: Manage incident resolution:
    - Identify appropriate personnel to participate in status call and notify major incident manager
    - Coordinate update/resolution notifications with the major incident manager as needed
    - As appropriate, add groups (or individuals) or release groups (or individuals) to/from the call
    - As needed, communicate initial troubleshooting steps to additional resources as they join the bridge
  Note: Transferring ownership to another SME: If an incident is initially reported as impacting one service, but during troubleshooting it is determined that a different service is impacted, this may call for the original SME to transfer ownership to a different SME. Similarly, if an incident is determined to be caused by an infrastructure component and that component error impacts multiple critical services, the SME role may be transferred to the SME for that infrastructure component. To transfer ownership:
    1. The original SME should request that the SME for the other service or infrastructure component join the status call.
    2. The original SME should brief the other SME, explaining how they reached the conclusion that a transfer of ownership would be appropriate.
    3. The other SME must then agree that transfer is appropriate and will then take ownership of the incident.
  Owner: SME
  ARCI: A SME; R SME; C -; I Major Incident Manager

Step 28c
  Description: Participate in status call and update ticket with troubleshooting actions and progress:
    - Every 60 minutes for priority 1
  Note: If another team will implement the resolution, the incident ticket should be reassigned to that assignment group.
  Owner: Tier 2 Analyst / Tier 3 Analyst
  ARCI: A Tier 2 Analyst/Tier 3 Analyst; R Tier 2 Analyst/Tier 3 Analyst; C -; I Major Incident Manager

Step 29
  Description: Identify a resolution, document the resolution, and test the resolution on a non-production environment if it makes sense, time permits, and a duplicate environment exists.
  Owner: Tier 2 Analyst / Tier 3 Analyst
  ARCI: A Tier 2 Analyst/Tier 3 Analyst; R Tier 2 Analyst/Tier 3 Analyst; C SME, Major Incident Manager; I -
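The 60-minute update cadence for priority 1 tickets in step 28c lends itself to a simple check, for example when a major incident manager monitors a ticket for progress. The sketch below is illustrative only: the constant and function names are hypothetical, and only the 60-minute interval comes from this document.

```python
from datetime import datetime, timedelta

# Step 28c: priority 1 tickets are updated every 60 minutes.
P1_UPDATE_INTERVAL = timedelta(minutes=60)


def update_overdue(last_update: datetime, now: datetime,
                   interval: timedelta = P1_UPDATE_INTERVAL) -> bool:
    """True when more than `interval` has passed since the last ticket update."""
    return now - last_update > interval


last = datetime(2013, 11, 22, 10, 0)
print(update_overdue(last, datetime(2013, 11, 22, 10, 59)))  # False
print(update_overdue(last, datetime(2013, 11, 22, 11, 1)))   # True
```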
