Major Incident Handbook For Services - Harvard University

Transcription

Major Incident Handbookfor ServicesHotline617-496-2831July 2015Questions? Contact IT Service Management at itsm@harvard.edu.

Table of ContentsOverviewIncident Priority LevelsReport Major IncidentsGoals of Major Incident ProcessHigh-Level Process and StepsMajor Incident CategoriesKey RolesCommunicationsProcess and ProceduresAssessContainResolveRoles & ResponsibilitiesCode of ConductIncident ManagerTechnical TeamsService Desk and SOC OpsIncident CoordinatorIncident LeaderRACI ChartAppendix345678910111218212324262829303132332

Major Incidents: Overview3

Incident Priority LevelsUniversity digitalemergencyCrisisMajor 0.1%CriticalUnplanned serviceinterruption or degradationthat disrupts teaching,learning, research, and/oradministration. 1%OperationalHigh 5%Normal 95%Category1 Initial assessment orlimited impact2 Known solution orworking toward a solution3 Significant impact with noknown solution

Report Major IncidentsSee something, say something If you think there may be a major incident, call the hotline.If in doubt, err on the side of calling; don’t wait!Call the Hotline Anytime!617-496-2831Provide as much information as possible to facilitate the process: Incident start time Services or applications impacted Impact to users or University functions Teams need for troubleshooting Initial diagnosis or current actions, if any5

Goals of Major Incident Process Minimize negative impact to the institution and itsmission. Restore normal service as quickly as possible.– Implement workaround, if it enables a faster resolution.– Balance service restoration (incident management) withgathering root cause information (problem management). Marshal necessary staff and resources for resolution. Communicate appropriately and promptly.– Internal: HUIT, including senior leadership as required– External: Users and stakeholders– For security incidents, different protocols may be necessary. When possible, preserve forensics for further analysis.6

Major Incident: High-Level Process and StepsDeclaration of MI Triggered by Service Desk reporting, event monitoring, and/or individual Preliminary information gathering regarding incident (Incident Coordinator)ASSESSInitial Coordination Initial assessment and categorization of MI (Service/Offering Owner) Convene service/offering owner and technical team on conference bridge (Incident Coordinator) Notify HUIT staff (Incident Coordinator) and users (Service/Offering Owner)CONTAINRESOLVEInvestigation and Diagnosis Marshaling of appropriate troubleshooting resources (Service/Offering Owner) Ongoing communication (Service/Offering Owner) Escalation as needed (Service/Offering Owner or Incident Coordinator) Vendor management as needed (Service/Offering Owner) Implement temporary workaround or permanent solution (Service/Offering Owner)Resolution and Closure Validation that all services are operational, including downstream (Service/Offering Owner) Gathering of all key information, e.g., end time, actions, root cause when known (Service/OfferingOwner and Incident Coordinator) Final communication, including HUIT Alert notification of resolution (Incident Coordinator) Trigger problem review and After Action Review7

Major Incident Categories and FlowCategory DescriptionIncident Leader Role1Initial assessment or limitedimpact with risk of escalationInformed2Known solution or working towarda solutionInvolved,if extended duration3Significant impact with no knownsolutionLeadingTime-boxed:Assess category statusevery 30 minutes8

Key Roles in Major Incident ManagementArea ofLeadershipRoleLeadershipIncident Leader (Member of SLT; typically MD of affected service) Owns progress during category 2 or 3 Major IncidentCommunicates with CIO/DCIO, other senior HUIT staff, and key stakeholders(e.g., deans)Service andTechnicalIncident Manager (Service/Offering Owner)ProcessIncident Coordinator (ITSM) Qualifies incidentLeads troubleshootingMarshalls resources for troubleshooting and resolutionAccountable for communication with users and IT stakeholdersDetermines when incident has been resolvedConducts After Action Review and problem reviewLead process facilitatorInitiates contact with service/offering ownerEstablishes conference bridgeProvides initial communication (e.g., HUIT alerts & website, end users ifpredefined for affected service)Documents material technical, communications, and related information in ticket9

Major Incident CommunicationsHUITCommunityDeclaration Ticket created (Incident Coordinator) HUIT alert - New (Incident Coordinator) Updates to UCIO/DCIO, if category 2 or 3 (IncidentLeader) HUIT website and Twitter(Incident Coordinator) Service Desk phone message End users and stakeholders(Service/Offering Owner)Assess andContain Updates every 30 minutes to Incident Coordinator and, ifapplicable, Incident Leader (Service/Offering Owner) Updates to UCIO/DCIO, if category 2 or 3 (IncidentLeader) HUIT alert – Category Change (Incident Coordinator) As necessary, update endusers and stakeholders(Service/Offering Owner)Resolve Declaration of resolution to Incident Coordinator and, ifapplicable, Incident Leader (Service/Offering Owner) HUIT alert - Resolved (Incident Coordinator) Updates to UCIO/DCIO, if category 2 or 3 (IncidentLeader) HUIT website and Twitter(Incident Coordinator) End users and stakeholders(Service/Offering Owner)10

Major Incidents:Process and Procedures11

Declaration of Major IncidentTriggersASSESSEnd UsersSudden flood of ticketsStaffPreventative measure, alerts, vendor notificationsCriteriaUrgencyLMHIntolerance for delayCONTAINRisk of EscalationLMHLMHinto more widespread issueSize of PopulationRESOLVE}Declare MajorIncident, if twocriteria are M / H.Responsible for declaration: Incident Manager If above unavailable, Incident Coordinator12

Initial Coordination Incident Coordinator convenes service/offering owner andtechnical team on conference bridge.ASSESSThe Conference Bridge is critical forefficient troubleshooting and centralized communications. All required resource should join within 15 minutes. CONTAINIf no response, Incident Coordinator escalates to Service Owner and/orDirector/MD. Once service/offering owner (or proxy) joins, s/he leads call. Initial discussion on bridge:1. Articulate issue.2. Assess business impact.3. Review recent changes. Open separate, technical bridge, if needed.RESOLVENB: Activity on bridge should focus primarily on service restoration. Root cause isimportant, but should generally remain secondary. Whenever possible, preserve forensicinformation in support of further analysis and After Action Review.13

Conference Bridge Phone Numbers and AccessJoin the Bridge:866-890-3820 or 334-323-7229ASSESSCONTAINBridgeLeaderParticipantMI Bridge71284835#38793366#Secondary Bridge58775889#86065154#Tertiary Bridge52545874#42832588#Conference Bridge FeaturesRESOLVELEADER*2Begin / end recording*7Lock / unlock conference72# Roll call81# Mute all lines80# Unmute all linesANYONE*0Operator*6Mute / unmute your line14

Major Incident Classification Incident Manager (or proxy) provides initial classification.– Based on reported and actual user impact, event monitoring, availability ofknown solutions, and potential to become a crisis.ASSESS– If Incident Manager unreachable, this assessment defaults to the IncidentCoordinator. Preliminary assessment should be made and then updated every30 minutes (may be longer, depending on investigation needed).– It is better to err on the side of caution and, if appropriate, downgrade to Criticalduring or after the incident.CONTAIN If no possible solution is identified within 2 hours, MI categoryshould be upgraded. Incident Manager may adjust classification, as additionalinformation is gathered, based on time of year and/or businessneeds, etc.CategoryRESOLVEDescriptionIncident Leader Role1Initial assessment or limited impact with risk ofescalationInformed2Known solution or working toward a solutionInvolved, if extended3Significant impact with no known solutionLeading15

Information Security IncidentsASSESS Notify Information Security Operations of any incidentsthat may pertain to information security (e.g., systemcompromise, compromise of administrative credentials). Incidents with a significant information securitycomponent should be handled differently from thestandard Major Incident protocol.CONTAINRESOLVE– In the case of system and/or account compromise,sufficient time must be allotted for accurate and detailedassessment of the scope of the incident.– Communications outside of – or even internal to – HUIT areoften limited (e.g., no ticket in ServiceNow, no posting onHUIT status page) to avoid “tipping off” the perpetrator(either through our own or public channels) and minimizeanxiety. For these incidents, the CISO (or designate) acts asIncident Leader.16

Initial Communication of a Major Incident Incident Coordinator creates ticket in ServiceNowASSESS– Primary channel for internal updates on progress Communications chain initiated for category 2(extended) or category 3:Incident Manager à Incident Leader à DCIO/CIO Incident Coordinator notifies HUIT staffCONTAINRESOLVE– HUIT alerts– HUIT website (status.huit.harvard.edu)– Twitter (@HUITAlerts) Update of Service Desk phone message with anyspecific steps/information Incident Manager notifies users and stakeholders17

Investigation and DiagnosisASSESSGuiding Principles & Procedures Pursue multiple leads and parallel work streams asappropriate. Avoid combining seemingly-related incidents.– Continue to troubleshoot as separate incidents until it’sconfirmed they are related.CONTAIN Review change calendar to identify any potential causes orimpact. Follow troubleshooting checklists; leverage workflows orservice maps (if available).RESOLVE18

Investigation and DiagnosisIncident Manager (aka Service/Offering Owner) Technical investigation and diagnosisASSESS Marshaling of appropriate troubleshooting resources– Launch 2nd conference bridge for technical discussions, ifneeded; coordinate between two conferences to ensure timelyupdates Vendor management as neededCONTAIN Ongoing status updates for communication needsIncident Coordinator Documentation of all material technical, process,communications, and related information in ticket Additional HUIT Alerts, if MI category changesRESOLVEIncident Leader Hourly updates to CIO/DCIO and key senior-levelstakeholders19

Technical Escalation Escalations raise involvement and awareness of incident tomore advanced skill sets or senior decision-making levels.ASSESS Incident Manager is accountable for the overall escalationprocess. Current level notifies the next level no later than the hourindicated below.CONTAINNLT* HourTechnicalTroubleshooting Admins, Engineers, Developers, etc.Begins2Sr. Tech. Engineers and Architects4Vendor (if applicable)RESOLVE*No Later Than20

Resolution and Closure ResolutionASSESS– Incident Manager validates all services are operational,including those downstream– Incident Coordinator gathers all key information (e.g.,end time, actions, root cause when known) andincludes in ticket Incident ClosureCONTAIN– Final communication, including HUIT Alert notificationof resolution– Problem review and After Action Review for all involved Occurs following Tuesday during Problem Management meetings Confirm/update root causeRESOLVE Address any open issues or concerns Identify necessary mitigation and next steps, including ownersand timelines21

After Action Review Template Incident #ASSESS Duration– Start date– End date– Total time Affected SystemsCONTAIN Symptoms Chronology andSummary of Events Workaround/Solution Root Cause Analysis– Technical Process &Communication Review– What went well?– What can be improved? Next Steps– Preventative measuresto prevent recurrence– RecommendationsRESOLVE22

Major Incidents:Roles & Responsibilities23

Code of Conduct Remember to Act according to the HUIT Values (i.e. user-focused,collaborative, innovative, and open), especially during these criticalperiods. Report Issues: Call the hotline at 617-496-2831!– “See something, say something.”– Avoid calls to the Service Desk or individual staff.– For crisis, calls may be made directly to the ESF leader. Respond Quickly when contacted by the Incident Coordinator.– Call into the bridge as soon as possible, but no later than 15 minutesafter notification. Respect the Process to ensure efficient resolution and maximumcollaboration.– Limit conference bridge to essential staff: Avoid unnecessaryparticipants, which may impede progress and clear communication.24

Code of Conduct Ensure Timely Access to Appropriate Staff (Yourself or Others).– Maintain and share up-to-date lists of on call information, staff (includingvacation and back-up coverage), and phone numbers.– Provide readily available access to needed technical expertise(especially during times of escalation or transition).– Dedicate your own or your team’s time as warranted and attention toensure speedy service restoration. Coordinate Internal Communications: Material troubleshooting,communications, and decisions should be communicated/validatedon conference bridge and in the ticket.25

Incident Manager ResponsibilitiesService/Offering Owner or proxy is accountable for the restoration of aninterrupted or degraded service. Confirm and classify Major Incident, based on organizational impact. Lead troubleshooting effort:– Identify, marshal, and deploy technical resources.– Lead discussion on primary conference bridge (and coordinate withtechnical bridge, if applicable). Approve proposed fix or workaround. Confirm resolution of Major Incident. Accountable for communications with stakeholders and end-users. Responsible for After Action Review:– Review and validate record of events.– Determine and implement preventative measures and n

RACI Chart 32 Appendix 33 2 . Major Incidents: Overview 3 . Incident Priority Levels Crisis Major 0.1% Critical 1% High 5% Normal 95% Unplanned service interruption or degradation that disrupts teaching, learning, research, and/or administration. University digital emergency Operational Category 1 Initial assessment or limited impact 2 Known solution or working toward a solution 3 .