ESnet IT Service Management Update - Internet2

Transcription

ESnet IT Service Management Update
April 26, 2017
Presented at Internet2 Global Summit
Patty Giuntoli
Area Lead, Networking and Systems
ESnet

ESnet: An international mission networking facility optimized for data-intensive science
Connecting 50 labs, plants and facilities with 150 networks, universities, research partners globally
1.3 Tbps of external connectivity, including high speed access to commercial partners such as Amazon
Older than commercial Internet, growing twice as fast

80% of ESnet traffic originates or terminates outside of the DOE.
2/10/2017

ITSM Progress over the last 18 months
5/11/17

ITSM Area: Incident
Outcomes:
1. Simplified Incident input and ticket format
2. Introduced auto-assignment for specific incident types
3. Goal: more accurate insight into operations, and simpler reporting

ITSM Area: Asset
Outcomes:
1. Completed standardization of asset record fields
2. Asset Management process defined, and staff trained
3. Developing asset fit within ESnet Info Architecture

ITSM Area: Change
Outcomes:
1. Completed current process information gathering across ESnet
2. Developing high level strategy
3. Implemented standard maintenance window

Even more progress

ITSM Area: Service Catalog
Outcomes:
1. Launched service catalog
2. Drives visibility and improved cross-group interactions
3. Less dependent on email as workflow engine
4. Metrics and feedback identify popular requests as candidates for workflows
5. Generic “Request” reduces uses of Incident

ITSM Area: Configuration
Outcomes:
1. Strategy development underway

ITSM Area: Integration
Outcomes:
1. ESDB and Lab Asset system and ServiceNow integration (regular CSV imports)
2. Future integrations, Python and CLI integration based on REST via OAuth: more secure, timely and easier to develop
3. Auto-ticketing from Spectrum alerts
4. Demonstrated auto-ticketing from Nagios
5. Investigating ticketing from Splunk
6. Successful upgrade from Fuji direct to Istanbul, skipping Helsinki
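The REST-via-OAuth integration and the Nagios auto-ticketing mentioned above could look roughly like the following. This is a minimal sketch, not ESnet's implementation: it assumes the ServiceNow Table API endpoint for incidents and an OAuth bearer token obtained elsewhere. The instance URL, the function name `build_incident_request`, and the alert fields passed in are hypothetical. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical instance URL (assumption); replace with a real ServiceNow instance.
INSTANCE = "https://example.service-now.com"


def build_incident_request(token, host, service, state, output):
    """Build (but do not send) a POST that would create an incident.

    The alert fields mirror what a Nagios notification typically carries
    (host, service, state, plugin output); how they map onto ticket
    fields here is an illustrative assumption.
    """
    payload = {
        "short_description": f"{service} on {host} is {state}",
        "description": output,
    }
    return urllib.request.Request(
        url=f"{INSTANCE}/api/now/table/incident",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            # OAuth access token sent as a bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


# Example alert; "TOKEN" stands in for a real OAuth access token.
req = build_incident_request("TOKEN", "router1", "BGP", "CRITICAL", "peer down")
```

Passing `req` to `urllib.request.urlopen` would submit the ticket. Keeping construction separate from sending makes the mapping from alert to ticket easy to check without touching a live instance.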

How we use metrics
65% of all tickets are closed within 5 days
19% of tickets are closed the same day as opened
This is a LOT of incident tickets
Are they all really incidents? Let’s see
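Closure metrics like the two above can be derived from ticket open and close dates. The sketch below is illustrative only: the function name `closure_metrics`, the sample tickets, and the choice to use all opened tickets as the denominator are assumptions, not taken from the slides.

```python
from datetime import date


def closure_metrics(tickets, within_days=5):
    """Return same-day and within-N-days closure percentages.

    `tickets` is a list of (opened, closed) date pairs; `closed` is None
    for tickets that are still open. Percentages are taken over all
    opened tickets (an assumption about how the figures are defined).
    """
    total = len(tickets)
    if total == 0:
        return {"same_day_pct": 0.0, "within_pct": 0.0}
    closed = [(o, c) for o, c in tickets if c is not None]
    same_day = sum(1 for o, c in closed if c == o)
    within = sum(1 for o, c in closed if (c - o).days <= within_days)
    return {
        "same_day_pct": 100.0 * same_day / total,
        "within_pct": 100.0 * within / total,
    }


# Made-up sample data for illustration.
sample = [
    (date(2017, 4, 3), date(2017, 4, 3)),   # closed same day
    (date(2017, 4, 3), date(2017, 4, 6)),   # closed after 3 days
    (date(2017, 4, 4), date(2017, 4, 12)),  # closed after 8 days
    (date(2017, 4, 5), None),               # still open
]
metrics = closure_metrics(sample)
```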

Deep Dive into Incidents reveals

Week               Total Tickets Opened (by Operator)   Planned Maintenance   Ckt/BGP Flap   Incident or Service Outage   Other, Request
1                  57                                   22                    22             12                           1
2                  47                                   9                     27             10                           1
3                  50                                   14                    28             8                            0
4                  46                                   4                     33             8                            1
Total              200                                  49                    110            38                           3
Percent of Total                                        24.5                  55             19                           1.5

BGP Flap workflow: open a ticket, send a query to the site, monitor, ask for an RFO, then close the ticket.
Let’s look at prioritizing BGP flaps by AS data, and focus on the most critical AS.
The number of actual incidents is less than 20% of the total tickets.
Can we automate PMC?
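One way to prioritize BGP flaps by AS, as proposed above, is to count flap tickets per peer AS and weight each count by how critical that AS is. The sketch below is an assumption about how that could be done, not a description of ESnet's tooling: the criticality weights, the function name `rank_flapping_peers`, and the sample events are hypothetical, and the AS numbers come from the range reserved for documentation.

```python
from collections import Counter

# Hypothetical criticality weights per peer AS (higher = more critical).
# The AS numbers are from the documentation range 64496-64511.
CRITICALITY = {64496: 3, 64497: 1, 64498: 2}


def rank_flapping_peers(flap_events, criticality, default_weight=1):
    """Rank peer ASes by flap count weighted by criticality.

    `flap_events` is an iterable of peer AS numbers, one entry per flap
    ticket. Returns (asn, flap_count, score) tuples, highest score first,
    with ties broken by AS number.
    """
    counts = Counter(flap_events)
    scored = [
        (asn, n, n * criticality.get(asn, default_weight))
        for asn, n in counts.items()
    ]
    return sorted(scored, key=lambda row: (-row[2], row[0]))


# Made-up flap tickets, one peer AS per ticket.
events = [64496, 64497, 64497, 64497, 64498, 64496, 64499]
ranking = rank_flapping_peers(events, CRITICALITY)
```

In this sample, AS 64496 ranks first despite flapping less often than AS 64497, because its weight is higher. That is the intended effect of focusing on the most critical AS rather than on raw flap counts.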

Continuous Learning, Continuous Improvement
Data drives progress: with the tuning we have done to Incident, we can quickly identify trends and develop specific workflows to address them
Adapt and Adopt, focus must be benefit, not process or ITSM alignment
Transparency and communication matter
Processes matter
  – Write processes down, you’ll be surprised how many hidden steps appear
  – On/Off Boarding is a lot more complex than we thought!
  – Don’t automate a bad process
This stuff is hard, it takes time, commitment to outcomes, and willingness to try something new
  – Challenging when groups are already heavily utilized
  – It can sometimes be challenging to see the value when bogged down in deployment
Don’t do too much at once, try it on, live with it for a while, then feel free to adjust
Not everything fits in an ITSM process, don’t hurt innovation by trying to force a fit for everything you do

Thank You!
Feedback? Comments? Suggestions?
