Performance Checklists for SREs - Brendan Gregg

Transcription

Performance Checklists for SREs
Brendan Gregg, Senior Performance Architect

Performance Checklists

per instance:
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top

cloud wide:
[...]
6. Load Avg
7. Java Heap
8. ParNew
9. Latency
10. 99th %ile

Brendan the SRE
On the Perf Eng team & primary on-call rotation for Core: our central SRE team
– we get paged on SPS dips (starts per second) & more
In this talk I'll condense some perf engineering into SRE timescales (minutes) using checklists

Performance Engineering ≠ SRE Performance Incident Response

Performance Engineering
Aim: best price/performance possible
– Can be endless: continual improvement
Fixes can take hours, days, weeks, months
– Time to read docs & source code, experiment
– Can take on large projects no single team would staff
Usually no prior "good" state
– No spot the difference. No starting point.
– Is now "good" or "bad"? Experience/instinct helps
Solo/team work
At Netflix: the Performance Engineering team, with help from developers

Performance Engineering

Performance Engineering
[diagram: stat tools, source code, tuning, PMCs, profilers, flame graphs]

SRE Perf Incident Response
Aim: resolve issue in minutes
– Quick resolution is king. Can scale up, roll back, redirect traffic.
– Must cope under pressure, and at 3am
Previously was in a "good" state
– Spot the difference with historical graphs
Get immediate help from all staff
– Must be social
Reliability & perf issues often related
At Netflix: the Core team (5 SREs), with immediate help from developers and performance engineers

SRE Perf Incident Response

SRE Perf Incident Response: custom dashboards, distributed system tracing, central event logs, chat rooms, pager, ticket system

Netflix Cloud Analysis Process
[flow diagram, in summary: an example SRE response path – 1. check dashboards, 2. check events (Chronos), ... create new alert; plus some other tools not shown (e.g., Salp)]

The Need for Checklists
Speed
Completeness
A Starting Point
An Ending Point
Reliability
Training
Perf checklists have historically been created for perf engineering (hours), not SRE response (minutes)
More on checklists: Gawande, A., The Checklist Manifesto. Metropolitan Books, 2008
[image: Boeing 707 Emergency Checklist (1969)]

SRE Checklists at Netflix
Some shared docs
– PRE Triage Methodology
– go/triage: a checklist of dashboards
Most "checklists" are really custom dashboards
– Selected metrics for both reliability and performance
I maintain my own per-service and per-device checklists

SRE Performance Checklists
The following are:
– Cloud performance checklists/dashboards
– SSH/Linux checklists (lowest common denominator)
– Methodologies for deriving cloud/instance checklists
Ad Hoc → Methodology; Checklists → Dashboards
Including aspirational: what we want to do & build as dashboards

1. PRE Triage Checklist

PRE Triage Checklist
Performance and Reliability Engineering checklist
– Shared doc with a hierarchical checklist, 66 steps total
1. Initial Impact
   1. record timestamp
   2. quantify: SPS, signups, support calls
   3. check impact: regional or global?
   4. check devices: device specific?
2. Time Correlations
   1. pretriage dashboard
      1. check for suspect NIWS client: error rates
      2. check for source of error/request rate change
      3. [dashboard specifics ...]
Confirms, quantifies, & narrows problem. Helps you reason about the cause.

PRE Triage Checklist, cont.
3. Evaluate Service Health
– perfvitals dashboard
– mogul dependency correlation
– by cluster/asg/node:
   latency: avg, 90th percentile
   request rate
   CPU: utilization, sys/user
   Java heap: GC rate, leaks
   memory
   load average
   thread contention (from Java)
   JVM crashes
   network: tput, sockets
   [...]
custom dashboards

2. predash: Initial dashboard (Netflix specific)

predash
Performance and Reliability Engineering dashboard
A list of selected dashboards suited for incident response

predash
List of dashboards is its own checklist:
1. Overview
2. Client stats
3. Client errors & retries
4. NIWS HTTP errors
5. NIWS errors by code
6. DRM request overview
7. DoS attack metrics
8. Push map
9. Cluster status

3. perfvitals: Service dashboard

perfvitals
[dashboard panels: request rate; errors, timeouts, retries; response time: average, 99th; utilization & saturation; instances: count, state, version; scale up/down?]
All time series, for every application, and dependencies.
Draw a functional diagram with the entire data path.
Same as Google's "Four Golden Signals" (Latency, Traffic, Errors, Saturation), with Instances added due to cloud
– Beyer, B., Jones, C., Petoff, J., Murphy, N. Site Reliability Engineering. O'Reilly, Apr 2016
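As a rough shell illustration of two of these signals, a hedged sketch: it assumes a hypothetical access log whose last field is the response time in milliseconds, and uses gawk for asort(); the file name and field layout are illustrative.

# average and 99th-percentile latency from a hypothetical access.log
gawk '{ t[NR] = $NF; s += $NF }                  # collect response times
      END { n = asort(t)                         # sort ascending (gawk extension)
            i = int(n * 0.99); if (i < 1) i = 1  # 99th-percentile rank
            printf "avg=%.1f ms  p99=%.1f ms\n", s/n, t[i] }' access.log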

5. Bad Instance Dashboard: An Anti-Methodology

Bad Instance Dashboard
1. Plot request time per instance
2. Find the bad instance
3. Terminate bad instance
4. Someone else's problem now!
In SRE incident response, if it works, do it ([...] Exploder)
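A minimal sketch of that workflow, assuming a hypothetical per-instance report latency.txt ("instance-id avg_ms" per line) and EC2-hosted instances behind an auto scaling group; the metric source and file are illustrative, not the actual Netflix tooling:

# pick the instance with the worst average request time
bad=$(sort -k2 -rn latency.txt | head -1 | awk '{print $1}')
echo "terminating worst instance: $bad"
# terminate it; the auto scaling group launches a replacement
aws ec2 terminate-instances --instance-ids "$bad"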

Lots More Dashboards
We have countless more, mostly app specific and reliability focused
Most reliability incidents involve time correlation with a central log system
Sometimes, dashboards & monitoring aren't enough. Time for SSH.
[chart: NIWS HTTP errors – error types × regions × apps over time]

6. Linux Performance Analysis in 60,000 milliseconds

1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top

Netflix Tech Blog: [...]linux-performance-analysis-in-60s.html
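To run the whole 60-second checklist in one pass, a minimal wrapper sketch (five samples per interval tool; the output file name is arbitrary):

{
  uptime                        # load averages
  dmesg -T | tail               # kernel errors, OOM kills
  vmstat 1 5                    # run queue, memory, swap, system-wide CPU
  mpstat -P ALL 1 5             # per-CPU balance
  pidstat 1 5                   # per-process CPU
  iostat -xz 1 5                # disk workload & utilization
  free -m                       # memory totals
  sar -n DEV 1 5                # network interface throughput
  sar -n TCP,ETCP 1 5           # TCP connections & retransmits
  top -b -n 1 | head -20        # batch-mode process snapshot
} > perf60s.$(date +%s).out 2>&1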

60s: uptime, dmesg, vmstat

$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

$ dmesg -T | tail
[...] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[...] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[...] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd      free   buff  cache   si   so    bi    bo    in    cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5     6    10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284  4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0  9501  2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900  2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898  4840 98  1  1  0  0

60s: mpstat

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)
[per-CPU table truncated: %irq 0.00 across CPUs; %idle 0.78, 0.99, 2.00, 1.00, 3.03 – the CPUs are nearly 100% busy in user time]

60s: pidstat

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

07:41:02 PM   UID       PID    %usr %system  ...  Command
07:41:03 PM     0      4214    5.66    5.66  ...  mesos-slave
07:41:03 PM     0      4354    0.94    0.94  ...  java
07:41:03 PM     0      6521 1596.23    1.89  ...  java
07:41:03 PM     0      6564   [...]   [...]  ...  java
07:41:03 PM 60004     [...]   [...]   [...]  ...  pidstat
[...]

60s: iostat

$ iostat -xmdz 1
Linux 3.13.0-29 ([...])  08/18/2014  _x86_64_  (16 CPU)

Device: rrqm/s  wrqm/s       r/s   w/s   rMB/s  wMB/s  avgqu-sz   await  ...
[...]     0.00    0.00  15299.00  3.00  338.17   0.01    126.09   86.40  ...
[...]     0.00    0.00  15271.00  3.00  336.65   0.01     99.31   86.00  ...
[...]     0.00    0.00  31082.00  3.00  678.45   0.01     [...]
[...]

(annotated on the slide: r/s, w/s, rMB/s, wMB/s show the workload; avgqu-sz, await the resulting performance)

60s: free, sar -n DEV

$ free -m
             total     used     free   shared  buffers  cached
Mem:        245998    [...]    [...]       83       59     541
-/+ buffers/cache:    [...]    [...]
Swap:        [...]

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

12:16:48 AM   IFACE    rxpck/s   txpck/s    rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s
12:16:49 AM    eth0   18763.00   5032.00  20686.42   [...]     0.00     0.00      0.00
12:16:49 AM      lo      14.00     14.00      1.36    1.36     0.00     0.00      0.00
12:16:49 AM docker0       0.00      0.00      0.00    0.00     0.00     0.00      0.00

12:16:49 AM   IFACE    rxpck/s   txpck/s    rxkB/s  txkB/s  rxcmp/s  txcmp/s  rxmcst/s
12:16:50 AM    eth0   19763.00   5101.00  21999.10   [...]     0.00     0.00      0.00
12:16:50 AM      lo      20.00     20.00      3.25    3.25     0.00     0.00      0.00
12:16:50 AM docker0       0.00      0.00      0.00    0.00     0.00     0.00      0.00

60s: sar -n TCP,ETCP

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  [...]  _x86_64_  (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00   8359.00   6039.00

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00     [...]
[...]

60s: top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241 total, 24921688 used, 22698073 free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached

  PID USER      PR  NI    VIRT    RES  SHR S  %CPU %MEM     TIME+ COMMAND
[...] root      20   0  0.227t 0.012t      S  3090  5.2  29812:58 java
[...] root      20   0 2722544  64640      S  23.5  0.0 233:35.37 mesos-slave
[...] titancl+  20   0   24344   2332      R   1.0  0.0   0:00.07 top
[...] root      20   0 38.227g 547004      S   0.7  0.2   2:02.74 java
[...] root      20   0 20.015g 2.682g      S   0.3  1.1  33:14.42 java
    1 root      20   0   33620   2920      S   0.0  0.0   0:03.82 init
    2 root      20   0       0      0      S   0.0  0.0   0:00.02 kthreadd
[...] root      20   0       0      0      S   0.0  0.0   0:05.35 ksoftirqd/0
[...] root      20   0       0      0      S   0.0  0.0   0:00.00 kworker/0:0H
[...] root      20   0       0      0      S   0.0  0.0   0:06.94 kworker/u256:0
[...] root      20   0       0      0      S   0.0  0.0   2:38.05 rcu_sched

Other Analysis in 60s
We need such checklists for:
– Java
– Cassandra
– MySQL
– Nginx
– etc.
Can follow a methodology:
– Process of elimination
– Workload characterization
– Differential diagnosis
– Some summaries: http://www.brendangregg.com/methodology.html
Turn checklists into dashboards (many do exist)

7. Linux Disk Checklist

Linux Disk Checklist
1. iostat -xnz 1
2. vmstat 1
3. df -h
4. ext4slower 10
5. bioslower 10
6. ext4dist 1
7. biolatency 1
8. cat /sys/devices/…/ioerr_cnt
9. smartctl -l error /dev/sda1

Linux Disk Checklist
1. iostat -xnz 1                     any disk I/O? if not, stop looking
2. vmstat 1                          is this swapping? or, high sys time?
3. df -h                             are file systems nearly full?
4. ext4slower 10                     (zfs*, xfs*, etc.) slow file system I/O?
5. bioslower 10                      if so, check disks
6. ext4dist 1                        check distribution and rate
7. biolatency 1                      if interesting, check disks
8. cat /sys/devices/…/ioerr_cnt      (if available) errors
9. smartctl -l error /dev/sda1       (if available) errors
Another short checklist. Won't solve everything. FS focused.
ext4slower/dist, bioslower, are from bcc/BPF tools.
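Example invocations for the bcc/BPF steps, a sketch: tool paths vary by install, /usr/share/bcc/tools being common when using the bcc-tools package.

sudo /usr/share/bcc/tools/ext4slower 10    # trace ext4 ops slower than 10 ms
sudo /usr/share/bcc/tools/ext4dist 1       # ext4 latency histograms, 1 s intervals
sudo /usr/share/bcc/tools/biolatency 1 5   # five 1 s block-I/O latency histograms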

ext4slower
ext4 operations slower than the threshold:

# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM   PID    T BYTES   OFF_KB  LAT(ms)  FILENAME
06:49:17 bash   3616   R 128     [...]
06:49:17 cksum  3616   R 39552   [...]
06:49:17 cksum  3616   R 96      [...]
06:49:17 cksum  3616   R 96      [...]
06:49:17 cksum  3616   R 10320   [...]
06:49:17 cksum  3616   R 65536   [...]
06:49:17 cksum  3616   R 55400   [...]
06:49:17 cksum  3616   R 36792   [...]   aclocal-1.14
06:49:17 cksum  3616   R 15008   [...]   acpi_listen
[...]

Better indicator of application pain than disk I/O
Measures & filters in-kernel for efficiency using BPF
– From https://github.com/iovisor/bcc

BPF is coming
Free your mind

BPF
That file system checklist should be a dashboard:
– FS & disk latency histograms, heatmaps, IOPS, outlier log
Now possible with enhanced BPF (Berkeley Packet Filter)
– Built into Linux 4.x: dynamic tracing, filters, histograms
System dashboards of 2017 should look very different

8. Linux Network Checklist

Linux Network Checklist
1. sar -n DEV,EDEV 1
2. sar -n TCP,ETCP 1
3. cat /etc/resolv.conf
4. mpstat -P ALL 1
5. tcpretrans
6. tcpconnect
7. tcpaccept
8. netstat -rnv
9. check firewall config
10. netstat -s

Linux Network Checklist
1. sar -n DEV,EDEV 1
2. sar -n TCP,ETCP 1
3. cat /etc/resolv.conf
4. mpstat -P ALL 1
5. tcpretrans
6. tcpconnect
7. tcpaccept
8. netstat -rnv
9. check firewall config
10. netstat -s
tcp*, are from bcc/BPF tools
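A few quick triage one-liners for the non-BPF steps (standard sysstat/net-tools; exact counter names vary by kernel version):

sar -n DEV,EDEV 1 5              # throughput plus interface errors & drops
sar -n TCP,ETCP 1 5              # new connections & retransmit rates
netstat -s | grep -i retrans     # cumulative TCP retransmit counters
netstat -rnv                     # routing table sanity check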

tcpretrans
Just trace kernel TCP retransmit functions for efficiency:

# ./tcpretrans
TIME     PID    IP LADDR:LPORT          T> RADDR:RPORT  STATE
01:55:05 0    [...] 10.153.223.157:22   R> [...]        ESTABLISHED
01:55:05 0    [...] 10.153.223.157:22   R> [...]        ESTABLISHED
01:55:17 0    [...] 10.153.223.157:22   R> [...]        ESTABLISHED
[...]

From either bcc (BPF) or perf-tools (ftrace, older kernels)
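A sketch of getting either version (paths follow each repo's layout; the bcc install location is an assumption, as above):

git clone --depth 1 https://github.com/brendangregg/perf-tools
sudo ./perf-tools/net/tcpretrans           # ftrace-based, works on older kernels
sudo /usr/share/bcc/tools/tcpretrans       # BPF-based, needs Linux 4.x + bcc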

9. Linux CPU Checklist

(too many lines – should be a utilization heat map)


perf script
[...]
java 14327 [...] SpinPause (/usr/lib/jvm/java-8[...])
java 14315 [...] SpinPause (/usr/lib/jvm/java-8[...])
java 14310 [...] SpinPause (/usr/lib/jvm/java-8[...])
java 14332 [...] pthread_cond_wait@@GLIBC_2.3.2
java 14341 [...] ClassLoaderDataGraph::do_unloa[...]
java 14341 [...] ClassLoaderData::free_dealloca[...]
java 14341 [...] nmethod::do_unloading(BoolObje[...]
java 14341 [...] Assembler::locate_operand(unsi[...]
java 14341 [...] nmethod::do_unloading(BoolObje[...]
java 14341 [...] CodeHeap::block_start(void*) c[...]
[...]

Linux CPU Checklist
1. uptime
2. vmstat 1
3. mpstat -P ALL 1
4. pidstat 1
5. CPU flame graph
6. CPU subsecond offset heat map
7. perf stat -a -- sleep 10

Linux CPU Checklist
1. uptime
2. vmstat 1                            system-wide utilization, run q length
3. mpstat -P ALL 1                     CPU balance
4. pidstat 1                           per-process CPU
5. CPU flame graph                     CPU profiling
6. CPU subsecond offset heat map       look for gaps
7. perf stat -a -- sleep 10            IPC, LLC hit ratio
htop can do 1-4
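For step 7, a hedged example of reading IPC (perf prints it as "insn per cycle"; PMCs must be exposed, which many cloud instances don't do):

sudo perf stat -a -- sleep 10
# in the output, instructions / cycles is the IPC; as a rough rule of
# thumb, IPC well below 1 suggests memory stalls, while above 1 suggests
# the workload is instruction bound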

htop

CPU Flame Graph

perf_events CPU Flame Graphs
We have this automated in Netflix Vector:

git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg

Flame graph interpretation:
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random, or hue can be a dimension (eg, diff)
– Top edge is on-CPU, beneath it is ancestry
Can also do Java & Node.js. Differentials.
We're working on a d3 version for Vector
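For the Java case mentioned above, a sketch of the extra steps (assumes JDK 8u60+ for -XX:+PreserveFramePointer, and a JIT symbol map such as one generated by perf-map-agent; both are assumptions about the environment):

java -XX:+PreserveFramePointer ...         # so perf can walk Java stacks
perf record -F 99 -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl --color=java > java.svg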

10. Tools Method: An Anti-Methodology

Tools Method
1. RUN EVERYTHING AND HOPE FOR THE BEST
For SRE response: a mental checklist to see what might have been missed (no time to run them all)

Linux Perf Observability Tools

Linux Static Performance Tools

Linux perf-tools (ftrace, perf)

Linux bcc tools (BPF)
Needs Linux 4.x, CONFIG_BPF_SYSCALL=y

11. USE Method: A Methodology

The USE Method
For every resource, check: Utilization (%), Saturation, Errors
Definitions:
– Utilization: busy time
– Saturation: queue length or queued time
– Errors: easy to interpret (objective)
Used to generate checklists. Starts with the questions, then finds the tools.

USE Method for Hardware
For every resource, check:
1. Utilization
2. Saturation
3. Errors
Including busses & interconnects

(http://www.brendangregg.com/USEmethod/use-linux.html)
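Condensed from that page, a starting-point sketch using standard Linux tools (one line per resource/metric pair):

mpstat -P ALL 1        # CPU utilization, per CPU
vmstat 1               # CPU saturation ("r" > CPU count); memory saturation (si/so)
free -m                # memory utilization
iostat -xz 1           # disk utilization (%util) and saturation (avgqu-sz)
sar -n DEV 1           # network utilization: rxkB/s, txkB/s vs line rate
dmesg -T | tail        # errors: OOM kills, device errors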

USE Method for Distributed Systems
Draw a service diagram, and for every service:
1. Utilization: resource usage (CPU, network)
2. Saturation: request queueing, timeouts
3. Errors
Turn into a dashboard

Netflix Vector
Real time instance analysis tool
– https://github.com/netflix/vector
– [...]ctor-netflixs-on-host.html
USE method-inspired metrics
– More in development, incl. flame graphs
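To try Vector against an instance: it reads metrics over Performance Co-Pilot's web API, so the target needs PCP running; a sketch assuming Red Hat-style package names (they differ by distro):

sudo yum install pcp pcp-webapi     # assumption: package naming varies
sudo service pmcd start             # PCP metric collector
sudo service pmwebd start           # web API, port 44323 by default
# then point the Vector web UI at http://<instance>:44323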

Netflix Vector

[Vector screenshot: utilization and saturation metrics]

12. Bonus: External Factor Checklist

External Factor Checklist
1. Sports ball?
2. Power outage?
3. Snow storm?
4. Internet/ISP down?
5. Vendor firmware update?
6. Public holiday/celebration?
7. Chaos Kong?
Social media searches (Twitter) often useful
– Can also be NSFW

Take Aways
Checklists are great
– Speed, Completeness, Starting/Ending Point, Training
– Can be ad hoc, or from a methodology (USE method)
Service dashboards
– Serve as checklists
– Metrics: Load, Errors, Latency, Saturation, Instances
System dashboards with Linux BPF
– Latency histograms & heatmaps, etc. Free your mind.
Please create and share more checklists

References
Netflix Tech Blog: Linux Performance & BPF tools
Flame Graphs: [...]aphs.html
Heat maps: http://www.brendangregg.com/heatmaps.html
BPF tools: github.com/iovisor/bcc#tools
USE Method Linux: http://www.brendangregg.com/USEmethod/use-linux.html
Books:
– Beyer, B., et al. Site Reliability Engineering. O'Reilly, Apr 2016
– Gawande, A. The Checklist Manifesto. Metropolitan Books, 2008
– Gregg, B. Systems Performance. Prentice Hall, 2013 (more checklists & methods!)
Thanks: Netflix Perf & Core teams for predash, pretriage, Vector, etc.

Netflix is hiring SREs!
