A Working Theory-of-Monitoring
LISA 2013
Caskey L. Dickson (caskey@google.com)
Site Reliability Engineer, Google, Inc.

Metrics
"the assignment of numerals to things so as to represent facts and conventions about them" – S. S. Stevens, 1946

Why a “theory”
Monitoring seems easy. It’s not. Why?
If successful, we should be able to sensibly map many monitoring methods/modes into a good model with fidelity.

What do we monitor? (What’s a metric?)
Named value at some time.
Metric identity/name: a k-tuple within an identity space, attached to each value
  www-1.na-east.example.com, httpd(3321), foo.example.com, 200-ok-count
  hostname, process, vhost, name
Metric values (overlapping):
  Counters, Gauges, Percentiles
  Nominal, Ordinal, Interval, Ratio
  Derived
count@20131005T142155.867Z 8505936

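The "named value at some time" idea can be sketched as a small data model. This is an illustrative sketch only: the class names `MetricId` and `Sample` and the helper `parse_timestamp` are hypothetical, and the four identity fields are simply the ones in the slide's example tuple.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricId:
    """Identity k-tuple; here k = 4, mirroring the slide's example."""
    hostname: str
    process: str
    vhost: str
    name: str


@dataclass(frozen=True)
class Sample:
    """One named value at some time."""
    metric: MetricId
    timestamp: datetime
    value: float


def parse_timestamp(text: str) -> datetime:
    """Parse the compact UTC timestamp form used on the slide."""
    parsed = datetime.strptime(text, "%Y%m%dT%H%M%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


sample = Sample(
    metric=MetricId("www-1.na-east.example.com", "httpd(3321)",
                    "foo.example.com", "200-ok-count"),
    timestamp=parse_timestamp("20131005T142155.867Z"),
    value=8505936,
)
```

Making the identity a frozen (hashable) tuple-like object means it can be used directly as a dictionary key when samples are grouped later.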
How can we monitor? (What’s a metric?)
Resolution: How frequently are you reading a metric? Every 6 seconds? 6 minutes?
Latency: After reading, how long before we act on them? Seconds, minutes, hours?
Diversity: Are you collecting many different metrics? 10, 25, 50, 100, 10K, 10M?

Why do we monitor?
Operational Health/Response (R+,L+,D+): High Resolution, Low Latency, High Diversity
Quality Assurance/SLA (R+,L-,D+): High Resolution, High Latency, High Diversity
Capacity Planning (R-,L-,D+): Low Resolution, High Latency, High Diversity
Product Management (R-,L-,D-): Low Resolution, High Latency, Low Diversity
What about these? (R-,L+,D-) LB, (R-,L+,D+), (R+,L+,D-), (R+,L-,D-)

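The resolution/latency/diversity profiles above amount to a lookup table. A minimal sketch, assuming each axis is reduced to a boolean (high resolution, low latency, high diversity); the names `Profile`, `USE_CASES` and `use_case` are hypothetical.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    high_resolution: bool  # metrics are read frequently
    low_latency: bool      # readings are acted on quickly
    high_diversity: bool   # many different metrics are collected


# The four named use cases from the slide, plus the one labelled "LB".
USE_CASES = {
    Profile(True, True, True): "Operational Health/Response",
    Profile(True, False, True): "Quality Assurance/SLA",
    Profile(False, False, True): "Capacity Planning",
    Profile(False, False, False): "Product Management",
    Profile(False, True, False): "LB",
}


def use_case(profile: Profile) -> str:
    """Return the slide's name for a profile, if it gives one."""
    return USE_CASES.get(profile, "not named on the slide")
```

Three of the eight possible combinations have no name on the slide, which is the point of its "What about these?" question.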
Monitoring at scale
[Diagram: web servers, databases, a DNS server and a load balancer, all reporting to monitoring]
One web server and one database at 25 metrics/server = 50 metrics, or 0.16 metrics/second
100M active daily users = 200K peak QPS
@ 20 QPS/server = 10,000 servers = 250,000 metrics, or 833 metrics/second
X 12 ‘types’ of servers = 3,000,000 metrics, or 10,000 metrics/second
X 8/6 sites (N+2) = 4,000,000 metrics, or 13,333 metrics/second
O(10K) metrics/second, O(32MB) data/sweep
Ops @ 1 minute = O(50K) metrics/second, O(320MB)/sweep, O(460GB)/24 hours

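The scaling arithmetic can be reproduced in a few lines. This is a sketch under stated assumptions: the 5-minute sweep is inferred from 50 metrics giving 0.16 metrics/second, and the 8 bytes per value is inferred from 4,000,000 metrics giving a 32 MB sweep; neither is stated on the slides. Multiplying the inputs through (10,000 servers × 25 metrics/server) gives 250,000 metrics per server type, which is the figure consistent with the 3,000,000 total. The 320 MB one-minute sweep size is taken from the slide as given.

```python
SWEEP_SECONDS = 300   # assumption: 5-minute sweep (50 metrics / 0.16 per second)
BYTES_PER_VALUE = 8   # assumption: 8 bytes per value, which yields a 32 MB sweep


def fleet_metrics(peak_qps=200_000, qps_per_server=20, metrics_per_server=25,
                  server_types=12, sites_deployed=8, sites_needed=6):
    """Return (servers per type, total metrics across all types and sites)."""
    servers = peak_qps // qps_per_server
    per_type = servers * metrics_per_server
    total = per_type * server_types * sites_deployed // sites_needed
    return servers, total


servers, total_metrics = fleet_metrics()
metrics_per_second = total_metrics / SWEEP_SECONDS
sweep_bytes = total_metrics * BYTES_PER_VALUE

# Operations wants one-minute resolution; 320 MB/sweep is the slide's own figure.
ops_metrics_per_second = total_metrics / 60
ops_bytes_per_day = 320_000_000 * 24 * 60

print(servers, total_metrics, round(metrics_per_second), sweep_bytes, ops_bytes_per_day)
```

At one-minute resolution the computed rate is roughly 67K metrics/second, the same order of magnitude as the slide's O(50K).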
What do we monitor? (recap)
Named, timestamped values of differing types
Gathered at high resolution
Large quantities
Many different consumers (downsampling, filtering, aggregation)
Reliably

[Diagram: the layers of a monitoring system] Sensing / Measurement, Collection, Analysis / Computation, Alerting / Escalation, Visualization, Storage, Configuration

Sensing / Measurement
The creation of metrics at some minimum level of abstraction. Generally raw counters plus some attributes.
Different systems gather data at different speeds. top/ps/netstat are very immediate, sar somewhat less so, nagios much less so.
Different systems have different concepts of an individual unit for metric identity.
No consistent interface.

Storage
Placing of time series in a (readily?) accessible format.
Raw, aggregated and post-computation metrics.
Occurs in different formats at different stages:
  /var/log/syslog, /var/log/apache/access_log, /var/www/mrtg/*, /var/lib/rrdb/*.rrd, mysql/postgresql
I/O throughput
Structure limits analysis/visualization options.

Collection
Bringing together many individual metrics in one place to support analysis.
Metric identity needs to remain meaningful after aggregation.
Key for scalability.
Many transports, smart and dumb: multicast, TCP, rrdcached, SFTP, rsync

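One way to keep metric identity meaningful after aggregation is to carry the surviving identity fields along as the key of the aggregated series. A minimal sketch, assuming identities are plain dictionaries; the function `aggregate` and the sample values are hypothetical.

```python
from collections import defaultdict


def aggregate(samples, keep):
    """Sum values across all samples that agree on the identity fields in `keep`.

    Each result key is a tuple of (field, value) pairs, so the aggregated
    series still says what it describes.
    """
    totals = defaultdict(int)
    for identity, value in samples:
        key = tuple((field, identity[field]) for field in keep)
        totals[key] += value
    return dict(totals)


samples = [
    ({"hostname": "www-1", "process": "httpd(3321)",
      "vhost": "foo.example.com", "name": "200-ok-count"}, 10),
    ({"hostname": "www-1", "process": "httpd(3322)",
      "vhost": "foo.example.com", "name": "200-ok-count"}, 20),
    ({"hostname": "www-2", "process": "httpd(4100)",
      "vhost": "foo.example.com", "name": "200-ok-count"}, 30),
]

per_vhost = aggregate(samples, keep=("vhost", "name"))
per_host = aggregate(samples, keep=("hostname", "vhost", "name"))
```

Dropping `process` and `hostname` collapses three series into one per-vhost total, while the key still records which vhost and which counter the number belongs to.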
Analysis
Extraction of meaning from the raw data.
Often focused upon finding and detecting features or anomalies.
Some anomalies are important, others are merely interesting.
CPU constrained for throughput/depth.
RAM constrained for metric volume.
Lots of interesting research in autocorrelation.

Alerting & Escalation
When anomalies are detected, something has to deal with promulgation of those conditions to interested parties.
Some anomalies are urgent (short-term SLO critical), others are merely important.
“Urgent” anomalies reflect conditions that, without immediate operator intervention, will lead to an outage or SLO excursion. Something is responsible for being noisy until someone comes to help.
Ideally this happens as infrequently as possible.

Visualization
Meaningful visualization of the raw data can be the difference between staying within or exceeding your SLO.
Viewing more than 3 dimensions can be problematic for those of us who are still human.
Goal-oriented.
Read and apply your Tufte/Few.

Visualization and Actionability
Some visualizations are less than useful.
Disk space is a commonly graphed metric which is un-actionable without derivatives.
Not all views have the same taxonomy.

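The disk-space point can be made concrete: the raw level says little, but its derivative yields a time-to-full estimate that someone can act on. A minimal sketch, assuming samples are (unix seconds, bytes used) pairs and a simple two-point slope; the function names and numbers are hypothetical.

```python
def fill_rate(samples):
    """Average growth in bytes per second between the first and last sample."""
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    return (used1 - used0) / (t1 - t0)


def seconds_until_full(samples, capacity_bytes):
    """Estimate time to full, or None if usage is flat or shrinking."""
    rate = fill_rate(samples)
    if rate <= 0:
        return None
    remaining = capacity_bytes - samples[-1][1]
    return remaining / rate


# Hypothetical readings one hour apart on a 1 TB volume.
readings = [(0, 700_000_000_000), (3600, 701_000_000_000)]
eta = seconds_until_full(readings, capacity_bytes=1_000_000_000_000)
```

A production version would fit a slope over many samples rather than two, but even this form turns "70% full" into "full in about twelve days at the current rate".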
Configuration
Affects every layer.
Needs configuration management.
Complicates distributed systems.
Limits change velocity.

Why do we monitor? (repeat)
Operational Health/Response (R+,L+,D+): High Resolution, Low Latency, High Diversity
Quality Assurance/SLA (R+,L-,D+): High Resolution, High Latency, High Diversity
Capacity Planning (R-,L-,D+): Low Resolution, High Latency, High Diversity
Product Management (R-,L-,D-): Low Resolution, High Latency, Low Diversity

Product Management (R-,L-,D-)
Mostly synthesized/reprocessed metrics (KPIs vs. SLIs)
Lots of historic data in storage for long-term views
Analysis of synthesized metrics from concrete metrics
  7-day actives
  Conversion rates
Easy-to-understand visualizations of resulting metrics

Capacity Planning (R-,L-,D+)
Evaluation of current serving capacity
Calculation of proxy metrics
Impact of changes to serving capacity
Cost per user
Efficiency
Alerting when capacity limits are approaching

Quality Assurance/SLA (R+,L-,D+)
Includes developer support
Collect data from both narrow and wide views (sensing high-resolution process behavior and system metrics)
Offline and real-time performance analysis, tracing (collection and storage of data from diverse runs)
Not necessarily real-time
Useful visualizations to aid understanding

Operational Health/Response (R+,L+,D+)
The hardest use case
Immediate, up-to-date metrics (low-latency collection)
Encompassing the entire fleet (broad collection coverage, many sensors incorporated)
Real-time computation of thresholds and alerts (high-speed analysis)
Reliable and flexible alerting
Storage of enough timeseries at high enough resolution for comparison (XXXGB/day * 730 days)
Simple configuration of global monitoring perspective

A moment please.
All the systems to be discussed have inherent, undeniable value. I have personally used and benefited from them, and I mean no disrespect to their implementers and maintainers.
I personally use these systems, and in the past I have relied upon them for production services I was responsible for.
This is NOT a criticism of those products, but rather an indication of where they stop short of one particular hypothetical ideal.

/bin/top (host process health)
Sensing: /proc, /sys, syscalls(1)
Collection: while(true);
Analysis: Summing and sorting
Alerting: Sort to top
Visualization: ordered lists, dynamic sorting
Storage: none
Configuration: runtime shortcuts
Sample output:
  top - 18:54:30 up 67 days, 3:05, 2 users, load average: 1.60, 1.03, 0.48
  Tasks: 113 total, 1 running, 112 sleeping, 0 stopped, 0 zombie
  Cpu0 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
  Cpu1 : 0.7%us, 1.3%sy, 0.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
  Cpu2 : 2.7%us, 7.7%sy, 0.0%ni, 5.0%id, 84.6%wa, 0.0%hi, 0.0%si, 0.0%st
  Cpu3 : 0.0%us, 1.3%sy, 0.0%ni, 97.3%id, 1.3%wa, 0.0%hi, 0.0%si, 0.0%st
  Mem: 503132k total, 496152k used, 6980k free, 41340k buffers
  Swap: 0k total, 0k used, 0k free, 195772k cached
    PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME COMMAND
  15070 httpd    20  0  105m  87m 1144 D    8 17.8  0:06.18 httpd
   1032 mediatom 20  0 1160m  47m  816 S    0  9.6 38:51.31 mediatomb
   6521 root     20  0 83476  46m  36m S    0  9.5  0:04.65 apt-get
   6643 caskey   20  0 26840 8000 1616 S    0  1.6  0:00.54 bash
   6236 root     20  0  107m 4236 3156 S    0  0.8  0:00.06 sshd
    456 syslog   20  0  244m 3280  460 S    0  0.7  4:46.73 rsyslogd
   1303 root     20  0  743m 3080  284 S    0  0.6 14:36.75 ushare
  31304 root     20  0 2042m 2584 1492 S    0  0.5  0:00.14 console-kit-dae
      1 root     20  0 24432 1768  696 S    0  0.4  0:22.14 init

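The sensing layer of top is essentially a reader of /proc. A minimal sketch of turning the per-CPU lines of Linux's /proc/stat into raw counters (values are cumulative clock ticks since boot); the function names and the sample line are hypothetical, and only the first eight documented fields are kept.

```python
CPU_FIELDS = ("user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Split one 'cpu...' line of /proc/stat into (label, counters)."""
    parts = line.split()
    label, values = parts[0], parts[1:]
    # zip() stops at the shorter sequence, so extra trailing fields are ignored.
    return label, {field: int(v) for field, v in zip(CPU_FIELDS, values)}


def read_cpu_counters(path="/proc/stat"):
    """Read all per-CPU counters; only works on a Linux host."""
    with open(path) as stat_file:
        return dict(parse_cpu_line(line) for line in stat_file
                    if line.startswith("cpu"))


# Hypothetical sample line, so the parser can be exercised anywhere.
label, counters = parse_cpu_line("cpu2 4705 356 584 3699 23 23 0 0 0 0")
```

These are raw counters; percentages like those top displays come from differencing two reads and dividing by the total ticks elapsed.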
/bin/sar (host health)
Basically some of top, as timeseries
[Sample sar output from host dev-db (Linux 2.6.18-194.el5PAE, 03/26/2011): readings taken at one-second intervals (07:28:06 to 07:28:09) with an Average row, covering CPU (%user, %system) and memory (kbmemfree, %memused)]

*trace (process behavior)
Sensing: dtrace/strace/ltrace process wrapper
Collection: single instance
Analysis: None
Alerting: N/A
Visualization: None
Storage: None
Configuration: command line
Sample output:
  [pid 11783] __libc_start_main(0x407420, 1, 0x7fff75b6aad8, 0x443cc0, 0x443d50 <unfinished ...>
  [pid 11783] geteuid() = 1000
  [pid 11783] getegid() = 1000
  [pid 11783] getuid() = 1000
  [pid 11783] getgid() = 1000
  [pid 11783] setuid(1000) = 0
  [pid 11783] malloc(91) = 0x00cf8010
  [pid 11783] XtSetLanguageProc(0, 0, 0, 0x7f968c9a3740, 1) = 0x7f968bc16220
  [pid 11783] ioctl(0, 21505, 0x7fff75b6a960) = 0
  [pid 11783] XtSetErrorHandler(0x42bbb0, 0x44f99c, 0x669f80, 146, 0x7fff75b6a72c) = 0
  [pid 11783] XtOpenApplication(0x670260, 0x44f99c, 0x669f80, 146, 0x7fff75b6a72c) = 0xd219a0
  [pid 11783] IceAddConnectionWatch(0x42adc0, 0, 0, 0x7f968c9a3748, 0 <unfinished ...>
  [pid 11783] IceConnectionNumber(0xd17ec0, 0, 1, 0xcfb138, 0xd17c00) = 4
  [pid 11783] <... IceAddConnectionWatch resumed> ) = 1
  [pid 11783] XtSetErrorHandler(0, 0, 1, 0xcfb138, 0xd17c00) = 0
  [pid 11783] XtGetApplicationResources(0xd219a0, 0x6701c0, 0x66b220, 34, 0) = 0
  [pid 11783] strlen("off") = 3

MRTG
Sensing: SNMP, subprocess, 2 metrics max
Collection: Centralized scraping over SNMP; local processes
Analysis: Basic math
Alerting: None
Visualization: day/week/month/year graphs, 2 variables

MRTG
Operations: Ideal for netops, no alerting though
Product Management: None
Capacity Planning: Ideal for network ops and host health
Q/A, SLA: None

Nagios
Sensing: Subprocesses and plugins, LOTS of plugins
Collection: Centralized scraping; support for forwarding metrics
Analysis: At sensing time
Alerting: Configurable alarms and emails
Visualization: Basic graphs of check results; dependency chains

Nagios
Operations: Good for simple operations, basic alert support; redundant (N+M) configurations more difficult
Product Management: N/A, heavily focused on up/down checks
Capacity Planning: N/A
Q/A, SLA: N/A, poor/no timeseries visualization

Ganglia
Sensing: gmond on nodes, extensions/plugins
Collection: multicast, UDP, TCP polls
Analysis: value threshold, external (nagios)
Storage: rrdtool/rrdcached
Alerting: N/A
Visualization: ganglia-web

Ganglia
Operations: Unsuited, no alerting built in; can feed nagios/other
Product Management: Cluster ops focus
Capacity Planning: Well suited
Q/A, SLA: Historic views

Cacti (MRTG+)
Sensing: Poller, cron based
Collection: Primarily SNMP
Analysis: Basic summing
Storage: rrdtool, MySQL
Alerting: N/A
Visualization: Static graphs

Cacti (MRTG+)
Operations: Lack of alerts limits utility to diagnostics
Product Management: Well suited
Capacity Planning: Well suited
Q/A, SLA: Well suited

Sensu
Sensing: Arbitrary JSON emitters, “Checkers”
Collection: RabbitMQ JSON event bus
Analysis: Handlers
Storage: N/A
Alerting: Handlers
Visualization: N/A

Sensu
Operations: Configurable collection layer, handlers and checkers
Product Management: N/A
Capacity Planning: N/A
Q/A, SLA: Can feed live data to other technologies

Logstash
Sensing: Deployable log thrower
Collection: MQ
Alerting: N/A
Visualization: Kibana (ES)

Logstash
Operations: Historical view of systems, searching for incident info
Product Management: N/A
Capacity Planning: N/A
Q/A, SLA: Tracing of individual problem cases, cross-correlation among different log sets

OpenTSDB
Sensing: Custom clients
Collection: TSD RPC
Analysis: External
Storage: Complete storage layer
Alerting: N/A
Visualization: N/A

OpenTSDB
Operations: Can handle the volume
Product Management: N/A
Capacity Planning: N/A
Q/A, SLA: N/A
Sample data:
  mysql.bytes_received 1287333217 327810227706 schema=foo host=db1
  mysql.bytes_sent 1287333217 6604859181710 schema=foo host=db1
  mysql.bytes_received 1287333232 327812421706 schema=foo host=db1
  mysql.bytes_sent 1287333232 6604901075387 schema=foo host=db1
  mysql.bytes_received 1287333321 340899533915 schema=foo host=db2
  mysql.bytes_sent 1287333321 5506469130707 schema=foo host=db2

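The sample rows follow the shape metric name, unix timestamp, value, then tag pairs. A minimal parsing sketch, assuming the tags are written as key=value (as in OpenTSDB's text format) and that values are integers unless they contain a decimal point; the function name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split one data point into (metric, timestamp, value, tags)."""
    metric, timestamp, value, *tag_parts = line.split()
    tags = dict(part.split("=", 1) for part in tag_parts)
    number = float(value) if "." in value else int(value)
    return metric, int(timestamp), number, tags


point = parse_line(
    "mysql.bytes_received 1287333217 327810227706 schema=foo host=db1"
)
```

The tags play the role of the identity k-tuple from earlier in the talk: the metric name plus `schema` and `host` together identify the series.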
Sensing: N/A
Collection: N/A
Analysis: N/A
Storage: N/A
Alerting: N/A
Visualization: Very nice interactive charts of prepared data sets

Operations: Data exploration of limited value
Product Management: Good discovery and goal seeking
Capacity Planning: Interactive searching for hidden dependencies
Q/A, SLA: Great potential for exploring traces and dependencies

Graphite
Sensing: DIY, name value
Collection: Custom messaging protocol
Analysis: N/A
Storage: Carbon + Whisper, file-per-metric
Alerting: N/A
Visualization: Static config of complex graphs

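The "DIY" sensing side can be very small in practice. A minimal sketch of a do-it-yourself emitter, assuming Carbon's plaintext line format (`path value timestamp`, newline-terminated) and its default plaintext port 2003; the host name, metric path and function names are hypothetical.

```python
import socket
import time


def format_line(path, value, timestamp=None):
    """Build one plaintext data point: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send(lines, host="localhost", port=2003):
    """Send data points to a Carbon plaintext listener (host/port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path and an example timestamp.
line = format_line("www-1.httpd.200-ok-count", 8505936, 1380982915)
```

Calling `send([line])` needs a listener running at the given address; formatting is kept separate so it can be checked without a network.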
Graphite
Operations: Command-line graph creation, limited interactive web
Product Management: Great for visualization
Capacity Planning: Also good for visualization
Q/A, SLA: Can visualize, but lacks interactivity

Shinken (Nagios + Graphite + CM)
Sensing: Nagios
Analysis: Reactionner/Broker
Storage: RRDtool
Alerting: Reactionner
Visualization: Sadly not much better than Nagios

Shinken
Operations: Much better CM than Nagios
Product Management: N/A
Capacity Planning: N/A
Q/A, SLA: N/A

“Cloud Monitoring”
Lots and lots of vendors:
  AlertSite, Bijk, CopperEgg, Dotcom Monitor, GFI Cloud, Kaseya, LogicMonitor, Monitis, MonitorGrid, Nimsoft, ManageEngine, Panopta, Pingdom, Scout, ServerDensity, Shalb SPAE, CloudTest, ...
SaaS offerings
Remote collection, local agents, push and pull
Implementation black boxes

In the Real World
All of the above.
Nagios + Graphite + Sensu + Logstash + Ganglia
Interoperability is limited at the interface layer.
MQ-based solutions are promising glue.
Interactive graphs are inspiring.

[Diagram: the monitoring layers (Sensing / Measurement, Collection, Analysis / Computation, Alerting / Escalation, Visualization) overlaid with the use cases Operations, QA/SLA and Capacity Planning]

Thanks!
Questions, Comments, Feedback, Hate Mail
er,sf,github,.}
Join us! go to

Appendix
Extra stuff, just in case.
Here, have a sleepy cat.

100M users explained
100M users
each user uses the app 10 times a day (amortized; some use more, some use less)
1 billion user accesses per day
each user access causes 10 requests (HTML page, images, dynamic requests, query flow)
so 10 billion requests a day
means an average of about 100,000 queries a second
actually not, because the internet users are not distributed equally around the world and don't use the app at the same times equally
so more like 200,000 queries a second
let's say each query requires 10 disk seeks
what do we need to serve that?

10K servers explained
let's say a disk does about 100 disk seeks per second
2,000,000 seeks per second mean 20,000 disks
we could try cramming 20,000 disks into one server
  but that'd be a very large and expensive server
  and we found out a while ago that it's more economical to use lots of small servers rather than one big one
  also called "warehouse scale computing"
at 2 disks per server, 10,000 servers
40 per rack
fills 250 racks
about 150 meters of rack space

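The appendix arithmetic, from users down to rack space, can be checked in one pass. A sketch under stated assumptions: the 200,000 peak QPS is the slides' own estimate rather than something derived, and the 0.6 m rack width is an assumption chosen because it reproduces the 150 m figure.

```python
SECONDS_PER_DAY = 86_400
RACK_WIDTH_M = 0.6  # assumption: not stated on the slide

users = 100_000_000
accesses_per_user_per_day = 10
requests_per_access = 10

requests_per_day = users * accesses_per_user_per_day * requests_per_access
average_qps = requests_per_day / SECONDS_PER_DAY   # "about 100,000"

peak_qps = 200_000            # the slides' estimate for uneven usage
seeks_per_query = 10
seeks_per_disk_per_second = 100
disks_per_server = 2
servers_per_rack = 40

seeks_per_second = peak_qps * seeks_per_query
disks = seeks_per_second // seeks_per_disk_per_second
servers = disks // disks_per_server
racks = servers // servers_per_rack
rack_row_metres = racks * RACK_WIDTH_M

print(requests_per_day, round(average_qps), disks, servers, racks, rack_row_metres)
```

The exact average works out to roughly 116K queries per second, which the slide rounds to "about 100,000" before doubling it for peak load.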