Monitoring Guide

Couchbase Professional Services

Proactive monitoring and alerting are essential to managing a healthy Couchbase environment. While the Couchbase Web Console provides detailed statistics and basic alerting functionality, it is not intended to be a real-time dashboard and shouldn't be used as the primary operational monitoring utility.

Integration with external monitoring systems is required for two primary purposes: proactive alerting and high-resolution trending. The external monitoring system should be capable of setting alert thresholds on a per-metric basis. Because the values of most metrics are workload- and environment-specific, you will need to establish a baseline for what is "normal" for your use cases. Trending the Couchbase metrics helps establish the baseline values, and alerts can be configured to fire when point-in-time values fall outside the "normal" range. Trended metrics also allow Couchbase administrators to observe resource consumption over time, which indicates when scaling events will become necessary.

This document describes how to poll the Couchbase REST API to obtain metrics for an external monitoring system, identifies which metrics are most important to monitor, and provides guidance on how to interpret those metrics.

Obtaining Couchbase Metrics

Couchbase exposes monitoring metrics via REST APIs with responses returned in JSON format. Two types of statistical APIs are available: Cluster Manager (port 8091/18091) stats and service-specific administrative stats.

Cluster Manager stats provide statistical sampling for a given service and/or entity at a particular interval. Each response from a /stats endpoint contains a timestamp property recording when each sample was taken; its entries correlate directly with the samples of each available stat.

Every Cluster Manager endpoint supports two optional query string parameters:

zoom

The zoom parameter determines the interval of samples to return in the response. It provides the following granularity:

zoom=minute (default) - Every second for the last minute (60 samples)
zoom=hour - Every four (4) seconds for the last hour (900 samples)
zoom=day - Every minute for the last day (1440 samples)
zoom=week - Every ten (10) minutes for the last week (in practice, eight (8) days; 1152 samples)
zoom=year - Every six (6) hours for the last year (1464 samples)

Due to sample frequency, the number of samples returned may vary by plus or minus one (±1).

haveTStamp

Requests statistics from this timestamp until the current time. The haveTStamp parameter is specified as UNIX epoch time in milliseconds.
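As a rough illustration of the two parameters above, the following Python sketch builds a stats URL with zoom and haveTStamp and works out how many samples cover a given time window. The base URL is a placeholder rather than a real endpoint path, and the function names (stats_url, samples_for_window, epoch_ms) are hypothetical helpers introduced only for this example.

```python
import time
from urllib.parse import urlencode

# Sampling interval in seconds for each zoom level, from the list above.
ZOOM_INTERVAL_SECONDS = {
    "minute": 1,
    "hour": 4,
    "day": 60,
    "week": 10 * 60,
    "year": 6 * 60 * 60,
}


def stats_url(base_url, zoom="minute", since_ms=None):
    """Append the optional zoom and haveTStamp query parameters to a stats URL."""
    if zoom not in ZOOM_INTERVAL_SECONDS:
        raise ValueError(f"unknown zoom level: {zoom}")
    params = {"zoom": zoom}
    if since_ms is not None:
        # haveTStamp is UNIX epoch time in milliseconds.
        params["haveTStamp"] = int(since_ms)
    return base_url + "?" + urlencode(params)


def samples_for_window(zoom, window_seconds):
    """Number of samples needed to cover window_seconds at this zoom (rounded up)."""
    interval = ZOOM_INTERVAL_SECONDS[zoom]
    return -(-window_seconds // interval)


def epoch_ms(seconds_ago=0, now=None):
    """UNIX epoch time in milliseconds for a moment seconds_ago before now."""
    now = time.time() if now is None else now
    return int((now - seconds_ago) * 1000)


if __name__ == "__main__":
    # Placeholder URL: substitute the stats endpoint being polled.
    print(stats_url("http://localhost:8091/path/to/stats", "hour", epoch_ms(300)))
    print(samples_for_window("hour", 300))
```

Because the sample count can be off by one, a collector would likely treat the result of samples_for_window as approximate rather than exact.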

To limit the results when using the zoom parameter, post-process the results. For example, if you need samples from the last five (5) minutes, set the zoom parameter to hour and retrieve the last 75 entries from the JSON list.

Polling the APIs

The REST APIs should be polled once per minute, either via a local agent or remotely using each node's IP address or hostname. Couchbase REST APIs must be accessed using administrative account credentials; a Read-Only Administrator is recommended for this purpose.

As most of the metrics provided by the REST API are per-node, it is necessary to query every node in the cluster.

Limit the number of requests per API when querying metrics, i.e. return all bucket metrics in one request rather than issuing separate requests per metric. Heavy use of the Couchbase REST APIs can increase CPU utilization on the cluster.

Couchbase Service Discovery

Some monitoring systems are capable of discovering new monitoring targets and automatically defining the monitoring profile to be applied. Couchbase supports this by exposing cluster membership, MDS service assignment, and service ports via the Data Service Node API.

Metrics and Services to Monitor

Each guide in the list below describes the monitoring metrics exposed by a Couchbase service, with a description of each metric, guidance on interpreting it, and possible operational responses. Alerts should be configured to be sent from the external monitoring system when metric values fall outside the expected range.

Each guide contains examples of how to call an endpoint and parse the results. These examples use jq, a lightweight command-line JSON parser. jq is not required and is used for example purposes only. It can be downloaded at https://stedolan.github.io/jq/download

Monitoring: Operating System
Monitoring: Nodes
Monitoring: Data Service
Monitoring: XDCR
Monitoring: Query Service
Monitoring: Index Service
Monitoring: FTS Service
Monitoring: Eventing Service
Monitoring: Logs

Reference Implementations

Couchbase provides a reference monitoring implementation to demonstrate interacting with the available REST APIs.

A sample Nagios plugin is available here.
A complete dockerized monitoring environment is available here.

Third Party Integrations

The following monitoring systems have plugins available for Couchbase. Note that these are third-party integrations and may be incomplete or may not follow the best practices set forth in this document.

Couchbase Node Exporter for Prometheus, see the Prometheus Integration Guide for details
AppDynamics
DataDog
Dynatrace
New Relic
SignalFx
Sensu
ManageEngine
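The polling guidance earlier in this guide (query every node, fetch all bucket metrics in one request, and trim zoomed results after the fact) could be sketched in Python roughly as follows. This is a minimal sketch, not a reference implementation. It assumes the bucket overview endpoint /pools/default/buckets with skipMap=true on port 8091, as used in the Data Service guide. The host names, the account name, and every function name here are hypothetical, and last_samples assumes samples are ordered oldest first.

```python
import base64
import json
import urllib.request


def fetch_json(url, user, password, timeout=10):
    """GET a REST endpoint with HTTP basic auth and decode the JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize_buckets(buckets):
    """Reduce a bucket overview response to a few basicStats fields per bucket."""
    summary = []
    for bucket in buckets:
        stats = bucket.get("basicStats", {})
        summary.append({
            "bucket": bucket.get("name"),
            "quota_percent_used": stats.get("quotaPercentUsed"),
            "ops_per_sec": stats.get("opsPerSec"),
            "mem_used_mb": stats.get("memUsed", 0) / 1024 / 1024,
        })
    return summary


def last_samples(samples, count):
    """Keep the newest count entries, assuming samples are ordered oldest first."""
    return samples[-count:] if count > 0 else []


def poll_nodes(hosts, user, password, fetch=fetch_json):
    """Issue one request per node; an unreachable node is recorded, not fatal."""
    results = {}
    for host in hosts:
        # One request returns every bucket, rather than one request per metric.
        url = f"http://{host}:8091/pools/default/buckets?skipMap=true"
        try:
            results[host] = summarize_buckets(fetch(url, user, password))
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            results[host] = {"error": str(error)}
    return results
```

A scheduler would call something like poll_nodes(["cb-node1.example.com", "cb-node2.example.com"], "ro_admin", "...") once per minute using a Read-Only Administrator account. The fetch parameter exists only so the network call can be swapped out in tests.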

Monitoring: Data Service

Buckets Overview

The buckets overview lists all available buckets, with high-level system information and resource utilization for each bucket in the cluster.

Documentation: t-buckets-summary.html
Insecure: http://localhost:8091/pools/default/buckets
Secure: le

The following example illustrates retrieving all of the buckets in a cluster and displaying basic stats about each bucket.

    # List every bucket (skipMap=true omits the vBucket map) and print basic stats for each.
    # --get makes curl append the --data value to the URL as a query string.
    curl \
      --user Administrator:password \
      --silent \
      --get \
      --data skipMap=true \
      http://localhost:8091/pools/default/buckets | \
      jq -r '.[] |
        "Bucket: " + .name + "\n" +
        "Quota Used: " + (.basicStats.quotaPercentUsed | tostring) + "%\n" +
        "Ops / Sec: " + (.basicStats.opsPerSec | tostring) + "\n" +
        "Disk Fetches: " + (.basicStats.diskFetches | tostring) + "\n" +
        "Item Count: " + (.basicStats.itemCount | tostring) + "\n" +
        "Disk Used: " + (.basicStats.diskUsed / 1024 / 1024 | tostring) + "MB\n" +
        "Data Used: " + (.basicStats.dataUsed / 1024 / 1024 | tostring) + "MB\n" +
        "Memory Used: " + (.basicStats.memUsed / 1024 / 1024 | tostring) + "MB\n"
      '

Note: The skipMap query string parameter is a boolean value that can be used to include or exclude the current vBucket distribution map for the buckets.

Individual Bucket-Level Stats

Bucket metrics provide detailed information about resource consumption, application workload, and internal operations at the bucket level. The following bucket stats are available via the cluster-wide or per-node endpoints listed below.

Documentation: t-bucket-stats.html
Insecure: T}/stats
Secure: KET}/stats

Available Stats

Stat name - Description
avg_active_timestamp_drift - Average drift (in seconds) per mutation on active vBuckets
avg_bg_wait_time - Average background fetch time in microseconds
avg_disk_commit_time - Average disk commit time in seconds, as taken from the disk update histogram of timings
avg_disk_update_time - Average disk update time in microseconds, as taken from the disk update histogram of timings
avg_replica_timestamp_drift - Average drift (in seconds) per mutation on replica vBuckets
bg_wait_count - Number of background fetch operations
bg_wait_total - Background fetch time in microseconds
bytes_read - Number of bytes per second sent into this bucket
bytes_written - Number of bytes per second sent from this bucket
cas_badval - Number of CAS operations per second using an incorrect CAS ID for data that this bucket contains
cas_hits - Number of CAS operations per second for data that this bucket contains
cas_misses - Number of CAS operations per second for data that this bucket does not contain
cmd_get - Number of get operations serviced by this bucket
cmd_lookup - Number of lookup sub-document operations serviced by this bucket
cmd_set - Number of set operations serviced by this bucket
couch_docs_actual_disk_size - The size of all data files for this bucket, including the data itself, metadata and temporary files
couch_docs_data_size - The size of active data in this bucket
couch_docs_disk_size - The size of active data in this bucket on disk
couch_docs_fragmentation - How much fragmented data there is to be compacted compared to real data for the data files in this bucket
couch_spatial_data_size - The size of all active items in all the spatial indexes for this bucket on disk

couch_spatial_disk_size - The size of all active items in all the spatial indexes for this bucket on disk
couch_spatial_ops - All the spatial index reads
couch_total_disk_size - The total size on disk of all data and view files for this bucket
couch_views_actual_disk_size - The size of all active items in all the indexes for this bucket on disk
couch_views_data_size - The size of active data for all the view indexes in this bucket
couch_views_disk_size - The size of active data for all the view indexes in this bucket on disk
couch_views_fragmentation - How much fragmented data there is to be compacted compared to real data for the view index files in this bucket
couch_views_ops - All the view reads for all design documents, including scatter gather
curr_connections - Number of connections to this server, including connections from external client SDKs, proxies, DCP requests and internal statistic gathering
curr_items - Number of unique items in this bucket (active items only, not replicas)
curr_items_tot - Total number of items in this bucket (including replicas)
decr_hits - Number of decrement operations per second for data that this bucket contains
decr_misses - Number of decrement operations per second for data that this bucket does not contain
delete_hits - Number of delete operations per second for this bucket
delete_misses - Number of delete operations per second for data that this bucket does not contain
disk_commit_count - The number of disk commits
disk_commit_total - The total time spent committing to disk
disk_update_count - The total number of disk updates
disk_update_total - The total time spent updating disk
disk_write_queue - Number of items waiting to be written to disk in this bucket
ep_active_ahead_exceptions - Total number of ahead exceptions for all active vBuckets
ep_active_hlc_drift - The sum of total absolute drift for the node's active vBuckets
ep_active_hlc_drift_count - The sum of total absolute drift count for the node's active vBuckets
ep_bg_fetched - Number of reads per second from disk for this bucket
ep_cache_miss_rate - Percentage of reads per second to this bucket from disk as opposed to RAM

ep_clock_cas_drift_threshold_exceeded
ep_data_read_failed - Number of disk read failures
ep_data_write_failed - Number of disk write failures
ep_dcp_2i_backoff - Number of backoffs for index DCP connections
ep_dcp_2i_count - Number of internal secondary index DCP connections in this bucket
ep_dcp_2i_items_remaining - Number of secondary index items remaining to be sent to consumer in this bucket
ep_dcp_2i_items_sent - Number of secondary index items per second being sent for a producer for this bucket
ep_dcp_2i_producer_count - Number of secondary index senders for this bucket
ep_dcp_2i_total_backlog_size - Total size in bytes of the DCP backlog for secondary indexes
ep_dcp_2i_total_bytes - Number of bytes per second being sent for secondary index DCP connections
ep_dcp_cbas_backoff - Number of backoffs for Analytics DCP connections
ep_dcp_cbas_count - Number of internal Analytics DCP connections in this bucket
ep_dcp_cbas_items_remaining - Number of Analytics items remaining to be sent to consumer in this bucket
ep_dcp_cbas_items_sent - Number of Analytics items per second being sent for a producer for this bucket
ep_dcp_cbas_producer_count - Number of Analytics senders for this bucket
ep_dcp_cbas_total_backlog_size - Total size in bytes of the DCP backlog for Analytics
ep_dcp_cbas_total_bytes - Number of bytes per second being sent for Analytics DCP connections
ep_dcp_eventing_backoff - Number of backoffs for Eventing DCP connections
ep_dcp_eventing_count - Number of internal Eventing DCP connections in this bucket
ep_dcp_eventing_items_remaining - Number of Eventing items remaining to be sent to consumer in this bucket
ep_dcp_eventing_items_sent - Number of Eventing items per second being sent for a producer for this bucket
ep_dcp_eventing_producer_count - Number of Eventing senders for this bucket
ep_dcp_eventing_total_backlog_size - Total size in bytes of the DCP backlog for Eventing
ep_dcp_eventing_total_bytes - Number of bytes per second being sent for Eventing DCP connections
ep_dcp_fts_backoff - Number of backoffs for FTS DCP connections
ep_dcp_fts_count - Number of internal FTS DCP connections in this bucket
ep_dcp_fts_items_remaining - Number of FTS items remaining to be sent to consumer in this bucket

ep_dcp_fts_items_sent - Number of FTS items per second being sent for a producer for this bucket
ep_dcp_fts_producer_count - Number of FTS senders for this bucket
ep_dcp_fts_total_backlog_size - Total size in bytes of the DCP backlog for FTS
ep_dcp_fts_total_bytes - Number of bytes per second being sent for FTS DCP connections
ep_dcp_other_backoff - Number of backoffs for other DCP connections
ep_dcp_other_count - Number of other DCP connections in this bucket
ep_dcp_other_items_remaining - Number of items remaining to be sent to consumer in this bucket
ep_dcp_other_items_sent - Number of items per second being sent for a producer for this bucket
ep_dcp_other_producer_count - Number of other senders for this bucket
ep_dcp_other_total_backlog_size - Total size in bytes of the DCP backlog for other DCP connections
ep_dcp_other_total_bytes - Number of bytes per second being sent for other DCP connections for this bucket
ep_dcp_replica_backoff - Number of backoffs for replication DCP connections
ep_dcp_replica_count - Number of internal replication DCP connections in this bucket
ep_dcp_replica_items_remaining - Number of replication items remaining to be sent to consumer in this bucket
ep_dcp_replica_items_sent - Number of replication items per second being sent for a producer for this bucket
ep_dcp_replica_producer_count - Number of replication senders for this bucket
ep_dcp_replica_total_backlog_size - Total size in bytes of the DCP backlog for replication
ep_dcp_replica_total_bytes - Number of bytes per second being sent for replication DCP connections
ep_dcp_views+indexes_backoff - Number of backoffs for view/index DCP connections
ep_dcp_views+indexes_count - Number of internal view/index DCP connections in this bucket
ep_dcp_views+indexes_items_remaining - Number of view/index items remaining to be sent to consumer in this bucket
ep_dcp_views+indexes_items_sent - Number of view/index items per second being sent for a producer for this bucket
ep_dcp_views+indexes_producer_count - Number of view/index senders for this bucket
ep_dcp_views+indexes_total_backlog_size - Total size in bytes of the DCP backlog for views/indexes
ep_dcp_views+indexes_total_bytes - Number of bytes per second being sent for view/index DCP connections
ep_dcp_views_backoff - Number of backoffs for view DCP connections
ep_dcp_views_count - Number of internal view DCP connections in this bucket

ep_dcp_views_items_remaining - Number of view items remaining to be sent to consumer in this bucket
ep_dcp_views_items_sent - Number of view items per second being sent for a producer for this bucket
ep_dcp_views_producer_count - Number of view senders for this bucket
ep_dcp_views_total_backlog_size - Total size in bytes of the DCP backlog for views
ep_dcp_views_total_bytes - Number of bytes per second being sent for view DCP connections
ep_dcp_xdcr_backoff - Number of backoffs for XDCR DCP connections
ep_dcp_xdcr_count - Number of internal XDCR DCP connections in this bucket
ep_dcp_xdcr_items_remaining - Number of XDCR items remaining to be sent to consumer in this bucket
ep_dcp_xdcr_items_sent - Number of XDCR items per second being sent for a producer for this bucket
ep_dcp_xdcr_producer_count - Number of XDCR senders for this bucket
ep_dcp_xdcr_total_backlog_size - Total size in bytes of the DCP backlog for XDCR
ep_dcp_xdcr_total_bytes - Number of bytes per second being sent for XDCR DCP connections
ep_diskqueue_drain - Total number of items per second being written to disk in this bucket
ep_diskqueue_fill - Total number of items per second being put on the disk queue in this bucket
ep_diskqueue_items - Total number of items waiting to be written to disk in this bucket
ep_flusher_todo - Number of items currently being written
ep_item_commit_failed - Number of times a transaction failed to commit due to storage errors
ep_kv_size - Total amount of user data cached in RAM in this bucket
ep_max_size - The maximum amount of memory this bucket can use
ep_mem_high_wat - High water mark for auto-evictions
ep_mem_low_wat - Low water mark for auto-evictions
ep_meta_data_memory - Total amount of item metadata consuming RAM in this bucket
ep_num_non_resident - The number of non-resident items
ep_num_ops_del_meta - Number of delete operations per second for this bucket as the target for XDCR
ep_num_ops_del_ret_meta - Number of delRetMeta operations
ep_num_ops_get_meta - Number of metadata read operations per second for this bucket as the target for XDCR
ep_num_ops_set_meta - Number of set operations per second for this bucket as the target for XDCR

ep_num_ops_set_ret_meta
ep_num_value_ejects - Total number of items per second being ejected to disk in this bucket
ep_oom_errors - Number of times unrecoverable OOMs happened while processing operations
ep_ops_create - Total number of new items being inserted into this bucket
ep_ops_update - Number of items updated on disk per second for this bucket
ep_overhead - Extra memory used by transient data like persistence queues, replication queues, checkpoints, etc.
ep_queue_size - Number of items queued for storage
ep_replica_ahead_exceptions - Total number of ahead exceptions for all replica vBuckets
ep_replica_hlc_drift - The sum of total absolute drift for the node's replica vBuckets
ep_replica_hlc_drift_count - The sum of total absolute drift count for the node's replica vBuckets
ep_resident_items_rate - Percentage of all items cached in RAM in this bucket
ep_tmp_oom_errors - Number of back-offs sent per second to client SDKs due to "out of memory" situations from this bucket
ep_vb_total - Total number of vBuckets for this bucket
evictions - Number of items per second evicted from this bucket
get_hits - Number of get operations per second for data that this bucket contains
get_misses - Number of get operations per second for data that this bucket does not contain
hibernated_requests - Number of hibernated requests
hibernated_waked - Number of times hibernated requests were woken
hit_ratio - Percentage of get requests served with data from this bucket
incr_hits - Number of increment operations per second for data that this bucket contains
incr_misses - Number of increment operations per second for data that this bucket does not contain
mem_used - Amount of memory used
misses - Total number of operations per second for data that this bucket does not contain
ops - Total number of operations per second (including XDCR) to this bucket
rest_requests
swap_total

swap_used
vb_active_eject - Number of items per second being ejected to disk from "active" vBuckets
vb_active_itm_memory - Amount of active user data cached in RAM in this bucket
vb_active_meta_data_memory - Amount of active item metadata consuming RAM in this bucket
vb_active_num - Number of vBuckets in the "active" state for this bucket
vb_active_num_non_resident - Number of non-resident items
vb_active_ops_create - New items per second b
