Tracing: Fast & Slow - USENIX

Transcription

Tracing: Fast & Slow
Digging into and improving your web service’s performance
Offline viewers: full write up @ rogue.ly/tracing
Lynn Root, SRE, @roguelynn

whoami

agenda—

Overview and problem space
Approaches to tracing
Tracing at scale
Diagnosing performance issues
Tracing services & systems

Tracing Overview—

machine-centric—
Focus on a single machine
No view into a service’s dependencies

workflow-centric—
Understand causal relationships
End-to-end tracing

why trace?—

Performance analysis
Anomaly detection
Profiling
Resource attribution
Workload modeling

Tracing Approaches—

manual

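    import uuid
    from functools import wraps

    from flask import Flask, request

    app = Flask(__name__)

    def request_id(f):
        @wraps(f)
        def decorated(*args, **kwargs):
            # Reuse the caller's ID if present, otherwise mint a new one
            req_id = request.headers.get("X-Request-Id", str(uuid.uuid4()))
            return f(req_id, *args, **kwargs)
        return decorated

    @app.route("/")
    @request_id
    def list_services(req_id):
        # log w/ ID for wherever you want to trace
        # app logic
        ...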

    upstream appserver {
        server 10.0.0.0:80;
    }

    server {
        listen 80;
        # Return to client
        add_header X-Request-ID $request_id;
        location / {
            proxy_pass http://appserver;
            # Pass to app server
            proxy_set_header X-Request-ID $request_id;
        }
    }

    log_format trace '$remote_addr $request_id';

    server {
        listen 80;
        add_header X-Request-ID $request_id;
        location / {
            proxy_pass http://app_server;
            proxy_set_header X-Request-ID $request_id;
            # Log request_id
            access_log /var/log/nginx/access_trace.log trace;
        }
    }
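One way to act on the "log w/ ID" comment in the Flask decorator is to stamp every log line with the request ID, so application logs can be joined with the nginx trace log. The following is a minimal sketch using the standard library's logging.LoggerAdapter. The logger name trace_demo, the helper names, and the log format are assumptions made for illustration; they are not from the talk.

```python
import io
import logging
import uuid


def build_logger(stream):
    """Create one logger whose format expects a request_id field."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
    logger = logging.getLogger("trace_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def with_request_id(logger, req_id):
    """Wrap the shared logger so each record carries this request's ID."""
    return logging.LoggerAdapter(logger, {"request_id": req_id})


# Demo: write to an in-memory buffer instead of a real log file.
buffer = io.StringIO()
base_logger = build_logger(buffer)
req_id = str(uuid.uuid4())
log = with_request_id(base_logger, req_id)
log.info("listing services")
```

Inside a view such as list_services(req_id), the adapter would be created once per request and used wherever a trace point is wanted.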

blackbox

metadata propagation

Tracing at Scale—

four things to think about—

What relationships to track
How to track them
Which sampling approach to take
How to visualize

what to track

Submitter Flow PoV (figure: Request One, Request Two)

Trigger Flow PoV (figure: Request One, Request Two)

how to track

request ID
logical clock
previous trace points
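To make the three kinds of metadata concrete, here is a minimal sketch of a context object that carries a request ID, a simple counter standing in for a logical clock, and the list of previous trace points, and that round-trips through HTTP-style headers. The class name TraceContext and the header names X-Trace-Clock and X-Trace-Previous are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TraceContext:
    request_id: str
    clock: int = 0  # simple counter used as a logical clock
    previous: List[str] = field(default_factory=list)

    def record(self, trace_point: str) -> "TraceContext":
        """Return the context to pass on after visiting trace_point."""
        return TraceContext(
            request_id=self.request_id,
            clock=self.clock + 1,
            previous=self.previous + [trace_point],
        )

    def to_headers(self) -> Dict[str, str]:
        """Serialize the context; header names are assumptions."""
        return {
            "X-Request-Id": self.request_id,
            "X-Trace-Clock": str(self.clock),
            "X-Trace-Previous": ",".join(self.previous),
        }

    @classmethod
    def from_headers(cls, headers: Dict[str, str]) -> "TraceContext":
        """Rebuild the context on the receiving service."""
        raw = headers.get("X-Trace-Previous", "")
        return cls(
            request_id=headers["X-Request-Id"],
            clock=int(headers.get("X-Trace-Clock", "0")),
            previous=[p for p in raw.split(",") if p],
        )


ctx = TraceContext("de4db33f").record("frontend").record("auth")
restored = TraceContext.from_headers(ctx.to_headers())
```

In this sketch the header carrying previous trace points grows with every hop, while a bare request ID stays constant in size; that is one way to picture the payload-size tradeoff.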

tradeoffs—

Payload size
Explicit relationships
Collate despite lost data
Immediate availability

how to sample

sampling approaches—
Head-based
Tail-based
Unitary
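As an illustration of head-based sampling, where the keep-or-drop decision is made once at the start of a request and then travels with it, the sketch below derives the decision from a hash of the trace ID and passes it downstream in a header. The function names and the X-Sampled header are assumptions for this example, not an API from any tracing library.

```python
import hashlib
from typing import Dict


def head_sample(trace_id: str, rate_percent: int) -> bool:
    """Decide once, at the entry point, whether to trace this request."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0-99
    return bucket < rate_percent


def start_request(trace_id: str, rate_percent: int) -> Dict[str, str]:
    """Build the headers the entry service sends downstream."""
    sampled = head_sample(trace_id, rate_percent)
    return {
        "X-Request-Id": trace_id,
        "X-Sampled": "1" if sampled else "0",  # hypothetical header name
    }


def should_record(headers: Dict[str, str]) -> bool:
    """Downstream services honor the upstream decision, never re-decide."""
    return headers.get("X-Sampled") == "1"


headers = start_request("de4db33f", rate_percent=100)
```

Because every service reads the same flag, a sampled request is recorded end to end rather than in fragments.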

what to visualize

gantt chart—
Trace ID: de4db33f
GET /home
GET /feed
GET /profile
GET /messages
GET /friends

request flow graph—
(figure: call and reply nodes for services A through E, with edges labeled by latency in µs)

context calling tree—
(figure: calling tree over services A through E)
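For a rough feel of the Gantt chart view, the sketch below renders spans as text bars on a shared time axis. The span names echo the slide, but the start times and durations are invented for illustration, and render_gantt is a hypothetical helper rather than part of any tracing tool.

```python
from typing import List, Tuple

# (name, start, duration) in the same time unit; the values are made up.
Span = Tuple[str, int, int]


def render_gantt(spans: List[Span], width: int = 20) -> List[str]:
    """Render each span as a bar scaled to `width` characters."""
    end = max(start + duration for _, start, duration in spans)
    label_width = max(len(name) for name, _, _ in spans)
    lines = []
    for name, start, duration in spans:
        offset = start * width // end  # integer math avoids float drift
        length = max(1, duration * width // end)
        bar = " " * offset + "#" * length
        lines.append(f"{name.ljust(label_width)} |{bar}")
    return lines


spans = [
    ("GET /home", 0, 100),
    ("GET /feed", 10, 40),
    ("GET /profile", 50, 50),
]
chart = render_gantt(spans)
for line in chart:
    print(line)
```

Bars that sit side by side on the time axis ran concurrently; bars that follow one another ran in sequence.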

keep in mind—
What do I want to know?
How much can I instrument?
How much do I want to know?

suggested for performance—

Trigger PoV
Head-based sampling
Flow graphs

Diagnosing—

questions to ask—
Batch requests?
Any parallelization opportunities?
Useful to add/fix caching?
Frontend resource loading?
Chunked or JIT responses?
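The first two questions can be partly automated once span data is available. The sketch below flags repeated identical child calls (possible batching candidates) and sibling calls that never overlap in time (possible parallelization candidates). The Span shape, the threshold, and the sample data are all assumptions for illustration. Non-overlapping siblings are only candidates, since one call may depend on the other's result.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class Span(NamedTuple):
    name: str
    parent: str
    start: int
    end: int


def batch_candidates(spans: List[Span], threshold: int = 3) -> Dict[Tuple[str, str], int]:
    """Count repeated (parent, child) calls; many repeats hint at batching."""
    counts = Counter((s.parent, s.name) for s in spans)
    return {key: n for key, n in counts.items() if n >= threshold}


def sequential_siblings(spans: List[Span]) -> List[Tuple[str, str]]:
    """Find adjacent sibling spans where one starts only after the other ends."""
    by_parent: Dict[str, List[Span]] = {}
    for span in spans:
        by_parent.setdefault(span.parent, []).append(span)
    pairs = []
    for siblings in by_parent.values():
        ordered = sorted(siblings, key=lambda s: s.start)
        for first, second in zip(ordered, ordered[1:]):
            if second.start >= first.end:
                pairs.append((first.name, second.name))
    return pairs


# Invented sample trace for demonstration only.
trace = [
    Span("GET /feed", "GET /home", 0, 40),
    Span("GET /profile", "GET /home", 40, 90),
    Span("GET /friends", "GET /home", 45, 60),
    Span("GET /user", "GET /friends", 46, 48),
    Span("GET /user", "GET /friends", 49, 51),
    Span("GET /user", "GET /friends", 52, 54),
]
batches = batch_candidates(trace)
pairs = sequential_siblings(trace)
```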

Frameworks, Systems & Services—

Frameworks

OpenTracing

OpenCensus

self-hosted

Zipkin (Twitter)—

Out-of-band reporting to remote collector
Report via HTTP, Kafka, and Scribe
Varying propagation support across different languages
Limited web UI

    import requests
    from py_zipkin.zipkin import zipkin_span

    def http_transport(span_data):
        requests.post(
            "…/v1/spans",
            data=span_data,
            headers={"Content-type": "application/x-thrift"},
        )

    @app.route("/")
    def index():
        with zipkin_span(
            service_name="myawesomeapp",
            span_name="index",
            # need to write own transport func
            transport_handler=http_transport,
            port=app_port,
            # 0-100 percent
            sample_rate=100,
        ):
            # do something
            ...

Jaeger (Uber)—

Local daemon to collect & report
Storage support for only Cassandra
Limited web UI
Varying language support in client libraries

    from jaeger_client import Config

    config = Config(...)
    tracer = config.initialize_tracer()

    @app.route("/")
    def index():
        with tracer.start_span("ASpan") as span:
            span.log_kv({"event": "the answer to", "life": 42})
            # child_of links the new span to its parent
            with tracer.start_span("ChildSpan", child_of=span) as cspan:
                cspan.log_kv({"event": "don't forget", "towel": True})

honorable mentions— AppDash

services

Stackdriver Trace (Google)—

OpenCensus with gRPC support
Forward traces from Zipkin
Storage limitation of 30 days
Recreate graphs per time period

X-Ray (AWS)—

Supports OpenCensus, not OpenTracing
Growing SDK support across languages
Lots of flexibility with configuring sampling
Send metrics from outside AWS environment
Flow graphs with latency, response %, sample %

honorable mentions—
LightStep
SignalFX
New Relic
Datadog
Azure Monitor

TL;DR—

You need this – but it’s hard
Support is improving
One size fits all approaches
It’s in the open

Thanks!—
Write up: rogue.ly/tracing
Visit our booth – we’re hiring!
Lynn Root, SRE, @roguelynn
