Transcription
Microsoft’s Demon: Datacenter-Scale Distributed Ethernet Monitoring Appliance
Rich Groves, Principal Architect, Microsoft GNS
Bill Benetti, Senior Service Engineer, Microsoft MSIT
Before We Begin
- We are network engineers.
- This isn't a Microsoft product.
- We are here to share methods and knowledge.
- Hopefully we can all foster evolution in the industry.
Microsoft Is a Great Place to Work!
- We need experts like you.
- We have larger-than-life problems to solve.
- Networking is important and well funded.
- Washington is beautiful.
The Microsoft Demon Technical Team
Rich Groves, Bill Benetti, Dylan Greene, Justin Scott, Ken Hollis, Tanya Ollick, Eric Chou
About Rich Groves
- Microsoft Global Network Services: NOUS – Network of Unusual Scale
- Microsoft IT: EOUS – Enterprise of Unusual Scale
- Time Warner Cable
- Endace: made cards, systems, and software for "snifferguys"
- AOL: "snifferguy"
- MCI
The Traditional Network
- hierarchical tree structure optimized for north/south traffic
- firewalls, load balancers, and WAN optimizers
- not much cross-datacenter traffic
- lots of traffic localized in the top of rack
Analyzing the Traditional Network
- insert taps within the aggregation
- port mirror at the top of rack
- capture packets at the load balancer
- well understood, but costly at scale
The Cloud Datacenter
- tuned for massive cross-datacenter traffic
- appliances removed in favor of software equivalents
Can You Tap This Cost Effectively?
- 8, 16, and 32x10G uplinks
- Tapping 32x10G ports requires 64 ports to aggregate, since each tapped link yields a monitor output per direction. (Who can afford buying current systems for that?)
- ERSPAN could be used, but it impacts production traffic.
- Even port mirrors are a difficult task at this scale.
Many Attempts at Making This Work
- Capturenet: complex to manage; purpose-built aggregation devices were far too expensive at scale; resulted in lots of gear gathering dust
- PMA, "Passive Measurement Architecture": failed due to a boring name; rebranded as PUMA by an outside marketing consultant (Rich's eldest daughter)
- PUMA: lower cost than Capturenet; extremely feature-rich; still too costly at scale
- Pretty Pink PUMA: an attempt at rebranding by Rich's youngest daughter; rejected by the team
Solution 1: Off the Shelf
- used 100% purpose-built aggregation gear
- supported many higher-end features (timestamping, slicing, etc.)
- price per port is far too high
- not dense enough (doesn't even terminate one tap strip)
- high cost made tool purchases impossible; no point without tools
Solution 2: Cascading Port Mirrors
How:
- mirror all attached monitor ports to the next layer
- pre-filter by only mirroring interfaces you wish to see
The upside:
- cost effective
- uses familiar equipment
- can be done using standard CLI commands in a config
The downside:
- control traffic is removed by some switches
- assumes you know where to find the data
- lack of granular control
- uses different pathways in the switch
- quantity of port mirror targets is limited
(Diagram: cascaded switches and a host; one switch "heard packets 1, 2, 3, 4" but "isn't allowed to tell anyone about packet 2", so the next layer only hears packets 1, 3, 4.)
Solution 3: Making a Big Hub
How:
- turn off learning
- flood on all ports
- unique outer VLAN tag per port using QinQ
- pre-filter based on ingress port through VLAN pruning
Upside:
- cost effective
Downside:
- control traffic is still intercepted by the switch
- performance is non-deterministic
- some switches need SDK scripts to make this work
- data quality suffers
The End
Well, not really, but it felt like it.
Core Aggregator Functions
Let's solve 80 percent of the problem. Do-able in merchant silicon switch chips:
- terminates links
- 5-tuple pre-filters
- duplication
- forwarding without modification
- low latency
- zero loss
Costly due to lack of demand outside of the aggregator space:
- time stamps
- frame slicing
Reversing the Aggregator
The basic logical components:
- terminate links of all types, and a lot of them
- low latency and lossless
- N:1 and 1:N duplication
- some level of filtering
- a control plane for driving the device
What Do These Platforms Have in Common?
Can you spot the commercial aggregator?
Introducing Merchant Silicon Switches
Advantages of merchant silicon chips:
- more ports per chip (currently 64x10G)
- much lower latency (due to fewer chip crossings)
- consume less power
- more reliable than traditional ASIC-based multi-chip designs
Merchant Silicon Evolution
(Chart: 10G port density per chip by year — 2007, 2011, 2013, 2015 — alongside shrinking process nodes down to 28nm.)
Interface speed evolution: 40G, 100G, 400G(?), 1Tbps.
This is a single chip. Amazingly dense switches are created using multiple chips.
Reversing the Aggregator (recap)
The basic logical components:
- terminate links of all types
- low latency and lossless
- N:1 and 1:N duplication
- some level of filtering
- a control plane for driving the device
Port-to-Port Characteristics of Merchant Silicon
- port-to-port latency (within the chip) is deterministic
- loss within the aggregator isn't acceptable
- such deterministic behavior makes a single-chip system ideal as an aggregator
Reversing the Aggregator (recap)
The basic logical components:
- terminate links of all types
- low latency and lossless
- N:1 and 1:N duplication
- some level of filtering
- a control plane for driving the device
Duplication and Filtering
Duplication:
- line-rate duplication in hardware to all ports
- facilitates 1:N, N:1, and N:N duplication and aggregation
Filtering:
- line-rate L2/L3/L4 filtering on all ports
- thousands of filters, depending on the chip type
Reversing the Aggregator (recap)
The basic logical components:
- terminate links of all types
- low latency and lossless
- N:1 and 1:N duplication
- some level of filtering
- a control plane for driving the device
OpenFlow as a Control Plane
What is OpenFlow?
- a remote API for control
- allows an external controller to manage L2/L3 forwarding and some header manipulation
- runs as an agent on the switch
- developed at Stanford, 2007–2010
- now managed by the Open Networking Foundation
Common Network Device
(Diagram: the supervisor (control plane) drives the data plane over a proprietary control bus.)
Controller Programs the Switch's "Flow Tables"
(Diagram: an OpenFlow controller connects over the control bus to the supervisor, which runs an OpenFlow agent, on each of two switches and installs prioritized flow-table entries.)

Switch 1 flow table:
Priority | Match            | Action List
400      | *                | DROP
300      | TCP.dst 80       | Fwd: port 5
100      | IP.dst 192.8/16  | Queue: 2

Switch 2 flow table:
Priority | Match            | Action List
500      | TCP.dst 22       | TTL--, Fwd: port 3
200      | IP.dst 128.8/16  | Queue: 4
100      | *                | DROP
Proactive Flow Entry Creation
(Diagram: the controller pre-installs entries along the path — "match xyz, rewrite VLAN, forward to port 15" on one switch and "match xyz, rewrite VLAN, forward to port 42" on the next — steering traffic for host 10.0.1.2.)
OpenFlow 1.0 Match Primitives (Demon-Related)
Match types: ingress port, src/dst MAC, src/dst IP, ethertype, protocol, src/dst port, TOS, VLAN ID, VLAN priority
Action types: mod VLAN ID, drop, output, controller
Flow Table Entries: "if, then, else"
- if "ingress port 24 and ethertype 2048 (IP) and dest IP 10.1.1.1", then "dest mac 00:11:22:33:44:55 and output port 1"
- if "ethertype 2054 (ARP) and src IP 10.1.1.1", then "output port 10"
- if "ethertype 2048 (IP) and protocol 1 (ICMP)", then "controller"
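The if/then/else semantics above boil down to a priority-ordered lookup: the highest-priority entry whose fields all match wins, and a table miss drops. A minimal sketch in plain Python (field names and entries are illustrative, not Demon's actual data model):

```python
# Sketch of OpenFlow-1.0-style flow table matching. An entry matches a
# packet if every field it specifies equals the packet's value; entries
# are tried in descending priority order.

def match(entry, packet):
    return all(packet.get(field) == value
               for field, value in entry["match"].items())

def lookup(flow_table, packet):
    for entry in sorted(flow_table, key=lambda e: -e["priority"]):
        if match(entry, packet):
            return entry["actions"]
    return ["drop"]  # table-miss default

flow_table = [
    {"priority": 300,
     "match": {"in_port": 24, "ethertype": 2048, "ip_dst": "10.1.1.1"},
     "actions": ["set_eth_dst:00:11:22:33:44:55", "output:1"]},
    {"priority": 200,
     "match": {"ethertype": 2048, "ip_proto": 1},  # ICMP to controller
     "actions": ["controller"]},
]

pkt = {"in_port": 24, "ethertype": 2048, "ip_dst": "10.1.1.1"}
print(lookup(flow_table, pkt))  # highest-priority matching entry wins
```

Note that an entry with fewer match fields is simply more permissive; priority, not specificity, decides ties.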
OpenFlow 1.0 Limitations
- lack of QinQ support
- lack of basic IPv6 support; no deep IPv6 match support
- can redirect based on protocol number (ethertype)
- no layer 4 support beyond port number
- cannot match on TCP flags or payloads
Multi-Tenant Distributed Ethernet Monitoring Appliance
Enabling packet capture and analysis at datacenter scale:
- 4.8 Tbps of filtering capacity to find the needle in the haystack
- more than 20x cheaper than "off the shelf" solutions
- self-serve using a RESTful API or an industry-standard CLI
- save valuable router resources using the Demon packet sampling offload
- filter and deliver to any "Demonized" datacenter event, even to hop boxes and Azure
- leverages OpenFlow for modular scale and granular control
- based on low-cost merchant silicon
(Diagram: monitor ports feed filter switches, which feed a mux layer, service nodes, and a delivery layer toward tooling.)
Filter Layer
- terminates inputs from 1, 10, and 40G monitor ports
- filter switches have 60 filter interfaces facing monitor ports
- initially drops all inbound traffic
- filter interfaces allow only inbound traffic through the use of high-priority flow entries
- 4x10G infrastructure interfaces are used as egress toward the mux
- approximately 1000 L3/L4 flows per switch
- performs longest-match filters
- high-rate sFlow sampling with no "production impact"
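The filter switch's steady state described above — drop everything by default, then forward only what a policy selects out one of the mux-facing uplinks — can be sketched as flow-entry generation. Port numbers, priorities, and field names here are illustrative assumptions, not Demon's real numbering:

```python
# Sketch: flow entries on a Demon-style filter switch.
# Seed state is a single lowest-priority wildcard drop; each user
# policy then installs a higher-priority match that egresses toward
# the mux over one of the 4x10G infrastructure uplinks.

INFRA_PORTS = [61, 62, 63, 64]  # assumed mux-facing uplink numbering

def seed_entries():
    # Start from "drop everything".
    return [{"priority": 0, "match": {}, "actions": ["drop"]}]

def add_policy(entries, in_port, match, uplink):
    # A policy entry: match selected traffic on one monitor-facing
    # port and forward it out a mux-facing uplink.
    entry = {"priority": 100,
             "match": dict(match, in_port=in_port),
             "actions": ["output:%d" % uplink]}
    entries.append(entry)
    return entry

entries = seed_entries()
add_policy(entries, in_port=1, match={"tcp_dst": 80},
           uplink=INFRA_PORTS[0])
print(entries)
```

The "filter interfaces allow only inbound traffic" property falls out naturally: no entry ever lists a monitor-facing port as an output.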
Mux Layer
- terminates the 4x10G infrastructure ports from each filter switch
- used to aggregate all filter switches
- performs shortest-match filters
- introduces pre-service and post-service ports
- provides both service node and delivery connectivity
- directs traffic to either service node or delivery interfaces
- duplicates flows downstream if needed
Service Nodes
- connected to the mux switch through pre-service and post-service ports
- leverage higher-end features on a smaller set of ports
- perform optional functions that OpenFlow and merchant silicon cannot currently provide
Possible uses: deeper filtering, time stamping, frame slicing, encapsulation removal for tunnel inspection, configurable logging, higher-resolution sampling, encryption removal, payload removal for compliance, encapsulation of output for location independence
Delivery Layer
- introduces delivery interfaces, which connect tools to Demon
- 1:N and N:1 duplication
- further filtering if needed
- data delivery to tooling
- can optionally fold into the mux switch, depending on tool quantity and location
Advanced Controller Actions
- receives packet and octet counters for all flows created; used as a rough trigger for automated packet captures
- duplicates LLDP, CDP, and ARP traffic to the controller at low priority to collect topology information
- sources "Tracer" documentation packets to describe the trace
(Diagram: the Demon application drives the controller API via CLI and API.)
Location-Aware Demon Policy
- policy created using CLI or API: "forward all traffic matching tcp dest 80 on port 1 of filter1 to port 1 of delivery1"
- the Demon app creates flows through the controller API
- the controller pushes a flow entry to filter1, the mux, and delivery to output using available downstream links
- traffic gets to the Wireshark system
(Diagram: filter1 drops by default until the high-priority flow is created; the user drives policy through CLI or API.)
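A self-serve policy like the one above would be expressed as a small document pushed over the RESTful API. The endpoint, field names, and schema below are hypothetical — the talk does not document the actual API — but the shape of the request is the point:

```python
import json

# Hypothetical Demon policy: forward TCP/80 seen on port 1 of filter1
# to port 1 of delivery1. The Demon app translates this into OpenFlow
# entries on filter1, the mux, and the delivery switch.
policy = {
    "name": "http-to-wireshark",
    "match": {"ip_proto": "tcp", "tcp_dst": 80},
    "ingress": {"switch": "filter1", "port": 1},
    "delivery": {"switch": "delivery1", "port": 1},
}

body = json.dumps(policy)
# A real client would POST this to the (assumed) policy endpoint, e.g.:
#   requests.post("https://demon.example/api/policies", data=body)
print(body)
```

Because the policy is just data, the same request can come from an engineer's script, a CLI wrapper, or another system entirely — which is what the advanced use cases later rely on.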
Location-Independent Demon Policy
- policy created using CLI or API: if TCP dst port 80 on any ingress port on any filter switch, then add location metadata and deliver to delivery1
- drops by default on all ingress interfaces until the high-priority flow is created on all filter switches
- the ingress VLAN tag is rewritten to add substrate locale info and uniqueness to duplicate packets
- traffic gets to Wireshark
Inserting a Service Node
- policy created using CLI or API: "forward all traffic matching tcp dest 80 on port 1 of filter1 to port 1 of delivery1, and use service node 'timestamping'"
- flows are created per policy on the filter and mux to use the service node as egress
- the mux uses the service node as egress; a timestamp is added to the frame and sent back toward the mux
- the mux sends service-node-sourced traffic to the delivery switch
- traffic gets to Wireshark
Advanced Use Case 1: Closed-Loop Data Collection
- sFlow samples sourced from the filter layer are exported to an sFlow collector
- problem subnets are observed through behavioral analysis
- the sFlow collector executes a Demon policy via the API to send all traffic from these subnets to a capture device
- tracer packets are fired toward the capture device describing the reason and ticket number of the event
- only meaningful captures are taken
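The closed loop above is, at its core, a collector-side trigger: when a subnet's sampled traffic crosses some behavioral threshold, the collector builds a capture policy and sends it to the Demon API. A sketch of that trigger logic, with the threshold, subnets, and policy shape all invented for illustration:

```python
# Sketch of the closed loop's trigger: sFlow-derived per-subnet byte
# rates are checked against a threshold; offenders yield a Demon
# capture policy (which a real collector would POST to the API).

THRESHOLD_BPS = 500_000_000  # illustrative trigger level

def policies_for(subnet_rates):
    triggered = []
    for subnet, bps in subnet_rates.items():
        if bps > THRESHOLD_BPS:
            triggered.append({
                "match": {"ip_src": subnet},
                "delivery": {"switch": "delivery1", "port": 1},
                "reason": "behavioral anomaly, bps=%d" % bps,
            })
    return triggered

rates = {"10.4.0.0/16": 720_000_000, "10.5.0.0/16": 12_000_000}
print(policies_for(rates))
```

The tracer packets mentioned on the slide would carry the same "reason" string in-band, so the capture file itself documents why it was taken.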
Advanced Use Case 2: Infrastructure Cries for Help
- A script is written for the load balancer describing a fail state, DDoS signature, or other performance degradation.
- The load balancer executes an HTTP sideband connection, creating a Demon policy based on the scripted condition.
- Tracer packets are fired at the capture server detailing the reason for this capture.
(Diagram: the load balancer drives the Demon API directly.)
Summary
- The use of single-chip merchant silicon switches and OpenFlow can be an adequate replacement for basic tap/mirror aggregation at a fraction of the cost.
- An open API allows for the use of different tools for different tasks.
- Use of an OpenFlow controller enables new functionality that the industry has never had in a commercial solution.
Thanks / Q&A
Thanks for attending!