RDMA In Data Centers: Looking Back And Looking Forward - SIGCOMM

Transcription

RDMA in Data Centers: Looking Back and Looking Forward
Chuanxiong Guo, Microsoft Research
ACM SIGCOMM APNet 2017, August 3, 2017

The Rise of Cloud Computing
40 Azure regions

Data Centers

Data center networks (DCN)
- Cloud-scale services: IaaS, PaaS, Search, BigData, Storage, Machine Learning, Deep Learning
- Services are latency sensitive, bandwidth hungry, or both
- Cloud-scale services need cloud-scale computing and communication infrastructure

Data center networks (DCN)
[Figure: Clos topology with Spine, Podset, Leaf, Pod, and ToR layers connecting the servers]
- Single ownership
- Large scale
- High bisection bandwidth
- Commodity Ethernet switches
- TCP/IP protocol suite

But TCP/IP is not doing well

TCP latency
- Long latency tail
- Pingmesh measurement results: 405us (P50), 716us (P90), 2132us (P99)

TCP processing overhead (40G)
[Figure: sender and receiver CPU utilization with 8 TCP connections over a 40G NIC]

An RDMA renaissance story

- Virtual Interface Architecture Spec 1.0: 1997
- InfiniBand Architecture Spec 1.0: 2000, 1.1: 2002, 1.2: 2004, 1.3: 2015
- RoCE: 2010
- RoCEv2: 2014

RDMA
- Remote Direct Memory Access (RDMA): a method of accessing memory on a remote system without interrupting the processing of the CPU(s) on that system
- RDMA offloads packet processing protocols to the NIC
- RDMA in Ethernet-based data centers
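
As a rough illustration of the one-sided model (my sketch, not from the talk), the verbs snippet below posts an RDMA WRITE that places data directly into a peer's memory without involving the remote CPU. The queue pair `qp`, the registered local buffer `mr`, and the peer's `remote_addr`/`rkey` are assumed to have been set up and exchanged beforehand.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch: post a one-sided RDMA WRITE on an already-connected QP.
 * Assumes qp, mr (local registered buffer), remote_addr and rkey (exchanged
 * out of band) are already set up. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           size_t len, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,  /* local source buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: no receiver CPU involvement */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a local completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* target address in peer memory */
    wr.wr.rdma.rkey        = rkey;                /* remote key for that memory region */

    return ibv_post_send(qp, &wr, &bad_wr);       /* NIC performs the transfer via DMA */
}
```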

RoCEv2: RDMA over Commodity Ethernet
[Figure: user/kernel/hardware stack comparison; the RDMA app talks to RDMA verbs, and the RDMA transport, IP, and Ethernet layers run in the NIC with DMA, while TCP/IP runs in the kernel]
- RoCEv2 for Ethernet-based data centers
- RoCEv2 encapsulates packets in UDP
- OS kernel is not in the data path
- NIC handles network protocol processing and message DMA
- Requires a lossless network
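
For orientation, the on-wire layering behind "encapsulates packets in UDP" can be written out as plain structs. This is my sketch with field widths simplified, not anything from the slides; the one concrete constant is the IANA-assigned RoCEv2 UDP destination port 4791.

```c
#include <stdint.h>

/* Sketch of RoCEv2 encapsulation (field widths simplified for illustration):
 *   Ethernet | IP | UDP (dst port 4791) | IB BTH | IB payload | ICRC
 * The OS kernel never touches this path; the NIC builds and parses it. */

#define ROCEV2_UDP_DPORT 4791  /* IANA-assigned UDP port for RoCEv2 */

struct ib_bth {                /* InfiniBand Base Transport Header, 12 bytes */
    uint8_t  opcode;           /* e.g. RDMA WRITE, SEND, ACK */
    uint8_t  flags_tver;       /* solicited event, pad count, header version */
    uint16_t pkey;             /* partition key */
    uint32_t dest_qp;          /* 8 reserved bits + 24-bit destination QP number */
    uint32_t ack_psn;          /* ack-request bit + 24-bit packet sequence number */
};

struct rocev2_payload {        /* what the UDP datagram carries */
    struct ib_bth bth;
    /* extended transport headers (e.g. RETH for RDMA WRITE), then payload, then ICRC */
};
```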

RDMA benefit: latency reduction
- For small messages (up to ~32KB), OS processing latency matters
- For large messages (100KB and above), transmission speed matters

RDMA benefit: CPU overhead reduction
[Figure: sender and receiver CPU utilization with one ND connection over a 40G NIC, 37Gb/s goodput]

RDMA benefit: CPU overhead reduction
- Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, two sockets, 28 cores
- RDMA: single QP, 88 Gb/s, 1.7% CPU
- TCP: eight connections, 30-50 Gb/s, client 2.6%, server 4.3% CPU

RoCEv2 needs a lossless Ethernet network
- PFC for hop-by-hop flow control
- DCQCN for connection-level congestion control

Priority-based flow control (PFC)
[Figure: ingress and egress ports with eight priority queues (p0-p7), an XOFF threshold, data packets, and PFC pause frames]
- Hop-by-hop flow control, with eight priorities for HOL blocking mitigation
- The priority of a data packet is carried in the VLAN tag or DSCP
- A PFC pause frame informs the upstream device to stop sending
- PFC causes HOL blocking and collateral damage
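
The per-priority pause decision can be summarized with a small sketch (my simplification; real switches implement this in the ingress buffer manager, and the thresholds and frame format shown are illustrative): when a priority class's occupancy on an ingress port crosses the XOFF threshold, the switch pauses that priority upstream, and resumes it once occupancy falls below an XON threshold.

```c
/* Conceptual sketch of per-priority PFC at an ingress port (not a real switch API). */
#include <stdbool.h>
#include <stdint.h>

#define NUM_PRIORITIES 8

struct ingress_port {
    uint32_t buffered_bytes[NUM_PRIORITIES]; /* occupancy per priority class */
    bool     paused[NUM_PRIORITIES];         /* whether upstream is currently paused */
    uint32_t xoff_threshold;                 /* pause when occupancy rises above this */
    uint32_t xon_threshold;                  /* resume when occupancy falls below this */
};

/* Called whenever occupancy for a priority changes. */
void pfc_check(struct ingress_port *port, int prio,
               void (*send_pause)(int prio, uint16_t quanta))
{
    if (!port->paused[prio] && port->buffered_bytes[prio] > port->xoff_threshold) {
        send_pause(prio, 0xFFFF);   /* XOFF: ask the upstream hop to stop this priority */
        port->paused[prio] = true;
    } else if (port->paused[prio] && port->buffered_bytes[prio] < port->xon_threshold) {
        send_pause(prio, 0);        /* XON: a pause time of 0 resumes transmission */
        port->paused[prio] = false;
    }
}
```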

DCQCN
- Sender NIC: Reaction Point (RP); Switch: Congestion Point (CP); Receiver NIC: Notification Point (NP)
- DCQCN keeps PFC and uses ECN plus hardware rate-based congestion control
- CP: switches use ECN for packet marking
- NP: periodically checks whether ECN-marked packets arrived and, if so, notifies the sender
- RP: adjusts the sending rate based on NP feedback
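
A much-simplified sketch of the reaction-point logic, after the published DCQCN description (the byte counter, the separate timers, and the additive/hyper-increase stages are all omitted, and parameter names are mine): on a congestion notification the sender cuts its rate in proportion to an estimate of the marking rate, and in the absence of notifications it recovers toward a target rate.

```c
/* Simplified sketch of the DCQCN reaction point (sender NIC). */
#include <stdint.h>

struct dcqcn_rp {
    double rc;     /* current sending rate */
    double rt;     /* target rate to recover toward */
    double alpha;  /* estimate of the fraction of marked packets */
    double g;      /* averaging gain, e.g. 1/256 */
};

/* NP signalled that ECN-marked packets arrived (a congestion notification). */
void dcqcn_on_cnp(struct dcqcn_rp *rp)
{
    rp->rt    = rp->rc;                              /* remember the rate before the cut */
    rp->rc    = rp->rc * (1.0 - rp->alpha / 2.0);    /* multiplicative decrease */
    rp->alpha = (1.0 - rp->g) * rp->alpha + rp->g;   /* congestion estimate rises */
}

/* Called periodically when no notification has arrived. */
void dcqcn_recover(struct dcqcn_rp *rp)
{
    rp->alpha = (1.0 - rp->g) * rp->alpha;           /* congestion estimate decays */
    rp->rc    = (rp->rt + rp->rc) / 2.0;             /* fast recovery toward the target */
}
```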

The lossless requirement causes safety and performance challenges
- RDMA transport livelock
- PFC pause frame storm
- Slow-receiver symptom
- PFC deadlock

RDMA transport livelock
[Figure: sender and receiver connected through a switch with a packet drop rate of 1/256; after a NAK for packet N, go-back-0 retransmits from RDMA Send 0 while go-back-N resumes from RDMA Send N]
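
The difference between the two retransmission policies can be sketched with a tiny simulation (my illustration, not the production NIC logic): with go-back-0, every NAK restarts the message from the first packet, so even a 1/256 drop rate keeps a long transfer from ever finishing; go-back-N resumes from the lost packet.

```c
/* Sketch contrasting the two retransmission policies from the livelock experiment. */
#include <stdint.h>
#include <stdlib.h>

enum policy { GO_BACK_0, GO_BACK_N };

/* Returns how many packets were sent before all 'total' packets got through,
 * or 'give_up' if the transfer never completes within that budget. */
static uint64_t transfer(uint32_t total, enum policy p, uint64_t give_up)
{
    uint64_t sent = 0;
    uint32_t next = 0;                            /* next in-order packet to deliver */
    while (next < total && sent < give_up) {
        sent++;
        if (rand() % 256 == 0)                    /* packet lost: receiver NAKs 'next' */
            next = (p == GO_BACK_0) ? 0 : next;   /* go-back-0 restarts from packet 0 */
        else
            next++;                               /* delivered in order */
    }
    return sent;
}
```

With a few thousand packets, the go-back-0 variant essentially never terminates within any reasonable budget, while go-back-N finishes after roughly total * 256/255 transmissions.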

PFC deadlock
- Our data centers use Clos networks (Spine, Podset, Leaf, Pod, ToR, Servers)
- Packets first travel up, then go down
- No cyclic buffer dependency for up-down routing, hence no deadlock
- But we did experience deadlock!

PFC deadlock
Preliminaries:
- ARP table: IP address to MAC address mapping (e.g., IP0 -> MAC0, TTL 2h; IP1 -> MAC1, TTL 1h)
- MAC table: MAC address to port mapping (e.g., MAC0 -> Port0, TTL 10min; MAC1 -> missing)
- If the MAC entry is missing, packets are flooded to all ports

PFC deadlock
[Figure: deadlock example with ToRs T0 and T1, leaves La and Lb, and servers S1-S5; paths {S1, T0, La, T1, S3}, {S1, T0, La, T1, S5}, and {S4, T1, Lb, T0, S2}; a dead server causes packet drops, and PFC pause frames propagate between the congested egress and ingress ports]

PFC deadlock
- The PFC deadlock root cause: the interaction between PFC flow control and Ethernet packet flooding
- Solution: drop lossless packets if the ARP entry is incomplete
- Recommendation: do not flood or multicast lossless traffic
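
A sketch of the forwarding decision behind the root cause and the fix (my simplification, not switch firmware): when the MAC table misses, default Ethernet behavior floods the packet to all ports; the fix is to drop packets of lossless classes instead, so an incomplete ARP/MAC entry cannot pull lossless traffic onto paths that form a cyclic buffer dependency.

```c
/* Conceptual sketch of the forwarding decision and the deadlock fix.
 * 'lossless' means the packet belongs to a PFC-enabled priority class. */
#include <stdbool.h>

enum action { FORWARD, FLOOD, DROP };

enum action forward_decision(bool mac_entry_present, bool lossless)
{
    if (mac_entry_present)
        return FORWARD;            /* normal unicast forwarding */

    /* MAC entry missing (e.g. incomplete ARP): flooding a lossless packet
     * can create a cyclic buffer dependency with PFC and deadlock the
     * lossless network. */
    if (lossless)
        return DROP;               /* the fix: never flood lossless traffic */

    return FLOOD;                  /* lossy traffic may still be flooded */
}
```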

Tagger: practical PFC deadlock prevention
- Concept: Expected Lossless Path (ELP) to decouple Tagger from routing
- Strategy: move packets to a different lossless queue before a cyclic buffer dependency (CBD) forms
- The Tagger algorithm works for general network topologies
- Deployable in existing switching ASICs
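
As a rough sketch of the mechanism (simplified; the rule contents below are illustrative, and in the real design the rules are precomputed from the topology's Expected Lossless Paths): each switch holds match-action rules keyed on (ingress port, tag); a packet that deviates from its expected lossless path gets a new tag and therefore a different lossless queue, so no cyclic buffer dependency can form within any single queue.

```c
/* Simplified sketch of Tagger-style match-action rules in a switch. */
#include <stdint.h>
#include <stddef.h>

struct tagger_rule {
    uint16_t in_port;   /* ingress port the packet arrived on */
    uint8_t  tag;       /* tag carried in the packet (e.g. in DSCP) */
    uint8_t  new_tag;   /* tag to rewrite before forwarding */
    uint8_t  queue;     /* lossless queue associated with new_tag */
};

/* Look up the rule for (in_port, tag); falls back to the last rule as a default. */
const struct tagger_rule *tagger_lookup(const struct tagger_rule *rules, size_t n,
                                        uint16_t in_port, uint8_t tag)
{
    for (size_t i = 0; i + 1 < n; i++)
        if (rules[i].in_port == in_port && rules[i].tag == tag)
            return &rules[i];
    return &rules[n - 1];  /* assume the last rule is a catch-all default */
}
```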

NIC PFC pause frame storm
[Figure: spine layer, two podsets with leaf switches and ToRs, and a malfunctioning NIC on one server blocking the network]
- A malfunctioning NIC may block the whole network
- PFC pause frame storms caused several incidents
- Solution: watchdogs at both the NIC and switch sides to stop the storm
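
The watchdog idea can be sketched simply (my simplification of the mitigation; the deadline value and state layout are illustrative): if a queue has been continuously paused for longer than a deadline, stop honoring or generating PFC on that queue so a single malfunctioning NIC cannot freeze the network.

```c
/* Conceptual sketch of a PFC storm watchdog (timer values are illustrative). */
#include <stdbool.h>
#include <stdint.h>

struct pfc_watchdog {
    uint64_t paused_since_ms;   /* when the queue entered the paused state; 0 if not paused */
    uint64_t deadline_ms;       /* e.g. a few hundred milliseconds */
    bool     pfc_disabled;      /* once tripped, ignore/stop pause frames on this queue */
};

void watchdog_tick(struct pfc_watchdog *wd, bool queue_paused, uint64_t now_ms)
{
    if (!queue_paused) {
        wd->paused_since_ms = 0;           /* queue drained; reset the watchdog */
        return;
    }
    if (wd->paused_since_ms == 0)
        wd->paused_since_ms = now_ms;      /* start timing this pause episode */
    else if (now_ms - wd->paused_since_ms > wd->deadline_ms)
        wd->pfc_disabled = true;           /* storm detected: break the pause chain */
}
```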

The slow-receiver symptom
[Figure: server with CPU, DRAM, and a NIC holding the MTT and QPC; the ToR-to-NIC link is 40Gb/s (QSFP) and the NIC-to-server path is 64Gb/s (PCIe Gen3 x8); the NIC sends pause frames toward the ToR]
- ToR to NIC is 40Gb/s; NIC to server is 64Gb/s
- But NICs may generate a large number of PFC pause frames
- Root cause: the NIC is resource constrained
- Mitigations: larger page size for the MTT (memory translation table) entries; dynamic buffer sharing at the ToR
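
One way to read the MTT mitigation (a sketch under the assumption that the buffer can be backed by huge pages; not the exact production change, and whether the driver actually coalesces translation entries this way depends on the NIC and driver): registering a huge-page-backed buffer lets each translation entry cover far more memory, so the NIC's cached translation state stops thrashing.

```c
/* Sketch: back an RDMA buffer with huge pages so each MTT entry covers
 * more memory and the NIC's translation cache holds fewer entries.
 * Assumes hugetlb pages are configured on the host. */
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <stddef.h>

struct ibv_mr *register_hugepage_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* The NIC builds its memory translation table (MTT) for this region;
     * with 2 MB pages far fewer entries are needed than with 4 KB pages. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```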

Deployment experiences and lessons learned

Latency reduction
- RoCEv2 deployed in Bing world-wide for two and a half years
- Significant latency reduction
- Incast problem solved, as there are no packet drops

RDMA throughput
- Using two podsets, each with 500 servers
- 5Tb/s capacity between the two podsets
- Achieved 3Tb/s inter-podset throughput, bottlenecked by ECMP routing
- Close to 0 CPU overhead

Latency and throughput tradeoff
[Figure: RDMA latency (us) before and during data shuffling, measured between servers S0,0-S0,23 and S1,0-S1,23 under ToRs T0/T1 and leaves L0/L1]
- RDMA latencies increase as data shuffling starts
- Low latency vs. high throughput

Lessons learned
- Providing losslessness is hard! Deadlock, livelock, and PFC pause frame propagation and storms did happen
- Be prepared for the unexpected: configuration management, latency/availability, PFC pause frame and RDMA traffic monitoring
- NICs are the key to making RoCEv2 work

What's next?

Applications
- RDMA for X (Search, Storage, HFT, DNN, etc.)
Architectures
- Software vs. hardware
- Lossy vs. lossless network
- RDMA for heterogeneous computing systems
Technologies
- RDMA programming
- RDMA virtualization
- RDMA security
- Inter-DC RDMA
Protocols
- Practical, large-scale deadlock-free networks
- Reducing collateral damage

Will software win (again)?
- Historically, software-based packet processing won (multiple times)
  - TCP processing overhead analysis by David Clark, et al.
  - None of the stateful TCP offloads took off (e.g., TCP Chimney)
- The story is different this time
  - Moore's law is ending
  - Accelerators are coming
  - Network speeds keep increasing
  - Demands for ultra-low latency are real

Is lossless mandatory for RDMA?
- There is no binding between RDMA and a lossless network
- But implementing a more sophisticated transport protocol in hardware is a challenge

RDMA virtualization for container networking
[Figure: FreeFlow architecture with a FreeFlow network orchestrator, hosts running containers (each with a vNIC and the FreeFlow NetLib), a control agent with a FreeFlow shared memory space, and the physical RDMA NIC on the host network]
- A router acts as a proxy for the containers
- Shared memory for improved performance
- Zero copy possible

RDMA for DNN
- TCP does not work well for distributed DNN training
- For 16-GPU, 2-host speech training with CNTK, TCP communication dominates the training time (72%); RDMA is much faster (44%)

RDMA Programming
- How many LOC for a "hello world" communication using RDMA?
- For TCP, it is 60 LOC for client or server code
- For RDMA, it is complicated:
  - IB Verbs: 600 LOC
  - RDMA CM: 300 LOC
  - Rsocket: 60 LOC
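
To make the LOC comparison concrete, here is roughly what the Rsocket path looks like: a hypothetical client sketch (the server address and port are placeholders) using the socket-like, `r`-prefixed calls from librdmacm.

```c
/* Minimal Rsocket "hello world" client sketch (link with -lrdmacm).
 * The server address and port below are placeholders. */
#include <rdma/rsocket.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(7471) };
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);   /* placeholder server address */

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);         /* socket() -> rsocket() */
    if (fd < 0 || rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("rsocket/rconnect");
        return 1;
    }

    const char *msg = "hello world";
    rsend(fd, msg, strlen(msg), 0);                    /* send() -> rsend() */

    char buf[64];
    ssize_t n = rrecv(fd, buf, sizeof(buf) - 1, 0);    /* recv() -> rrecv() */
    if (n > 0) { buf[n] = '\0'; printf("reply: %s\n", buf); }

    rclose(fd);                                        /* close() -> rclose() */
    return 0;
}
```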

RDMA Programming
- Make RDMA programming more accessible
- Easy-to-set-up RDMA server and switch configurations
- Can I run and debug my RDMA code on my desktop/laptop?
- High-quality code samples
- Loosely coupled vs. tightly coupled (Send/Recv vs. Write/Read)

Summary: RDMA for data centers!
- RDMA is experiencing a renaissance in data centers
- RoCEv2 has been running safely in Microsoft data centers for two and a half years
- Many opportunities and interesting problems in high-speed, low-latency RDMA networking
- Many opportunities in making RDMA accessible to more developers

Acknowledgement
- Yan Cai, Gang Cheng, Zhong Deng, Daniel Firestone, Juncheng Gu, Shuihai Hu, Hongqiang Liu, Marina Lipshteyn, Ali Monfared, Jitendra Padhye, Gaurav Soni, Haitao Wu, Jianxi Ye, Yibo Zhu
- Azure, Bing, CNTK, Philly collaborators
- Arista Networks, Cisco, Dell, Mellanox partners

Questions?