Transcription
RDMA in Data Centers: Looking Back and Looking Forward
Chuanxiong Guo
Microsoft Research
ACM SIGCOMM APNet 2017, August 3, 2017
The Rise of Cloud Computing

- 40 Azure regions
Data Centers
Data center networks (DCN)

- Cloud-scale services: IaaS, PaaS, Search, Big Data, Storage, Machine Learning, Deep Learning
- Services are latency sensitive, bandwidth hungry, or both
- Cloud-scale services need cloud-scale computing and communication infrastructure
Data center networks (DCN)

[Figure: Clos topology with Spine, Leaf (podset), and ToR layers connecting the servers]

- Single ownership
- Large scale
- High bisection bandwidth
- Commodity Ethernet switches
- TCP/IP protocol suite
But TCP/IP is not doing well
TCP latency

- Long latency tail (Pingmesh measurement results): 405 us (P50), 716 us (P90), 2132 us (P99)
TCP processing overhead (40G)

- 8 TCP connections, 40G NIC

[Figure: sender and receiver CPU overhead]
An RDMA renaissance story
[Timeline]
- Virtual Interface Architecture Spec 1.0: 1997
- InfiniBand Architecture Spec 1.0: 2000; 1.1: 2002; 1.2: 2004; 1.3: 2015
- RoCE: 2010
- RoCEv2: 2014
RDMA

- Remote Direct Memory Access (RDMA): a method of accessing memory on a remote system without interrupting the processing of the CPU(s) on that system
- RDMA offloads packet processing protocols to the NIC
- RDMA in Ethernet-based data centers
RoCEv2: RDMA over Commodity Ethernet

[Figure: user/kernel/hardware stack — RDMA apps and RDMA verbs in user space; TCP/IP and the NIC driver in the kernel; RDMA transport, IP, and Ethernet processing plus DMA on the NIC; runs over a lossless network]

- RoCEv2 for Ethernet-based data centers
- RoCEv2 encapsulates packets in UDP
- OS kernel is not in the data path
- NIC for network protocol processing and message DMA
RDMA benefit: latency reduction

- For small msgs (up to ~32KB), OS processing latency matters
- For large msgs (100KB and up), speed matters
RDMA benefit: CPU overhead reduction

- One ND connection, 40G NIC, 37 Gb/s goodput

[Figure: sender and receiver CPU overhead]
RDMA benefit: CPU overhead reduction

Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, two sockets, 28 cores
- RDMA: single QP, 88 Gb/s, 1.7% CPU
- TCP: eight connections, 30-50 Gb/s, client: 2.6%, server: 4.3% CPU
RoCEv2 needs a lossless Ethernet network

- PFC for hop-by-hop flow control
- DCQCN for connection-level congestion control
Priority-based flow control (PFC)

[Figure: ingress port with per-priority queues (p0-p7) and an XOFF threshold; PFC pause frames sent upstream to the egress port]

- Hop-by-hop flow control, with eight priorities for HOL blocking mitigation
- The priority in data packets is carried in the VLAN tag or DSCP
- A PFC pause frame informs the upstream to stop
- PFC causes HOL blocking and collateral damage
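The XOFF/XON mechanism above can be sketched as a toy model (illustrative only, not the switch ASIC logic; the threshold values are made up):

```python
# Minimal sketch of PFC hop-by-hop flow control. An ingress queue sends
# XOFF (pause) upstream when its depth crosses a threshold, and XON
# (resume) once it drains below a lower one, so the buffer never overflows.
XOFF_THRESHOLD = 8   # hypothetical values, in packets
XON_THRESHOLD = 4

class IngressQueue:
    def __init__(self):
        self.depth = 0
        self.paused_upstream = False

    def enqueue(self):
        self.depth += 1
        if self.depth >= XOFF_THRESHOLD and not self.paused_upstream:
            self.paused_upstream = True   # send PFC pause (XOFF) upstream

    def dequeue(self):
        if self.depth > 0:
            self.depth -= 1
        if self.depth <= XON_THRESHOLD and self.paused_upstream:
            self.paused_upstream = False  # send PFC resume (XON) upstream

q = IngressQueue()
for _ in range(8):
    q.enqueue()
assert q.paused_upstream       # upstream paused instead of dropping packets
for _ in range(4):
    q.dequeue()
assert not q.paused_upstream   # drained below XON, upstream resumes
```

The gap between the XOFF threshold and the port's remaining headroom must absorb the packets already in flight from the upstream hop, which is why PFC thresholds depend on link speed and cable length.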
DCQCN

[Figure: sender NIC as Reaction Point (RP), switch as Congestion Point (CP), receiver NIC as Notification Point (NP)]

- Keep PFC
- Use ECN plus hardware rate-based congestion control
- CP: switches use ECN for packet marking
- NP: periodically check if ECN-marked packets arrived; if so, notify the sender
- RP: adjust the sending rate based on NP feedback
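The RP behavior can be sketched as a simplified rate update (a toy model: variable names follow the DCQCN paper's Rc/Rt/alpha, but the constants are illustrative and the real algorithm has additional timers and increase stages):

```python
# Simplified sketch of the DCQCN reaction-point (RP) rate control.
class ReactionPoint:
    def __init__(self, line_rate_gbps=40.0, g=1 / 16):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate to recover toward
        self.alpha = 1.0           # estimate of congestion extent
        self.g = g                 # gain for the alpha update

    def on_cnp(self):
        # NP reported ECN-marked packets: cut the rate multiplicatively.
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer_no_cnp(self):
        # No congestion feedback in this period: decay alpha and
        # recover the current rate halfway toward the target.
        self.alpha = (1 - self.g) * self.alpha
        self.rc = (self.rc + self.rt) / 2

rp = ReactionPoint()
rp.on_cnp()
print(round(rp.rc, 1))   # 20.0: first CNP halves the rate (alpha starts at 1)
rp.on_timer_no_cnp()
print(round(rp.rc, 1))   # 30.0: halfway back toward the 40 Gb/s target
```

The key property is that both paths run entirely in NIC and switch hardware: no per-packet work lands on the host CPU.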
The lossless requirement causes safety and performance challenges

- RDMA transport livelock
- PFC pause frame storm
- Slow-receiver symptom
- PFC deadlock
RDMA transport livelock

[Figure: sender/receiver sequence diagrams through a switch with packet drop rate 1/256 — on NAK N, go-back-0 retransmits from RDMA Send 0, while go-back-N resumes from RDMA Send N]
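The livelock can be shown with a toy simulation (illustrative assumptions: a deterministic 1-in-256 drop and a 1000-packet message; real losses are random, but the conclusion is the same):

```python
# Why go-back-0 livelocks under even a tiny loss rate, while go-back-N
# finishes with only a handful of retransmissions.
def transmissions(msg_len, go_back_zero, max_tx=1_000_000):
    """Packets put on the wire until the message is delivered in order,
    or None if the transfer never completes (livelock)."""
    sent = 0   # total packets transmitted
    seq = 0    # next in-order sequence number the receiver expects
    while seq < msg_len:
        sent += 1
        if sent % 256 == 0:       # deterministic 1-in-256 drop
            if go_back_zero:
                seq = 0           # go-back-0: restart the whole message
            # go-back-N: resend from the lost packet; seq is unchanged
        else:
            seq += 1
        if sent >= max_tx:
            return None
    return sent

print(transmissions(1000, go_back_zero=False))  # 1003: 3 retransmissions
print(transmissions(1000, go_back_zero=True))   # None: never completes
```

With go-back-0, the sender needs 1000 consecutive successes but a drop arrives at least every 256 packets, so progress resets forever; go-back-N only pays for the packets actually lost.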
PFC deadlock

- Our data centers use Clos networks
- Packets first travel up, then go down
- No cyclic buffer dependency for up-down routing, hence no deadlock
- But we did experience deadlock!
PFC deadlock: preliminaries

- ARP table: IP address to MAC address mapping
- MAC table: MAC address to port mapping
- If the MAC entry is missing, packets are flooded to all ports

[Figure: ARP table (IP0 -> MAC0, TTL 2h; IP1 -> MAC1, TTL 1h) and MAC table (MAC0 -> Port0, TTL 10min; the MAC1 entry is missing)]
PFC deadlock

[Figure: packets on paths {S1, T0, La, T1, S3}, {S1, T0, La, T1, S5}, and {S4, T1, Lb, T0, S2}; a dead server causes flooding at the ToRs, congested ports generate PFC pause frames between T0, T1, La, and Lb, and a cyclic buffer dependency forms]
PFC deadlock

- Root cause: the interaction between PFC flow control and Ethernet packet flooding
- Solution: drop the lossless packets if the ARP entry is incomplete
- Recommendation: do not flood or multicast for lossless traffic
Tagger: practical PFC deadlock prevention

- Concept: Expected Lossless Path (ELP) to decouple Tagger from routing
- Strategy: move packets to a different lossless queue before a CBD (cyclic buffer dependency) forms
- The Tagger algorithm works for general network topologies
- Deployable in existing switching ASICs
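The deadlock condition itself reduces to a cycle in the buffer-dependency graph, which a standard DFS can detect (an illustrative check only; Tagger's contribution is assigning queues so that no cycle can form in the first place):

```python
# Nodes are (switch, lossless queue) buffers; an edge A -> B means a
# paused B can keep packets parked in A. Deadlock freedom == acyclic.
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def dfs(v):
        color[v] = GRAY
        for w in graph.get(v, ()):
            if color.get(w, WHITE) == GRAY:
                return True            # back edge: cyclic buffer dependency
            if color.get(w, WHITE) == WHITE and dfs(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and dfs(v) for v in graph)

# Up-down routing: dependencies only go up then down, so no cycle.
updown = {"T0": ["La"], "La": ["T1"], "T1": []}
print(has_cycle(updown))   # False

# The flooding bug adds a down->up dependency, closing a cycle.
buggy = {"T0": ["La"], "La": ["T1"], "T1": ["Lb"], "Lb": ["T0"]}
print(has_cycle(buggy))    # True
```

The node and edge names here mirror the T0/T1/La/Lb example from the deadlock slides.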
NIC PFC pause frame storm

[Figure: spine layer, leaf layer, and ToRs over podsets 0 and 1; a malfunctioning NIC at a server propagates pause frames up through the network]

- A malfunctioning NIC may block the whole network
- PFC pause frame storms caused several incidents
- Solution: watchdogs at both NIC and switch sides to stop the storm
The slow-receiver symptom

[Figure: server with CPU and DRAM connected to the NIC over PCIe Gen3 x8 (64 Gb/s); the NIC holds MTT, WQE, and QPC state and sends pause frames to the ToR over its 40 Gb/s QSFP port]

- ToR to NIC is 40 Gb/s; NIC to server is 64 Gb/s
- But NICs may generate a large number of PFC pause frames
- Root cause: the NIC is resource constrained
- Mitigations:
  - Larger page size for the MTT (memory translation table) entries
  - Dynamic buffer sharing at the ToR
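The MTT mitigation is back-of-envelope arithmetic (an illustrative sketch; the region size and page sizes below are examples, not measured values from the deployment):

```python
# Each MTT entry translates one page of registered memory. A larger page
# size means far fewer entries for the same region, so the NIC's limited
# on-chip MTT cache misses less often and stalls the receive path less.
def mtt_entries(region_bytes, page_bytes):
    return (region_bytes + page_bytes - 1) // page_bytes  # ceil division

GB = 1024 ** 3
print(mtt_entries(4 * GB, 4096))          # 1048576 entries at 4 KB pages
print(mtt_entries(4 * GB, 2 * 1024**2))   # 2048 entries at 2 MB pages
```

A 512x reduction in entries is the difference between thrashing the NIC's translation cache and fitting the working set on chip.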
Deployment experiences and lessons learned
Latency reduction

- RoCEv2 deployed in Bing world-wide for two and a half years
- Significant latency reduction
- Incast problem solved, as there are no packet drops
RDMA throughput

- Using two podsets, each with 500 servers
- 5 Tb/s capacity between the two podsets
- Achieved 3 Tb/s inter-podset throughput, with close to 0 CPU overhead
- Bottlenecked by ECMP routing
Latency and throughput tradeoff

[Figure: latency CDFs before and during data shuffling across T0/T1 and L0/L1]

- RDMA latencies increase once data shuffling starts
- Low latency vs. high throughput
Lessons learned

- Providing losslessness is hard! Deadlock, livelock, and PFC pause frame propagation and storms did happen
- Be prepared for the unexpected: configuration management, latency/availability, PFC pause frame and RDMA traffic monitoring
- NICs are the key to making RoCEv2 work
What’s next?
- Applications: RDMA for X (Search, Storage, HFT, DNN, etc.)
- Architectures: software vs. hardware; lossy vs. lossless networks; RDMA for heterogeneous computing systems
- Technologies: RDMA programming; RDMA virtualization; RDMA security; inter-DC RDMA
- Protocols: practical, large-scale deadlock-free networks; reducing collateral damage
Will software win (again)?

- Historically, software-based packet processing won (multiple times)
  - TCP processing overhead analysis by David Clark, et al.
  - None of the stateful TCP offloads took off (e.g., TCP Chimney)
- The story is different this time
  - Moore’s law is ending
  - Accelerators are coming
  - Network speeds keep increasing
  - Demands for ultra-low latency are real
Is lossless mandatory for RDMA?

- There is no binding between RDMA and lossless networks
- But implementing a more sophisticated transport protocol in hardware is a challenge
RDMA virtualization for container networking

[Figure: FreeFlow architecture — a network orchestrator coordinates control agents on Host1 and Host2; containers (e.g., IP 1.1.1.1 and 3.3.3.3) use the FreeFlow NetLib over a shared memory space, with the physical RDMA NIC behind a vNIC]

- A router acts as a proxy for the containers
- Shared memory for improved performance
- Zero copy is possible
RDMA for DNN

- TCP does not work well for distributed DNN training
- For a 16-GPU, 2-host speech training job with CNTK, TCP communication dominates the training time (72%); RDMA is much faster (44%)
RDMA Programming

- How many LOC for a “hello world” communication using RDMA?
- For TCP, it is 60 LOC for client or server code
- For RDMA, it is complicated:
  - IB Verbs: 600 LOC
  - RDMA CM: 300 LOC
  - Rsocket: 60 LOC
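For scale, the TCP baseline really is tiny; a sketch of the whole client/server "hello world" in Python over loopback (the 60-LOC figure on the slide is for C sockets; there is no comparably short path through raw IB Verbs):

```python
# Minimal TCP "hello world": server accepts one connection and sends a
# greeting; client connects and reads it. Uses an ephemeral port so no
# fixed port number is assumed.
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))          # OS picks a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve():
    conn, _ = srv.accept()
    conn.sendall(b"hello world")
    conn.close()

t = threading.Thread(target=serve)
t.start()
cli = socket.create_connection(("127.0.0.1", port))
msg = cli.recv(64)
cli.close()
t.join()
srv.close()
print(msg.decode())   # hello world
```

The verbs equivalent must additionally open a device, allocate a protection domain, register memory, create and transition queue pairs, and exchange connection metadata out of band, which is where the 600 LOC goes.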
RDMA Programming

- Make RDMA programming more accessible
- Easy-to-set-up RDMA server and switch configurations
- Can I run and debug my RDMA code on my desktop/laptop?
- High quality code samples
- Loosely coupled vs. tightly coupled (Send/Recv vs. Write/Read)
Summary: RDMA for data centers!

- RDMA is experiencing a renaissance in data centers
- RoCEv2 has been running safely in Microsoft data centers for two and a half years
- Many opportunities and interesting problems in high-speed, low-latency RDMA networking
- Many opportunities in making RDMA accessible to more developers
Acknowledgements

- Yan Cai, Gang Cheng, Zhong Deng, Daniel Firestone, Juncheng Gu, Shuihai Hu, Hongqiang Liu, Marina Lipshteyn, Ali Monfared, Jitendra Padhye, Gaurav Soni, Haitao Wu, Jianxi Ye, Yibo Zhu
- Azure, Bing, CNTK, Philly collaborators
- Arista Networks, Cisco, Dell, Mellanox partners
Questions?