NVMe Over Fabrics Discussion On Transports


Architected for Performance
NVMe over Fabrics – Discussion on Transports
Sponsored by the NVM Express organization, the owner of the NVMe, NVMe-oF, and NVMe-MI standards

NVM Express Sponsored Track for Flash Memory Summit 2018

8/7/18, 8:30-9:35 – NVM Express: NVM Express roadmaps and market data for NVMe, NVMe-oF, and NVMe-MI – what you need to know for the next year.
  Speakers: Janene Ellefson (Micron), J Metz (Cisco), Amber Huffman (Intel), David Allen (Seagate)
8/7/18, 9:45-10:50 – NVMe architectures in Hyperscale Data Centers, Enterprise Data Centers, and the Client and Laptop space.
  Speakers: Janene Ellefson (Micron), Chris Peterson (Facebook), Andy Yang (Toshiba), Jonmichael Hands (Intel)
8/7/18, 3:40-4:45 – NVMe Drivers and Software: the software and drivers required for NVMe-MI, NVMe, and NVMe-oF, and support from the top operating systems.
  Speakers: Uma Parepalli (Cavium), Austin Bolen (Dell EMC), Myron Loewen (Intel), Lee Prewitt (Microsoft), Suds Jain (VMware), David Minturn (Intel), James Harris (Intel)
8/7/18, 4:55-6:00 – NVMe-oF Transports: NVMe over Fibre Channel, NVMe over RDMA, and NVMe over TCP.
  Speakers: Brandon Hoff (Emulex), Fazil Osman (Broadcom), J Metz (Cisco), Curt Beckmann (Brocade), Praveen Midha (Marvell)
8/8/18, 8:30-9:35 – NVMe-oF Enterprise Arrays: NVMe-oF and NVMe are improving the performance of classic storage arrays, a multi-billion-dollar market.
  Speakers: Brandon Hoff (Emulex), Michael Peppers (NetApp), Clod Barrera (IBM), Fred Knight (NetApp), Brent Yardley (IBM)
8/8/18, 9:45-10:50 – NVMe-oF Appliances: solutions that deliver high-performance, low-latency NVMe storage to automated, orchestration-managed clouds.
  Speakers: Jeremy Warner (Toshiba), Manoj Wadekar (eBay), Kamal Hyder (Toshiba), Nishant Lodha (Marvell), Lior Gal (Excelero)
8/8/18, 3:20-4:25 – NVMe-oF JBOFs: replacing DAS storage with Composable Infrastructure (disaggregated storage), based on JBOFs as the storage target.
  Speakers: Bryan Cowger (Kazan Networks), Praveen Midha (Marvell), Fazil Osman (Broadcom)
8/8/18, 4:40-5:45 – Testing and Interoperability: testing for conformance, interoperability, and resilience/error injection to ensure interoperable solutions based on NVM Express.
  Speakers: Brandon Hoff (Emulex), Tim Sheehan (IOL), Mark Jones (FCIA), Jason Rusch (Viavi), Nick Kriczky (Teledyne)

Speakers: Brandon Hoff, Fazil Osman, Curt Beckmann, Praveen Midha, J Metz

Abstract and Agenda

NVMe-oF Abstract: NVMe over Fabrics is designed to be transport agnostic, with all transports being created equal from the perspective of NVM Express. We will cover NVMe over Fibre Channel, NVMe over RDMA, and NVMe over TCP.

NVMe-oF Panel:
- NVMe-oF Overview and Scope of our Panel – Brandon Hoff, Emulex (10 min)
- NVMe over Fibre Channel (NVMe/FC) – Curt Beckmann, Brocade (10 min)
- NVMe over RoCE (NVMe/RoCE) – Fazil Osman, Broadcom Classic (10 min)
- NVMe over iWARP (NVMe/iWARP) – Praveen Midha, Marvell/QLogic (10 min)
- NVMe over TCP (NVMe/TCP) – J Metz, Cisco (10 min)
- Q&A (15 min)

Architected for Performance
NVMe over Fabrics
Brandon Hoff, Principal Architect, Emulex

NVMe Feature Roadmap

Released NVMe specifications:
- NVMe 1.2 (Nov '14): Namespace Management, Controller Memory Buffer, Host Memory Buffer, Live Firmware Update
- NVMe 1.2.1 (May '16)
- NVMe 1.3
- NVMe-oF 1.0 (May '16): transport and protocol, RDMA binding
- NVMe-MI 1.0 (Nov '15): out-of-band management, device discovery, health & temperature monitoring, firmware update

Planned releases (* subject to change):
- NVMe 1.4*: Sanitize, Streams, Virtualization, IO Determinism, Persistent Memory Region, Multipathing
- NVMe-oF 1.1*: Enhanced Discovery, TCP Transport Binding
- NVMe-MI 1.1: SES-Based Enclosure Management, NVMe-MI In-band, Storage Device Enhancements

The Value of Shared Storage and the 'need for speed'
- The cost of data-at-rest is no longer the right metric for storage TCO; the value of data is based on how fast it can be accessed and processed
- NVMe over Fabrics increases the velocity of data
  - Faster storage access enables cost reduction through consolidation
  - Faster storage access delivers more value from data
- SSDs are going to become much faster
  - 3D XPoint Memory, 3D NAND, etc.
  - PMEM, Storage Class Memory, etc.
  - and innovation will continue

Simplicity of NVMe over Fabrics
[Diagram: Application Server and Storage Target connected by NVMe-oF; data is DMA'd into and out of the adapters on each side and transferred over a fabric]
- NVMe-oF delivers a new level of performance for today's business-critical applications
- NVMe-oF is, by design, transport agnostic: application developers can write to a single block storage stack and access NVMe over Fibre Channel, TCP, or RDMA networks
- Data is DMA'd in and out of the adapters to maximize performance
- Zero copy is available today for the Fibre Channel and RDMA protocols for improved performance, and there are solutions that can provide zero copy for TCP

Scaling NVMe Requires a (Real) Network – Many options, plenty of confusion
[Diagram: NVMe Server Software → Server Transport Abstraction → Fibre Channel / RoCEv2 / iWARP / InfiniBand / FCoE / TCP → Storage Transport Abstraction → NVMe SSDs]
- Fibre Channel is the transport for the vast majority of today's all-flash arrays; FC-NVMe was standardized in mid-2017
- RoCEv2, iWARP, and InfiniBand are RDMA-based but not compatible with each other; NVMe-oF over RDMA was standardized in 2016
- FCoE as a fabric is an option; it leverages the FC stack and is integrated into NVMe-oF 1.0
- NVMe/TCP is making its way through the standards

NVMe over Fabrics - Architecture
[Layered diagram, top to bottom:]
- NVMe: NVMe architecture, queuing interface, admin command and I/O command sets, properties, ANA (Session NVMe-101-1)
- NVMe-oF: NVMe over Fabrics architecture, queuing interface, admin command & I/O command sets, properties; fabric-specific properties, transport-specific features/specialization (Session NVMe-102-1)
- Transport Binding Specification: NVMe transport binding services
- NVMe Transport
- Fabric Protocol (may include multiple fabric protocol layers)
- Physical Fabric (e.g., Ethernet, InfiniBand, Fibre Channel)

Architected for Performance
NVMe over Fibre Channel
Curt Beckmann, Principal Architect, Brocade

Presentation Topics
- FC-NVMe Spec and Interoperability Update
- Dual Protocol SANs boost NVMe adoption
- Performance audit: NVMe/FC vs. SCSI/FC

FC-NVMe Spec Status
- Why move to NVMe/FC? It's like SCSI/FC tuned for SSDs and parallelism: simpler, more efficient, and (as we'll see) faster
- The FC-NVMe standards effort is overseen by T11; T11 and INCITS finalized FC-NVMe in early 2018
- Several vendors are shipping GA products
- FCIA plugfest last week: 13 participating companies

Presentation Topics
- FC-NVMe Spec and Interoperability Update
- Dual Protocol SANs boost NVMe adoption
- Performance audit: NVMe/FC vs. SCSI/FC

Dual Protocol SANs boost NVMe adoption
- 80% of today's flash arrays connect via FC; this is where vital data assets live
- High-value assets require protection; storage teams are naturally risk averse, and risk avoidance is part of the job description
- How can storage teams adopt NVMe with low risk? Use familiar infrastructure that speaks both old and new!

Dual Protocol SANs Reduce Risk
- Use existing, familiar, trusted infrastructure: no surprises, no duplication of infrastructure and effort
- Rely on known, established vendors, with shared vocabulary and trusted support models
- Continue to use robust FC fabric services: name services, discovery, zoning, flow control
- Leverage familiar tools and team expertise: no need to start all over from scratch

Dual protocol SANs enable low risk NVMe adoption
[Diagram: Emulex Gen 5 HBAs carrying SCSI (FCP) and NVMe traffic side by side through the same fabric]
- Get NVMe performance benefits while migrating incrementally "as needed"
- Migrate application volumes one by one with easy rollback options
- Interesting dual-protocol use cases
- Full fabric awareness, visibility, and manageability with existing Brocade Fabric Vision technology

Sample Use Case: Extract Value from High-Value Data Assets
Staged analytics on real-world data sets:
- Using near-live data for analytics is gaining popularity as a way to extract more value, but adding traffic loads to live data can impact its performance
- Instead, snapshot data on existing enterprise storage, clone the snapshot to an NVMe NSID, and run high-performance analytics on the same infrastructure
- Works in many dimensions: high-performance analytics, easy to operationalize, leverages current infrastructure
[Diagram: Production SQL DB reaches the array over SCSI over Fibre Channel and NVMe over Fibre Channel through a dual-protocol fabric]

Presentation Topics FC-NVMe Spec and Interoperability Update Dual Protocol SANs boost NVMe adoption Performance audit: NVMe/FC v SCSI/FC19

Summary of Demartek Report
- Purpose: credibly document the performance benefit of NVMe over Fibre Channel (NVMe/FC) relative to SCSI FCP on a vendor target
- Audited by: Demartek – "Performance Benefits of NVMe over Fibre Channel – A New, Parallel, Efficient Protocol"
- Audit date: May 1, 2018; PDF available at www.demartek.com/ModernSAN
- Results of testing both protocols on the same hardware:
  - Up to 58% higher IOPS for NVMe/FC
  - From 11% to 34% lower latency with NVMe/FC
- Note: the audit was *not* intended as a test of max array performance

Results: 4KB Random Reads, full scale and zoomed in
[Charts: NVMe/FC gives 53%/54% higher IOPS with 4KB random read I/Os; the same data with the y-axis expanded shows NVMe/FC provides a minimum 34% drop in latency]

Architected for Performance
NVMe over RoCE
Fazil Osman, Broadcom Classic

What is RoCE?
- Remote Direct Memory Access (RDMA): hardware offload moves data from memory on one CPU to memory of a second CPU without any CPU intervention
- RDMA over Converged Ethernet (RoCE): runs over standard Ethernet (an L2 or L3 network with RoCEv1 or RoCEv2) with very low latencies
- Standard protocol with multivendor support: defined by the IBTA, supported by the leading NIC vendors (Broadcom, Marvell, Mellanox), with proven interoperability at UNH and in customer deployments

Where RoCE fits in with NVMe-oF

NVMe over RoCE IO Model & Commands
[Diagram: NVMe host software exchanges capsules with the NVM subsystem over the RDMA transport. A command capsule of size N bytes carries a 64-byte Submission Queue Entry (bytes 0-63) followed by data or SGLs, if present (bytes 64 to N-1); a response capsule carries a 16-byte Completion Queue Entry (bytes 0-15) followed by data, if present]
- NVMe commands are encapsulated and sent seamlessly over the RoCE transport
- The NVMe multi-queue model is preserved
- Fabrics commands may be submitted on the Admin Queue or on an I/O Queue
- Processing requires minimal intervention of the target CPUs
- NVMe over Fabrics command sets: Fabrics Command Set, Admin Command Set, I/O Command Set (NVM Command Set)
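The capsule layout above can be sketched in a few lines. This is an illustrative helper, not a real driver: the function names are invented, and only the fixed 64-byte SQE and 16-byte CQE sizes come from the diagram.

```python
SQE_SIZE = 64  # fixed Submission Queue Entry size (bytes 0-63 of a command capsule)
CQE_SIZE = 16  # fixed Completion Queue Entry size (bytes 0-15 of a response capsule)

def build_command_capsule(sqe: bytes, payload: bytes = b"") -> bytes:
    """Command capsule of size N: 64-byte SQE followed by data or SGLs, if present."""
    if len(sqe) != SQE_SIZE:
        raise ValueError("SQE must be exactly 64 bytes")
    return sqe + payload

def parse_response_capsule(capsule: bytes) -> tuple:
    """Response capsule: 16-byte CQE followed by data, if present."""
    return capsule[:CQE_SIZE], capsule[CQE_SIZE:]
```

The point of the fixed-size prefix is that the transport can hand the whole capsule to the target in one RDMA operation; the target peels off the SQE and, for small writes, finds the data already in the capsule.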

NVMe over RoCE Advantages
- Ethernet is the converged protocol for the data center
- RoCE is supported by the leading NIC vendors; 80% of RNIC 25G ports shipped in Q1'18 support only RoCE (Crehan)
- Proven interoperability at UNH and in customer deployments
- RoCE is the lowest-latency protocol: sub-5 µs typical end to end
- Very low CPU utilization when running RoCE: bypasses the TCP transport, greatly reducing CPU overhead

Architected for Performance
NVMe-oF Transports - iWARP
Praveen Midha, Marvell Technologies

Agenda
- What is iWARP?
- Why should I care about iWARP?
- How does iWARP perform?
- Any real-world use cases?
- Summary

NVMe-oF Transport Choices
[Diagram: NVMe transports split into the local bus (PCIe) and fabric message transports (FC, TCP, IB, RoCE, iWARP). A user application's I/O passes from the app buffer through the I/O library, a context switch into the OS kernel and device driver, and out via the TCP/IP or RDMA path to the adapter buffer; iWARP runs on an RNIC]
Internet Wide-area RDMA Protocol (iWARP)

RDMA Scalability Comparison
- RoCE: RDMA over IB transport; not routable
- RoCEv2: RDMA over UDP; routable; requires DCB; point-to-point flow control
- RoCEv2 + DCQCN: RDMA over UDP; routable; requires DCB; point-to-point flow control plus end-to-end congestion control
- iWARP: RDMA over TCP; routable; standard Ethernet; end-to-end flow control with TCP congestion avoidance

NVMe-oF Latency – Single I/O
[Charts: NVMe-oF latency comparison, QL41xxx 25GbE iWARP vs. QL41xxx 25GbE RoCE; completion latency (CLAT, µs) across block sizes from 1KB to 128KB (1 disk / 1 job / 1 depth), and across the 99th to 99.95th percentiles of I/O operations (1 disk / 1 job / 4KB reads)]

Storage Spaces Direct (S2D) – Hyper-Converged
[Diagram: RoCEv2 or iWARP, as a function of cluster size]

S2D Performance – iWARP

Summary - iWARP
- iWARP is one of several transport choices for deploying NVMe-oF
- Wide Area Networks are supported
- Assumes standard Ethernet – no DCB!
- Reliable connected communication provided by the congestion-aware TCP protocol
- Performs as well as RoCE/RoCEv2

Architected for Performance
NVMe over TCP
J Metz, Cisco

What's Special About NVMe-oF: Bindings
What is a binding? "A specification of reliable delivery of data, commands, and responses between a host and an NVM subsystem for an NVMe Transport. The binding may exclude or restrict functionality based on the NVMe Transport's capabilities."
I.e., it's the "glue" that links all the pieces above and below. Examples:
- SGL descriptions
- Data placement restrictions
- Data transport capabilities
- Authentication capabilities

NVMe/TCP in a Nutshell
- NVMe-oF commands are sent over standard TCP/IP sockets
- Each NVMe queue pair is mapped to a TCP connection
- TCP provides a reliable transport layer for the NVMe queueing model
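The queue-pair-to-connection mapping can be sketched with plain sockets. This is a toy illustration, not a real host driver: the class name and the framing are invented, and only port 4420 (the IANA-assigned NVMe-oF port) comes from the standards.

```python
import socket

class NvmeTcpQueuePair:
    """Toy sketch (hypothetical helper): one TCP connection per NVMe queue pair.

    TCP's in-order, reliable byte stream is what lets the NVMe submission/
    completion queue model ride on an ordinary socket.
    """

    def __init__(self, host: str, port: int = 4420):  # 4420: IANA NVMe-oF port
        self.sock = socket.create_connection((host, port))

    def submit(self, capsule: bytes) -> None:
        # Capsules for this queue pair are serialized onto its own connection;
        # other queue pairs use their own sockets, so queues never interleave.
        self.sock.sendall(capsule)

    def close(self) -> None:
        self.sock.close()
```

Because each queue pair owns a connection, per-queue ordering falls out of TCP for free, and parallelism comes from opening more connections rather than multiplexing one.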

NVMe/TCP Data Path Usage
Enables NVMe-oF I/O operations in existing IP datacenter environments:
- Software-only NVMe host driver with the NVMe/TCP transport
- Provides an NVMe-oF alternative to iSCSI for storage systems with PCIe NVMe SSDs
- More efficient end-to-end NVMe operations by eliminating SCSI-to-NVMe translations
- Co-exists with other NVMe-oF transports; transport selection may be based on h/w support and/or policy

NVMe/TCP Control Path Usage
Enables use of NVMe-oF on control path networks (example: 1G Ethernet):
- Discovery service usage: discovery controllers residing on a common control network that is separate from the data path networks
- NVMe-MI usage: NVMe-MI endpoints on control processors (BMC, etc.) with simple IP network stacks, on a separate control network
Source: Dave Minturn (Intel)

How NVMe/TCP Works
- TCP accepts data in the form of a data stream and breaks the stream into units
- A TCP header is added to a unit, creating a TCP segment
- A segment is then encapsulated in an Internet Protocol (IP) datagram, creating a TCP/IP packet
- Multiple NVMe/TCP data units can share a single TCP/IP packet, and a single NVMe/TCP data unit can span multiple TCP/IP packets
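Because NVMe/TCP data units and TCP packets do not align one-to-one, each unit carries its own length so the receiver can re-slice the byte stream. A minimal sketch, assuming a simplified 8-byte header (type, flags, header length, data offset, total length) loosely modeled on the NVMe/TCP common header; the field details here are illustrative, not the exact spec layout:

```python
import struct

HDR = struct.Struct("<BBBBI")  # type, flags, header len, data offset, total len

def frame(pdu_type: int, data: bytes = b"") -> bytes:
    """Prefix a data unit with a fixed header whose last field is total length."""
    pdo = HDR.size if data else 0  # where the data begins, 0 if no data
    return HDR.pack(pdu_type, 0, HDR.size, pdo, HDR.size + len(data)) + data

def deframe(stream: bytes) -> list:
    """Receiver side: walk the byte stream, using the length field to recover
    unit boundaries regardless of how TCP segmented the stream."""
    units = []
    while stream:
        total = HDR.unpack_from(stream)[4]   # total length of this unit
        units.append(stream[HDR.size:total]) # payload only
        stream = stream[total:]
    return units
```

This is why NVMe/TCP needs no packet-level alignment: the units are delimited in the stream itself, and the receiver reassembles them no matter where TCP cut its segments.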

NVMe/TCP Message Model
- An NVMe/TCP connection is associated with a single Admin or I/O SQ/CQ pair – no spanning across queues or across TCP connections!
- Data transfers supported by:
  - Fabric-specific data transfer mechanism
  - In-capsule data (optional), which allows for variable capsule sizes
- All NVMe/TCP implementations support data transfers using command data buffers

Potential Issues With NVMe/TCP
- Absolute latency higher than RDMA? Only matters if the application cares about latency.
- Head-of-line blocking leading to increased latency? The protocol breaks up large transfers.
- Delayed ACKs could increase latency? ACKs are used to pace the transmission of packets such that TCP is "self-clocking."
- Incast could be an issue? The switching network can provide Approximate Fair Drop (AFD) for active switch queue management, and Dynamic Packet Prioritization (DPP) to ensure incast flows are serviced as fast as possible.
- Lack of hardware acceleration? Not an issue for NVMe/TCP use cases.

NVMe/TCP Standardization
- Expect the NVMe over TCP standard to be ratified in 2H 2018
- The NVMe-oF 1.1 TCP ballot passed in April 2017
- The NVMe Workgroup is adding TCP to the spec alongside RDMA


Contact Information
For more information please contact the following:
- Brandon Hoff – brandon.hoff@broadcom.com
- Curt Beckmann – curt.beckmann@broadcom.com
- Fazil Osman – fazil.osman@broadcom.com
- Praveen Midha – Praveen.Midha@cavium.com
- J Metz – jmmetz@cisco.com, @drjmetz

Architected for Performance
