Introduction To InfiniBand For End Users


Introduction to InfiniBand for End Users
Industry-Standard Value and Performance for High Performance Computing and the Enterprise
Paul Grun
InfiniBand Trade Association

Copyright 2010 InfiniBand Trade Association.
Other names and brands are properties of their respective owners.

Contents

Acknowledgements
About the Author
Introduction
Chapter 1 – Basic Concepts
Chapter 2 – InfiniBand for HPC
  InfiniBand MPI Clusters
  Storage and HPC
  Upper Layer Protocols for HPC
Chapter 3 – InfiniBand for the Enterprise
  Devoting Server Resources to Application Processing
  A Flexible Server Architecture
  Investment Protection
  Cloud Computing – An Emerging Data Center Model
  InfiniBand in the Distributed Enterprise
Chapter 4 – Designing with InfiniBand
  Building on Top of the Verbs
Chapter 5 – InfiniBand Architecture and Features
  Address Translation
  The InfiniBand Transport
  InfiniBand Link Layer Considerations
  Management and Services
  InfiniBand Management Architecture
Chapter 6 – Achieving an Interoperable Solution
  An Interoperable Software Solution
  Ensuring Hardware Interoperability
Chapter 7 – InfiniBand Performance Capabilities and Examples
Chapter 8 – Into the Future

Acknowledgements

Special thanks to Gilad Shainer at Mellanox for his work and content related to Chapter 7 – InfiniBand Performance Capabilities and Examples.

Thanks to the following individuals for their contributions to this publication: David Southwell, Ariel Cohen, Rupert Dance, Jay Diepenbrock, Jim Ryan, Cheri Winterberg and Brian Sparks.

About the Author

Paul Grun has been working on I/O for mainframes and servers for more than 30 years and has been involved in high performance networks and RDMA technology since before the inception of InfiniBand. As a member of the InfiniBand Trade Association, Paul served on the Link Working Group during the development of the InfiniBand Architecture; he is a member of the IBTA Steering Committee and is a past chair of the Technical Working Group. He recently served as chair of the IBXoE Working Group, charged with developing the new RDMA over Converged Ethernet (RoCE) specification. He is currently chief scientist for System Fabric Works, Inc., a consulting and professional services company dedicated to delivering RDMA and storage solutions for high performance computing, commercial enterprise and cloud computing systems.

Introduction

InfiniBand is not complex. Despite its reputation as an exotic technology, the concepts behind it are surprisingly straightforward. One purpose of this book is to clearly describe the basic concepts behind the InfiniBand Architecture.

Understanding the basic concepts is all well and good, but not very useful unless those concepts deliver a meaningful and significant value to the user of the technology. And that is the real purpose of this book: to draw straight-line connections between the basic concepts behind the InfiniBand Architecture and how those concepts deliver real value to the enterprise data center and to high performance computing (HPC).

The readers of this book are exquisitely knowledgeable in their respective fields of endeavor, whether those endeavors are installing and maintaining a large HPC cluster, overseeing the design and development of an enterprise data center, deploying servers and networks, or building devices/appliances targeted at solving a specific problem requiring computers. But most readers of this book are unlikely to be experts in the arcane details of networking architecture. Although the working title for this book was “InfiniBand for Dummies,” its readers are anything but dummies. This book is designed to give its readers, in one short sitting, a view into the InfiniBand Architecture that they probably could not get by reading the specification itself.

The point is that you should not have to hold an advanced degree in obscure networking technologies in order to understand how InfiniBand technology can bring benefits to your chosen field of endeavor and to understand what those benefits are. This minibook is written for you. At the end of the next hour you will be able to clearly see how InfiniBand can solve problems you are facing today.

By the same token, this book is not a detailed technical treatise of the underlying theory, nor does it provide a tutorial on deploying the InfiniBand Architecture.
All we can hope for in this short book is to bring a level of enlightenment about this exciting technology. The best measure of our success is if you, the reader, feel motivated after reading this book to learn more about how to deploy the InfiniBand Architecture in your computing environment.

We will avoid the low-level bits and bytes of the elegant architecture (which are complex) and focus on the concepts and applications that are visible to the user of InfiniBand technology. When you decide to move forward with deploying InfiniBand, there is ample assistance available in the market to help you deploy it, just as there is for any other networking technology like traditional TCP/IP/Ethernet. The InfiniBand Trade Association and its website (www.infinibandta.org) are great sources of access to those with deep experience in developing, deploying and using the InfiniBand Architecture.

The InfiniBand Architecture emerged in 1999 as the joining of two competing proposals known as Next Generation I/O and Future I/O. These proposals, and the InfiniBand Architecture that resulted from their merger, are all rooted in the Virtual Interface Architecture (VIA). The Virtual Interface Architecture is based on two synergistic concepts: direct access to a network interface (e.g. a NIC) straight from application space, and an ability for applications to exchange data directly between their respective virtual buffers across a network, all without involving the operating system directly in the address translation and networking processes needed to do so. This is the notion of “Channel I/O” – the creation of a “virtual channel” directly connecting two applications that exist in entirely separate address spaces. We will come back to a detailed description of these key concepts in the next few chapters.

InfiniBand is often compared, and not improperly, to a traditional network such as TCP/IP/Ethernet. In some respects it is a fair comparison, but in many other respects the comparison is wildly off the mark. It is true that InfiniBand is based on networking concepts and includes the usual “layers” found in a traditional network, but beyond that there are more differences between InfiniBand and an IP network than there are similarities. The key is that InfiniBand provides a messaging service that applications can access directly. The messaging service can be used for storage, for InterProcess Communication (IPC) or for a host of other purposes: anything that requires an application to communicate with others in its environment. The key benefits that InfiniBand delivers accrue from the way that the InfiniBand messaging service is presented to the application, and the underlying technology used to transport and deliver those messages.
This is much different from TCP/IP/Ethernet, which is a byte-stream oriented transport for conducting bytes of information between sockets applications.

The first chapter gives a high-level view of the InfiniBand Architecture and reviews the basic concepts on which it is founded. The next two chapters relate those basic concepts to value propositions which are important to users in the HPC community and in the commercial enterprise. Each of those communities has unique needs, but both benefit from the advantages InfiniBand can bring. Chapter 4 addresses InfiniBand’s software architecture and describes the open source software stacks that are available to ease and simplify the deployment of InfiniBand. Chapter 5 delves slightly deeper into the nuances of some of the key elements of the architecture. Chapter 6 describes the efforts undertaken by both the InfiniBand Trade Association and the OpenFabrics Alliance (www.openfabrics.org) to ensure that solutions from a range of vendors will interoperate and that they will behave in conformance with the specification. Chapter 7 describes some comparative performance proof points, and finally Chapter 8 briefly reviews the ways in which the InfiniBand Trade Association ensures that the InfiniBand Architecture continues to move ahead into the future.

Chapter 1 – Basic Concepts

Networks are often thought of as the set of routers, switches and cables that are plugged into servers and storage devices. When asked, most people would probably say that a network is used to connect servers to other servers, storage and a network backbone. It does seem that traditional networking generally starts with a “bottom-up” view, with much attention focused on the underlying wires and switches. This is a very “network centric” view of the world; the way that an application communicates is driven partly by the nature of the traffic. This is what has given rise to the multiple dedicated fabrics found in many data centers today.

A server generally provides a selection of communication services to the applications it supports: one for storage, one for networking and frequently a third for specialized IPC traffic. This complement of communications stacks (storage, networking, etc.) and the device adapters accompanying each one comprise a shared resource. This means that the operating system “owns” these resources and makes them available to applications on an as-needed basis. An application, in turn, relies on the operating system to provide it with the communication services it needs. To use one of these services, the application uses some sort of interface or API to make a request to the operating system, which conducts the transaction on behalf of the application.

InfiniBand, on the other hand, begins with a distinctly “application centric” view by asking a profoundly simple question: how can application accesses to other applications and to storage be made as simple, efficient and direct as possible? If one begins to look at the I/O problem from this application centric perspective, one comes up with a much different approach to network architecture.

The basic idea behind InfiniBand is simple: it provides applications with an easy-to-use messaging service.
This service can be used to communicate with other applications or processes or to access storage. That’s the whole idea. Instead of making a request to the operating system for access to one of the server’s communication resources, an application accesses the InfiniBand messaging service directly. Since the service provided is a straightforward messaging service, the complex dance between an application and a traditional network is eliminated. Since storage traffic can be viewed as consisting of control messages and data messages, the same messaging service is equally at home for use in storage applications.

A key to this approach is that the InfiniBand Architecture gives every application direct access to the messaging service. Direct access means that an application need not rely on the operating system to transfer messages. This idea is in contrast to a standard network environment where the shared network resources (e.g. the TCP/IP network stack and associated NICs) are solely owned by the operating system and cannot be accessed by the user application. This means that an application cannot have direct access to the network and instead must rely on the involvement of the operating system to move data from the application’s virtual buffer space, through the network stack and out onto the wire. There is an identical involvement of the operating system at the other end of the connection. InfiniBand avoids this through a technique known as stack bypass. How it does this, while ensuring the same degree of application isolation and protection that an operating system would provide, is the one key piece of secret sauce that underlies the InfiniBand Architecture.

InfiniBand provides the messaging service by creating a channel connecting an application to any other application or service with which the application needs to communicate. The applications using the service can be either user space applications or kernel applications such as a file system. This application-centric approach to the computing problem is the key differentiator between InfiniBand and traditional networks. You could think of it as a “top-down” approach to architecting a networking solution.
Everything else in the InfiniBand Architecture is there to support this one simple goal: provide a message service to be used by an application to communicate directly with another application or with storage.

The challenge for InfiniBand’s designers was to create these channels between virtual address spaces which would be capable of carrying messages of varying sizes, and to ensure that the channels are isolated and protected. These channels would need to serve as pipes or connections between entirely disjoint virtual address spaces. In fact, the two virtual spaces might even be located in entirely disjoint physical address spaces – in other words, hosted by different servers even over a distance.

[Figure 1 diagram; labels recovered from the original: App, Buf, OS and NIC at each end of the channel]

Figure 1: InfiniBand creates a channel directly connecting an application in its virtual address space to an application in another virtual address space. The two applications can be in disjoint physical address spaces – hosted by different servers.

It is convenient to give a name to the endpoints of the channel – we’ll call them Queue Pairs (QPs); each QP consists of a Send Queue and a Receive Queue, and each QP represents one end of a channel. If an application requires more than one connection, more QPs are created. The QPs are the structure by which an application accesses InfiniBand’s messaging service. In order to avoid involving the OS, the applications at each end of the channel must have direct access to these QPs. This is accomplished by mapping the QPs directly into each application’s virtual address space. Thus, the application at each end of the connection has direct, virtual access to the channel connecting it to the application (or storage) at the other end of the channel. This is the notion of Channel I/O.

Taken altogether then, this is the essence of the InfiniBand Architecture. It creates private, protected channels between two disjoint virtual address spaces, it provides a channel endpoint, called a QP, to the applications at each end of the channel, and it provides a means for a local application to transfer messages directly between applications residing in those disjoint virtual address spaces. Channel I/O in a nutshell.

Having established a channel, and having created a virtual endpoint to the channel, there is one further architectural nuance needed to complete the channel I/O picture and that is the actual method for transferring a message. InfiniBand provides two transfer semantics: a channel semantic, sometimes called SEND/RECEIVE, and a pair of memory semantics called RDMA READ and RDMA WRITE.
When using the channel semantic, the message is received in a data structure provided by the application on the receiving side. This data structure was pre-posted on its receive queue. Thus, the sending side does not have visibility into the buffers or data structures on the receiving side; instead, it simply SENDS the message and the receiving application RECEIVES the message.
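The channel semantic described above can be sketched as a toy model. This is a minimal illustration in plain Python, not the InfiniBand verbs API: the class `QueuePair`, its methods `post_receive`, `post_send` and `process`, and the helper `connect` are hypothetical names invented for this sketch, and `process` merely stands in for work that real hardware performs.

```python
from collections import deque


class QueuePair:
    """Toy channel endpoint: a Send Queue plus a Receive Queue (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.send_queue = deque()     # messages waiting to be sent
        self.receive_queue = deque()  # buffers pre-posted by the application
        self.peer = None              # the QP at the other end of the channel

    def post_receive(self, buffer):
        """Pre-post an application-owned buffer that will hold one incoming message."""
        self.receive_queue.append(buffer)

    def post_send(self, message):
        """Queue one complete message for transmission."""
        self.send_queue.append(bytes(message))

    def process(self):
        """Stand-in for the hardware: deliver each queued message to the peer."""
        delivered = 0
        while self.send_queue:
            if not self.peer.receive_queue:
                raise RuntimeError("receiver has no pre-posted buffer")
            message = self.send_queue.popleft()
            buffer = self.peer.receive_queue.popleft()
            if len(message) > len(buffer):
                raise ValueError("message does not fit the posted buffer")
            buffer[: len(message)] = message
            delivered += 1
        return delivered


def connect(a, b):
    """Join two QPs into one channel."""
    a.peer, b.peer = b, a


sender = QueuePair("sender")
receiver = QueuePair("receiver")
connect(sender, receiver)

inbox = bytearray(16)         # receive buffer owned by the receiving application
receiver.post_receive(inbox)  # pre-posted before the message arrives
sender.post_send(b"hello")    # the sender never refers to `inbox`
delivered = sender.process()
```

Note that the sending side names only the message; which buffer it lands in is decided entirely by what the receiver pre-posted.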

The memory semantic is somewhat different; in this case the receiving side application registers a buffer in its virtual memory space. It passes control of that buffer to the sending side, which then uses RDMA READ or RDMA WRITE operations to either read or write the data in that buffer.

A typical storage operation may illustrate the difference. The “initiator” wishes to store a block of data. To do so, it places the block of data in a buffer in its virtual address space and uses a SEND operation to send a storage request to the “target” (e.g. the storage device). The target, in turn, uses RDMA READ operations to fetch the block of data from the initiator’s virtual buffer. Once it has completed the operation, the target uses a SEND operation to return ending status to the initiator. Notice that the initiator, having requested service from the target, was free to go about its other business while the target asynchronously completed the storage operation, notifying the initiator on completion. The data transfer phase and the storage operation, of course, comprise the bulk of the operation.

Included as part of the InfiniBand Architecture and located right above the transport layer is the software transport interface. The software transport interface contains the QPs; keep in mind that the queue pairs are the structure by which the RDMA message transport service is accessed. The software transport interface also defines all the methods and mechanisms that an application needs to take full advantage of the RDMA message transport service. For example, the software transport interface describes the methods that applications use to establish a channel between them.
An implementation of the software transport interface includes the APIs and libraries needed by an application to create and control the channel and to use the QPs in order to transfer messages.

[Figure 2 diagram; legible labels, top to bottom: Application, S/W (remainder of label garbled in extraction), RDMA Message Transport Service, Network, Link, Physical]

Figure 2: The InfiniBand Architecture provides an easy-to-use messaging service to applications. The messaging service includes a communications stack similar to the familiar OSI reference model.
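The storage operation described earlier (a SEND carrying the request, an RDMA READ pulling the data, a SEND returning status) can be walked through with a small toy model. This is only a sketch under stated assumptions: `Node`, `register`, `send` and `rdma_read` are hypothetical names, the request fields (`op`, `lba`, `key`, `length`) are invented for illustration, and a Python dictionary plays the role of the storage medium. None of it is the real verbs API.

```python
class Node:
    """Toy end node with registered memory regions (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.regions = {}   # key -> registered buffer
        self.inbox = []     # messages received via SEND
        self._next_key = 1

    def register(self, buffer):
        """Register a buffer and return a key a remote peer may use to reach it."""
        key = self._next_key
        self._next_key += 1
        self.regions[key] = buffer
        return key


def send(dst, message):
    """Channel semantic: deliver a message into the destination's inbox."""
    dst.inbox.append(message)


def rdma_read(remote, key, length):
    """Memory semantic: fetch bytes from a buffer the remote side registered."""
    return bytes(remote.regions[key][:length])


initiator = Node("initiator")
target = Node("target")
disk = {}  # stands in for the storage medium

# 1. The initiator places the block in its own buffer and registers it.
block = bytearray(b"block-of-data")
key = initiator.register(block)

# 2. The initiator SENDs a storage request that describes the buffer.
send(target, {"op": "write", "lba": 7, "key": key, "length": len(block)})

# 3. The target pulls the data with RDMA READ, then stores it.
request = target.inbox.pop(0)
disk[request["lba"]] = rdma_read(initiator, request["key"], request["length"])

# 4. The target SENDs ending status back to the initiator.
send(initiator, {"status": "ok", "lba": request["lba"]})
```

Between steps 2 and 4 the initiator does nothing at all in this sketch, which mirrors the point made in the text: the target drives the data transfer, and the initiator only learns the outcome from the final status message.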

Underneath the covers of the messaging service all this still requires a complete network stack just as you would find in any traditional network. It includes the InfiniBand transport layer to provide reliability and delivery guarantees (similar to the TCP transport in an IP network), a network layer (like the IP layer in a traditional network) and link and physical layers (wires and switches). But it’s a special kind of a network stack because it has features that make it simple to transport messages directly between applications’ virtual memory, even if the applications are “remote” with respect to each other. Hence, the combination of InfiniBand’s transport layer together with the software transport interface is better thought of as a Remote Direct Memory Access (RDMA) message transport service. The entire stack taken together, including the software transport interface, comprises the InfiniBand messaging service.

How does the application actually transfer a message? The InfiniBand Architecture provides simple mechanisms defined in the software transport interface for placing a request to perform a message transfer on a queue. This queue is the QP, representing the channel endpoint. The request is called a Work Request (WR) and represents a single quantum of work that the application wants to perform. A typical WR, for example, describes a message that the application wishes to have transported to another application.

The notion of a WR as representing a message is another of the key distinctions between InfiniBand and a traditional network; the InfiniBand Architecture is said to be message-oriented. A message can be any size up to 2^31 bytes. This means that the entire architecture is oriented around moving messages.
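As a rough illustration of message orientation: the application hands over one whole message, and the transport, not the application, carries it as a series of packets and rebuilds it on the far side. The sketch below assumes a 4096-byte packet payload and uses hypothetical helper names (`segment`, `reassemble`); real hardware also adds headers, sequencing and acknowledgements, all of which are omitted here.

```python
def segment(message: bytes, mtu: int) -> list[bytes]:
    """Split one message into packet payloads of at most `mtu` bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # In this sketch a zero-length message still travels as one empty packet.
    return [message[i : i + mtu] for i in range(0, len(message), mtu)] or [b""]


def reassemble(packets: list[bytes]) -> bytes:
    """Rebuild the message from in-order packet payloads."""
    return b"".join(packets)


MAX_MESSAGE_BYTES = 2**31             # upper bound on message size stated in the text

message = bytes(range(256)) * 40      # one 10,240-byte message
packets = segment(message, mtu=4096)  # 4096 is an assumed payload size
restored = reassemble(packets)
```

With these numbers the message travels as two full packets and one partial packet, yet the application on each side deals with a single 10,240-byte message.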
This messaging service is distinctly different from the service provided by traditional networks such as TCP/IP, which are adept at moving strings of bytes from the operating system in one node to the operating system in another node. As the name suggests, a byte stream-oriented network presents a stream of bytes to an application. Each time an Ethernet packet arrives, the server NIC hardware places the bytes comprising the packet into an anonymous buffer in main memory belonging to the operating system. Once the stream of bytes has been received into the operating system’s buffer, the TCP/IP network stack signals the application to request a buffer from the application into which the bytes can be placed. This process is repeated each time a packet arrives until the entire message is eventually received.

InfiniBand, on the other hand, delivers a complete message to an application. Once an application has requested transport of a message, the InfiniBand hardware automatically segments the outbound message into a number of packets; the packet size is chosen to optimize the available network bandwidth. Packets are transmitted through the network by the InfiniBand hardware and at the receiving end they are delivered directly into the receiving application’s virtual buffer where they are re-assembled into a complete message. Once the entire message has been received the receiving application is notified. Neither the sending nor the receiving application is involved until the complete message is delivered to the receiving application’s virtual buffer.

How, exactly, does a sending application send a message? The InfiniBand software transport interface specification defines the concept of a verb. The word “verb” was chosen since a verb describes action, and this is exactly how an application requests action from the messaging service; it uses a verb. The set of verbs, taken together, is simply a semantic description of the methods the application uses to request service from the RDMA message transport service. For example, “Post Send Request” is a commonly used verb to request transmission of a message to another application.

The verbs are fully defined by the specification, and they are the basis for specifying the APIs that an application uses. But the InfiniBand Architecture specification doesn’t define the actual APIs; that is left to other organizations such as the OpenFabrics Alliance, which provides a complete set of open source APIs and software which implements the verbs and works seamlessly with the InfiniBand hardware. A richer description of the concept of “verbs,” and how one designs a system using InfiniBand and the verbs, is contained in a later chapter.

The InfiniBand Architecture defines a full set of hardware components necessary to deploy the architecture. Those components are:

HCA – Host Channel Adapter. An HCA is the point at which an InfiniBand end node, such as a server or storage device, connects to the InfiniBand network. InfiniBand’s architects went to significant lengths to ensure that the architecture would support a range of possible implementations, thus the specification does not require that particular HCA functions be implemented in hardware, firmware or software.
Regardless, the collection of hardware, software and firmware that represents the HCA’s functions provides the applications with full access to the network resources. The HCA contains address translation mechanisms under the control of the operating system that allow an application to access the HCA directly. In effect, the Queue Pairs that the application uses to access the InfiniBand hardware appear directly in the application’s virtual address space. The same address translation mechanism is the means by which an HCA accesses memory on behalf of a user level application. The application, as usual, refers to virtual addresses; the HCA has the ability to translate these into physical addresses in order to effect the actual message transfer.

TCA – Target Channel Adapter. This is a specialized version of a channel adapter intended for use in an embedded environment such as a storage appliance. Such an appliance may be based on an embedded operating system, or may be entirely based on state machine logic and therefore may not require a standard interface for applications. The concept of a TCA is not widely deployed today; instead most I/O devices are implemented using standard server motherboards controlled by standard operating systems.

Switches – An InfiniBand Architecture switch is conceptually similar to any other standard networking switch, but molded to meet InfiniBand’s performance and cost targets. For example, InfiniBand switches are designed to be “cut through” for performance and cost reasons and they implement InfiniBand’s link layer flow control protocol to avoid dropped packets. This is a subtle, but key element of InfiniBand since it means that packets are never dropped in the network during normal operation. This “no drop” behavior is central to the operation of InfiniBand’s highly efficient transport protocol.

Routers – Although not currently in wide deployment, an InfiniBand router is intended to be used to segment a very large network into smaller subnets connected together by an InfiniBand router. Since InfiniBand’s management architecture is defined on a per subnet basis, using an InfiniBand router allows a large network to be partitioned into a number of smaller subnets, thus enabling the deployment of InfiniBand networks that can be scaled to very large sizes without the adverse performance impacts due to the need to route management traffic throughout the entire network. In addition to this traditional use, specialized InfiniBand routers have also been deployed in a unique way as range extenders to interconnect two segments of an InfiniBand subnet that is distributed across wide geographic distances.

Cables and Connectors – Volume 2 of the Architecture Specification is devoted to the physical and electrical characteristics of InfiniBand and defines, among other things, the characteristics and specifications for InfiniBand cables, both copper and optical.
This has enabled vendors to develop and offer for sale a wide range of both copper and optical cables in a broad range of widths (4x, 12x) and speed grades (SDR, DDR, QDR).

That’s it for the basic concepts underlying InfiniBand. The InfiniBand Architecture brings a broad range of value to its users, ranging from ultra-low latency for clustering to extreme flexibility in application deployment in the data center to dramatic reductions in energy consumption and all at very low price/performance points. Figuring out which value propositions are relevant to which situations depends entirely on the destined use.

The next two chapters describe how these basic concepts map onto particular value propositions in two important market segments. Chapter 2 describes how InfiniBand relates to High Performance Computing (HPC) clusters, and Chapter 3 does the same thing for enterprise data centers. As we will see, InfiniBand’s basic concepts can help solve a range of problems, regardless of whether you are trying to solve a sticky performance problem or your data center is suffering from severe space, power or cooling constraints. A wise choice of an appropriate interconnect can profoundly impact the solution.
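The link widths and speed grades mentioned for cables in this chapter combine into a link rate by simple arithmetic. The sketch below is a worked example under stated assumptions: the per-lane signaling rates (2.5, 5 and 10 Gbit/s for SDR, DDR and QDR) and the 8b/10b line encoding come from the public InfiniBand specification rather than from this book, and the function name `link_rates` is made up for illustration.

```python
# Per-lane signaling rates in Gbit/s (assumed from the public InfiniBand
# specification; these figures are not stated in this book).
LANE_SIGNALING_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def link_rates(width: int, grade: str) -> tuple[float, float]:
    """Return (signaling rate, data rate) in Gbit/s for a link.

    The data rate accounts for 8b/10b encoding, in which every 10 bits
    on the wire carry 8 bits of data.
    """
    signaling = width * LANE_SIGNALING_GBPS[grade]
    data = signaling * 8 / 10
    return signaling, data


for width in (4, 12):
    for grade in ("SDR", "DDR", "QDR"):
        signaling, data = link_rates(width, grade)
        print(f"{width}x {grade}: {signaling:g} Gbit/s signaling, {data:g} Gbit/s data")
```

Under these assumptions a 4x QDR link signals at 40 Gbit/s and carries 32 Gbit/s of data, which is why such links are commonly quoted at either figure.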

Chapter 2 – InfiniBand for HPC

In this chapter we will explore how InfiniBand’s fundamental characteristics map onto addressing the problems being faced today in high performance computing environments.

High performance computing is not a monolithic field; it covers the gamut from the largest “Top 500” class supercomputers down to small desktop clusters. For our purposes, we will generally categorize HPC as being that class of systems where nearly all of the available compute capacity is devoted to solving a single large problem for substantial periods of time. Looked at another way, HPC systems generally don’t run traditional enterprise applications such as mail, accounting or productivity applications.

Some examples of HPC applications include atmospheric modeling, genomics research, automotive crash test simulations, oil and gas extraction models and fluid dynamics.

HPC systems rely on a combination of high-performance storage and low-latency InterProcess Communication (IPC) to deliver performance and scalability to scientific applications. This chapter focuses on how the InfiniBand Architecture’s characteristics make it uniquely suited for HPC. With respect to high performance computing, InfiniBand’s low-latency/high-bandwidth performance and the nature of a channel architecture are crucially important.

Ultra-low latency for: Scalability C
