Forwarding Metamorphosis: Fast Programmable Match-Action .

Transcription

Forwarding Metamorphosis: Fast ProgrammableMatch-Action Processing in Hardware for SDN(Extended Version)Pat Bosshart† , Glen Gibb‡ , Hun-Seok Kim† , George Varghese§ , Nick McKeown‡ ,Martin Izzard† , Fernando Mujica† , Mark Horowitz‡†Texas Instruments‡Stanford University§Microsoft Researchpat.bosshart@gmail.com {grg, nickm, horowitz}@stanford.eduvarghese@microsoft.com {hkim, izzard, fmujica}@ti.comABSTRACTrouting and forwarding within the network remain aconfusing conglomerate of routing protocols (e.g., BGP,ICMP, MPLS) and forwarding behaviors (e.g., routers,bridges, firewalls), and the control and forwarding planesremain intertwined inside closed, vertically integratedboxes.Software-defined networking (SDN) took a key step inabstracting network functions by separating the roles ofthe control and forwarding planes via an open interfacebetween them (e.g., OpenFlow [27]). The control planeis lifted up and out of the switch, placing it in externalsoftware. This programmatic control of the forwarding plane allows network owners to add new functionality to their network, while replicating the behavior ofexisting protocols. OpenFlow has become quite wellknown as an interface between the control plane andthe forwarding plane based on the approach known as“Match-Action”. Roughly, a subset of packet bytes arematched against a table; the matched entry specifies acorresponding action(s) that are applied to the packet.One can imagine implementing Match-Action in software on a general purpose CPU. But for the speeds weare interested in—about 1 Tb/s today—we need theparallelism of dedicated hardware. Switching chips haveremained two orders of magnitude faster at switchingthan CPUs for a decade, and an order of magnitudefaster than network processors, and the trend is unlikely to change. We therefore need to think throughhow to implement Match-Action in hardware to exploitpipelining and parallelism, while living within the constraints of on-chip table memories.There is a natural tradeoff between programmabilityand speed. Today, supporting new features frequentlyrequires replacing the hardware. If Match-Action hardware permitted (just) enough reconfiguration in the fieldso that new types of packet processing could be supported at run-time, then it would change how we thinkof programming the network. The real question hereis whether it can be done at reasonable cost withoutsacrificing speed.In Software Defined Networking (SDN) the control planeis physically separate from the forwarding plane. Controlsoftware programs the forwarding plane (e.g., switches androuters) using an open interface, such as OpenFlow. Thispaper aims to overcomes two limitations in current switching chips and the OpenFlow protocol: i) current hardwareswitches are quite rigid, allowing “Match-Action” processing on only a fixed set of fields, and ii) the OpenFlow specification only defines a limited repertoire of packet processing actions. We propose the RMT (reconfigurable match tables) model, a new RISC-inspired pipelined architecture forswitching chips, and we identify the essential minimal setof action primitives to specify how headers are processed inhardware. RMT allows the forwarding plane to be changedin the field without modifying hardware. As in OpenFlow,the programmer can specify multiple match tables of arbitrary width and depth, subject only to an overall resourcelimit, with each table configurable for matching on arbitraryfields. However, RMT allows the programmer to modifyall header fields much more comprehensively than in OpenFlow. Our paper describes the design of a 64 port by 10 Gb/sswitch chip implementing the RMT model. Our concrete design demonstrates, contrary to concerns within the community, that flexible OpenFlow hardware switch implementations are feasible at almost no additional cost or power.1.INTRODUCTIONTo improve is to change; to be perfect is tochange often.— ChurchillGood abstractions—such as virtual memory and timesharing—are paramount in computer systems becausethey allow systems to deal with change and allow simplicity of programming at the next higher layer. Networking has progressed because of key abstractions: TCPprovides the abstraction of connected queues betweenendpoints, and IP provides a simple datagram abstraction from an endpoint to the network edge. However,1

Single Match Table: The simplest approach is toabstract matching semantics in what we call the SMT(Single Match Table) model. In SMT, the controllertells the switch to match any set of packet header fieldsagainst entries in a single match table. SMT assumesthat a parser locates and extracts the correct headerfields to match against the table. For example, an Ethernet packet may have an optional MPLS tag, whichmeans the IP header can be in two different locations.The match is a binary exact match when all fields arecompletely specified, and is ternary for matches wheresome bits are switched off (wildcard entries). Superficially, the SMT abstraction is good for both programmers (what could be simpler than a single match?) andimplementers (SMT can be implemented using a wideTernary Content Addressable Memory (TCAM)). Notethat the forwarding data plane abstraction has the mostrigorous hardware implementation constraints becauseforwarding is often required to operate at about 1 Tb/s.A closer look, however, shows that the SMT modelis costly in use because of a classic problem. The table needs to store every combination of headers; thisis wasteful if the header behaviors are orthogonal (theentries will have many wildcard bits). It can be evenmore wasteful if one header match affects another, forexample if a match on the first header determines a disjoint set of values to match on the second header (e.g.,in a virtual router [10]), requiring the table to hold theCartesian-product of both.Multiple Match Tables: MMT (Multiple MatchTables) is a natural refinement of the SMT model. MMTgoes beyond SMT in an important way: it allows multiple smaller match tables to be matched by a subsetof packet fields. The match tables are arranged into apipeline of stages; processing at stage j can be made todepend on processing from stage i j by stage i modifying the packet headers or other information passedto stage j. MMT is easy to implement using a set ofnarrower tables in each stage; in fact, it is close enoughto the way existing switch chips are implemented tomake it easy to map onto existing pipelines [2,14,23,28].Google reports converting their entire private WAN tothis approach using merchant switch chips [13].The OpenFlow specification transitioned to the MMTmodel [31] but does not mandate the width, depth, oreven the number of tables, leaving implementors freeto choose their multiple tables as they wish. While anumber of fields have been standardized (e.g., IP andEthernet fields), OpenFlow allows the introduction ofnew match fields through a user-defined field facility.Existing switch chips implement a small (4–8) number of tables whose widths, depths, and execution orderare set when the chip is fabricated. But this severelylimits flexibility. A chip used for a core router may require a very large 32-bit IP longest matching table anda small 128 bit ACL match table; a chip used for anL2 bridge may wish to have a 48-bit destination MACaddress match table and a second 48-bit source MACaddress learning table; an enterprise router may wish tohave a smaller 32-bit IP prefix table and a much largerACL table as well as some MAC address match tables.Fabricating separate chips for each use case is inefficient,and so merchant switch chips tend to be designed tosupport the superset of all common configurations, witha set of fixed size tables arranged in a pre-determinedpipeline order. This creates a problem for network owners who want to tune the table sizes to optimize fortheir network, or implement new forwarding behaviorsbeyond those defined by existing standards. In practice,MMT often translates to fixed multiple match tables.A second subtler problem is that switch chips offeronly a limited repertoire of actions corresponding tocommon processing behaviors, e.g., forwarding, dropping, decrementing TTLs, pushing VLAN or MPLS headers, and GRE encapsulation. And to date, OpenFlowspecifies only a subset of these. This action set is noteasily extensible, and is also not very abstract. A moreabstract set of actions would allow any field to be modified, any state machine associated with the packet to beupdated, and the packet to be forwarded to an arbitraryset of output ports.Reconfigurable Match Tables: Thus in this paper, we explore a refinement of the MMT model that wecall RMT (Reconfigurable Match Tables). Like MMT,ideal RMT would allow a set of pipeline stages eachwith a match table of arbitrary depth and width. RMTgoes beyond MMT by allowing the data plane to bereconfigured in the following four ways. First, field definitions can be altered and new fields added; second,the number, topology, widths, and depths of match tables can be specified, subject only to an overall resourcelimit on the number of matched bits; third, new actionsmay be defined, such as writing new congestion fields;fourth, arbitrarily modified packets can be placed inspecified queue(s), for output at any subset of ports,with a queuing discipline specified for each queue. Thisconfiguration should be managed by an SDN controller,but we do not define the control protocol in this paper.The benefits of RMT can be seen by considering newprotocols proposed in the last few years, such as PBB [16],VxLAN [22], NVGRE [19], STT [21], and OTV [20].Each protocol defines new header fields. Without an architecture like RMT, new hardware would be requiredto match on and process these protocols.Note that RMT is perfectly compatible with (andeven partly implemented by) the current OpenFlow specification. Individual chips can clearly allow an interfaceto reconfigure the data plane. In fact, some existingchips, driven at least in part by the need to addressmultiple market segments, already have some flavors of2

reconfigurability that can be expressed using ad hoc interfaces to the chip.Many researchers have recognized the need for something akin to RMT and have advocated for it. For example, the IETF ForCES working group developed thedefinition of a flexible data plane [17]; similarly, the forwarding abstraction working group in ONF has workedon reconfigurability [30]. However, there has been understandable skepticism that the RMT model is implementable at very high speeds. Without a chip to provide an existence proof of RMT, it has seemed fruitlessto standardize the reconfiguration interface between thecontroller and the data plane.Intuitively, arbitrary reconfigurability at terabit speedsseems an impossible mission. But what restricted formof reconfigurability is feasible at these speeds? Does therestricted reconfigurability cover a large enough fractionof the needs we alluded to earlier? Can one prove feasibility via working silicon that embodies these ideas?How expensive is such an RMT chip compared to afixed-table MMT chip? These are the questions we address in this paper.General purpose payload processing is not our goal.SDN/OpenFlow (and our design) aim to identify theessential minimal set of primitives to process headers inhardware. Think of it as a minimal instruction set likeRISC, designed to run really fast in heavily pipelinedhardware. Our very flexible design is cost-competitivewith fixed designs—i.e., flexibility comes at almost nocost.Paper Contributions: Our paper makes a concretecontribution to the debate of what forwarding abstractions are practical at high speed, and the extent towhich a forwarding plane can be reconfigured by thecontrol plane. Specifically, we address the questionsabove as follows:1. An architecture for RMT (§2): We describe anRMT switch architecture that allows definition of arbitrary headers and header sequences, arbitrary matching of fields by an arbitrary number of tables, arbitrary writing of packet header fields (but not the packetbody), and state update per packet. Several restrictionsare introduced to make the architecture realizable. Weoutline how a desired configuration can be expressed bya parse graph to define headers, and a table flow graphto express the match table topology.2. Use cases (§3): We provide use cases that showhow the RMT model can be configured to implementforwarding using Ethernet and IP headers, and supportRCP [7].3. Chip design and cost (§4–5): We show that thespecific form of reconfigurability we advocate is indeedfeasible and describe the implementation of a 640 Gb/s(64 10 Gb/s) switch chip. Our architecture and implementation study included significant detail in logicand circuit design, floorplanning and layout, using techniques proven over the design team’s long history ofdeveloping complex digital ICs. An industry standard28nm process was used. This work is necessary to provefeasibility in meeting goals such as timing and chip area(cost). We have not produced a complete design oractual silicon. Based on our investigation, we showthat the cost of reconfiguration is expected to be modest: less than 20% beyond the cost of a fixed (nonreconfigurable) version.We make no claim that we are the first to advocatereconfigurable matching or that our proposed reconfiguration functionality is the “right” one. We do claim thatit is important to begin the conversation by making aconcrete definition of the RMT model and showing itis feasible by exhibiting a chip, as we have attemptedto do in this paper. While chip design is not normallythe province of SIGCOMM, our chip design shows thata rather general form of the RMT model is feasible andinexpensive. We show that the RMT model is not onlya good way to think about programming the network,but also lends itself to direct expression in hardware using a configurable pipeline of match tables and actionprocessors.2.RMT ARCHITECTUREWe spoke of RMT as “allow(ing) a set of pipelinestages . . . each with a match table of arbitrary depthand width that matches on fields”. A logical deductionis that an RMT switch consists of a parser, to enablematching on fields, followed by an arbitrary number ofmatch stages. Prudence suggests that we include somekind of queuing to handle congestion at the outputs.Let’s look a little deeper. The parser must allow fielddefinitions to be modified or added, implying a reconfigurable parser. The parser output is a packet headervector, which is a set of header fields such as IP dest,Ethernet dest, etc. In addition, the packet header vector includes “metadata” fields such as the input port onwhich the packet arrived and other router state variables (e.g., current size of router queues). The vectorflows through a sequence of logical match stages, eachof which abstracts a logical unit of packet processing(e.g., Ethernet or IP processing) in Figure 1a.Each logical match stage allows the match table sizeto be configured: for IP forwarding, for example, onemight want a match table of 256K 32-bit prefixes andfor Ethernet a match table of 64K 48-bit addresses.An input selector picks the fields to be matched upon.Packet modifications are done using a wide instruction(the VLIW—very long instruction word—block in Figure 1c) that can operate on all fields in the header vectorconcurrently.More precisely, there is an action unit for each fieldF in the header vector (Figure 1c), which can take up3

Logical Stage 1Switch utQueuesRecombineVLIWActionNew Header.SelectMatchTablesLogical Stage sticsKOutputChannels(a) RMT model as a sequence of logical Match-Action stages.PhysicalStage 1PhysicalStage 2PhysicalStage MSrc 1: Ingress logicalmatch tablesActionMemoryPacketHeaderVectorSrc 1Src 2Src 3ActionUnitMatchTablesMatchResultsOP code(from instmem).Logical Stage 2PacketHeaderVector.LogicalStage NVery WideHeader Bus.Action Input Selector (Crossbar)Logical Stage 1Src 3ActionUnitSrc 2OP codeCtrl: Egress logicalmatch tablesVLIW Instruction Memory(b) Flexible match table configuration.(c) VLIW action architecture.Figure 1: RMT model architecture.to three input arguments, including fields in the headervector and the action data results of the match, andrewrite F . Allowing each logical stage to rewrite every field may seem like overkill, but it is useful whenshifting headers; we show later that action unit costsare small compared to match tables. A logical MPLSstage may pop an MPLS header, shifting subsequentMPLS headers forward, while a logical IP stage maysimply decrement TTL. Instructions also allow limitedstate (e.g., counters) to be modified that may influencethe processing of subsequent packets.Control flow is realized by an additional output, nexttable-address, from each table match that provides theindex of the next table to execute. For example, a matchon a specific Ethertype in Stage 1 could direct laterprocessing stages to do prefix matching on IP (routing), while a different Ethertype could specify exactmatching on Ethernet DAs (bridging). A packet’s fateis controlled by updating a set of destination ports andqueues; this can be used to drop a packet, implementmulticast, or apply specified QoS such as a token bucket.A recombination block is required at the end of thepipeline to push header vector modifications back intothe packet (Figure 1a). Finally, the packet is placed inthe specified queues at the specified output ports and aconfigurable queuing discipline applied.In summary, the ideal RMT of Figure 1a allows newfields to be added by modifying the parser, new fields tobe matched by modifying match memories, new actionsby modifying stage instructions, and new queueing bymodifying the queue discipline for each queue. An idealRMT can simulate existing devices such as a bridge, arouter, or a firewall; and can implement existing protocols, such as MPLS and ECN, and protocols proposed inthe literature, such as RCP [7] that uses non-standardcongestion fields. Most importantly, it allows futuredata plane modifications without modifying hardware.2.1Implementation Architecture at 640Gb/sWe advocate an implementation architecture shown4

in Figure 1b that consists of a large number of physicalpipeline stages that a smaller number of logical RMTstages can be mapped to, depending on the resourceneeds of each logical stage. This implementation architecture is motivated by:1. Factoring State: Router forwarding typically hasseveral stages (e.g., forwarding, ACL), each of whichuses a separate table; combining these into one tableproduces the cross-product of states. Stages are processed sequentially with dependencies, so a physical pipelineis natural.2. Flexible Resource Allocation Minimizing ResourceWaste: A physical pipeline stage has some resources(e.g., CPU, memory). The resources needed for a logicalstage can vary considerably. For example, a firewall mayrequire all ACLs, a core router may require only prefixmatches, and an edge router may require some of each.By flexibly allocating physical stages to logical stages,one can reconfigure the pipeline to metamorphose froma firewall to a core router in the field. The numberof physical stages N should be large enough so that alogical stage that uses few resource will waste at most1/N -th of the resources. Of course, increasing N willincrease overhead (wiring, power): in our chip designwe chose N 32 as a compromise between reducingresource wastage and hardware overhead.3. Layout Optimality: As shown in Figure 1b, a logical stage can be assigned more memory by assigningthe logical stage to multiple contiguous physical stages.An alternate design is to assign each logical stage toa decoupled set of memories via a crossbar [3]. Whilethis design is more flexible (any memory bank can beallocated to any stage), worst case wire delays betweena processing stage a

routers) using an open interface, such as OpenFlow. This paper aims to overcomes two limitations in current switch-ing chips and the OpenFlow protocol: i) current hardware switches are quite rigid, allowing “Match-Action” process-ing on only a fixed set of fields, and ii) the OpenFlow spec-