High-Speed Data Plane and Network Functions Virtualization by Vectorizing Packet Processing


High-Speed Data Plane and Network Functions Virtualization by Vectorizing Packet Processing

Leonardo Linguaglossa (1), Dario Rossi (1), Salvatore Pontarelli (2), Dave Barach (3), Damjan Marjon (3), Pierre Pfister (3)
(1) Telecom ParisTech, (2) CNIT and University of Rome Tor Vergata, (3) Cisco Systems, Inc.
first.last@telecom-paristech.fr

Abstract

In the last decade, a number of frameworks started to appear that implement, directly in user-space with kernel-bypass mode, high-speed software data plane functionalities on commodity hardware. This may be the key to replace specific hardware-based middleboxes with custom pieces of software, as advocated by the recent Network Function Virtualization (NFV) paradigm. Vector Packet Processor (VPP) is one such framework, representing an interesting point in the design space in that it offers: (i) user-space networking, (ii) the flexibility of a modular router (Click and variants), with (iii) high-speed performance (several millions of packets per second on a single CPU core), achieved through techniques such as batch processing that have become commonplace in high-speed networking stacks (e.g. netmap or DPDK). Similarly to Click, VPP lets users arrange functions as a processing graph, providing a full-blown stack of network functions. However, unlike Click, where the whole tree is traversed for each packet, in VPP each traversed node processes all packets in the batch (or vector) before moving to the next node. This design choice enables several code optimizations that greatly improve the achievable throughput. This paper introduces the main VPP concepts and architecture, and experimentally evaluates the impact of its design choices (such as batch packet processing) on its performance.

Keywords: Software routers, High-speed networking, Vector Packet Processing, Kernel-bypass

1. Introduction

Software implementation of networking stacks offers a convenient paradigm for the deployment of new functionalities, and as such provides an effective way to escape from network ossification. As a consequence, the past two decades have seen tremendous advances in software-based network elements, capable of advanced data plane functions on common off-the-shelf (COTS) hardware. This eventually evolved into new paradigms such as Network Function Virtualization (NFV), which proposes that classical middleboxes implementing regular network functions can be replaced by pieces of software executing virtual network functions (VNFs). One of the seminal attempts to circumvent the lack of flexibility in network equipment is represented by the Click modular router [30]: its main idea is to move some of the network-related functionalities, up to then performed by specialized hardware, into software functions to be run by general-purpose COTS equipment. To achieve this goal, Click offers a programming language to assemble software routers by creating and linking software functions, which can then be compiled and executed in a general-purpose operating system. While appealing, this approach is not without downsides: in particular, the original Click placed most of the high-speed functionalities as close as possible to the hardware, and these were thus implemented as separate kernel modules.

However, whereas a kernel module can directly access a hardware device, user-space applications need to explicitly perform system calls and use the kernel as an intermediate step, thus adding overhead to the main task.

More generally, thanks to the recent improvements in transmission speed and network card capabilities, a general-purpose kernel stack is far too slow for processing packets at wire-speed among multiple interfaces [11]. As such, a tendency has emerged to implement high-speed stacks by bypassing operating system kernels (aka kernel-bypass networks, referred to as KBnets in what follows) and bringing the hardware abstraction directly to the user-space, with a number of efforts (cfr. Sec. 2) targeting (i) low-level building blocks for kernel bypass like netmap [38] and the Intel Data Plane Development Kit (DPDK), (ii) very specific functions [24, 39, 36, 33] or (iii) full-blown modular frameworks for packet processing [32, 28, 11, 37, 12].

Figure 1: The architecture of a typical COTS software router, with special attention to the NIC (left) and the memory hierarchy (right).

In this manuscript, we describe the Vector Packet Processor (VPP), a framework for building high-speed data plane functions in software. Though released as open source only recently, specifically within the Linux Foundation project "Fast Data IO" (FD.io) [20], VPP is already a mature software stack in use in rather diverse application domains, ranging from virtual switching in data centers to support virtual machines [7] and inter-container networking [3], to virtual network functions in contexts such as 4G/5G [5] and security [6].

In a nutshell, VPP combines the flexibility of a modular router (retaining a programming model similar to that of Click) with high-speed performance. Additionally, it does so in a very effective way, by extending the benefits brought by techniques such as batch processing to the whole packet processing path, increasing as much as possible the number of instructions per clock cycle (IPC) executed by the microprocessor. This is in contrast with existing batch processing techniques, which are merely used to either reduce interrupt pressure (e.g., as done by low-level building blocks such as [38, 4, 18]) or are non-systematic and pay the price of a high-level implementation (e.g., in FastClick [11] batch processing advantages are offset by the overhead of managing linked lists).

Our previous analysis [10] shows that VPP can sustain a rate of up to 12 million packets per second (Mpps) with a general-purpose CPU. We extend our performance evaluation, proving highly predictable performance (as shown by the constrained latency), a throughput scale-up (depending on whether we increase the CPU frequency, use Hyper-Threading or simply allocate more cores) and the possibility to adopt VPP for different use-cases (testified by the different scenarios evaluated).

In the rest of the article, we put VPP in the context of related kernel-bypass efforts (Sec. 2). We then introduce the main architectural ingredients behind VPP (Sec. 3), and assess their benefits with an experimental approach, by describing both our methodology (Sec. 4) and our results (Sec. 5). We finally discuss our findings and report on open questions in Sec. 6. In line with the push toward research reproducibility, we make all scripts available at [9] alongside instructions to replicate our work.

2. Background

This section overviews the state of the art of the software data plane: Sec. 2.1 introduces popular and new techniques and Sec. 2.2 maps them to existing KBnets frameworks, of which we provide a compact summary in Tab. 1.

2.1. KBnets-related techniques

By definition, KBnets avoid the overhead associated with kernel-level system calls: to do so, they employ a plethora of techniques, which we overview with respect to Fig. 1, which illustrates the general COTS architecture we consider in this work. We now describe in detail the columns of Table 1, following the packet reception path.

Lock-free Multi-threading (LFMT). The current tendency towards multi-core COTS hardware allows embracing a multi-threaded programming paradigm. Ideally, the parallelism degree of a network application, which represents the number of threads simultaneously running, is related to a speed-up in the performance of the system: the more threads available, the better the performance, up to a saturation point where increasing the number of threads does not affect the performance. At the same time, to achieve this ideal speed-up, it is imperative to avoid the performance issues tied to the use of synchronization techniques (mutexes, semaphores, etc.), in order to achieve lock-free operations. Lock-free parallelism in KBnets is tightly coupled with the availability(1) of multiple hardware queues (discussed next), which lets threads operate on independent traffic subsets.

(1) Even when a single receive queue is available, a software scheduler (potentially the system bottleneck) can assign different packets to different threads, and then perform independent processing.

I/O batching (IOB). Upon reception of a packet, the NIC writes it to one of its hardware queues. To avoid raising an interrupt as soon as a new packet is ready to be scheduled for processing, KBnets batch packets in a separate buffer and send an interrupt once the whole buffer is full: the NIC simply writes the buffer via Direct Memory Access (DMA) and appends a reference to its position to the packet ring (a circular buffer of memory accessible by both the network device and the user-space application). Overall, I/O batching amortizes the overhead due to interrupt processing, and can speed up the overall processing. It is worth pointing out that VPP extends batch processing from pure I/O (which reduces interrupt overhead) to complete graph processing (which ameliorates the efficiency of the CPU pipeline).

RSS Queues. Modern NICs support multiple RX/TX hardware queues, and Receive-Side Scaling (RSS) [25] is the technique used to assign packets to a specific queue. RSS queues are generally accessible in userland and are typically used for hardware-based packet classification or to assist (per-flow) multi-thread processing. Depending on hardware capabilities, packets can simply be grouped into flows by means of a hash function over their 5-tuple (grouping both directions is also trivial [42]), but recent NICs support more involved matching (e.g., up to 4096-bit hash filtering, which the framework needs to make accessible in userland).

Zero-Copy (Z-C). When the Network Interface Card (NIC) has some packets available, it writes them to a reserved memory region, which is shared between the network interface and the operating system. Early KBnets required user-space applications to access this memory through system calls (i.e., a memory copy operation), whereas in most of the latest KBnets approaches the user-space application has Direct Memory Access to the memory region used by the NIC. Notice that zero-copy is sustainable only when the packet consumer is faster than the packet arrival rate (occasional slowdowns may need application-level buffers or drops). Typical zero-copy approaches leverage the fact that most network applications perform little to no operations on the packet payload, which can therefore be left unchanged, while only a subset of metadata (usually the L2/L3 addresses) is provided to the user-space application via the NIC driver.
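To make the I/O batching (IOB) technique concrete, the sketch below shows a minimal DPDK poll-mode receive loop: a single rte_eth_rx_burst() call drains up to BURST_SIZE packets that the NIC has already written to the ring via DMA, so the per-packet interrupt and system-call overhead is amortized over the burst. The burst size and the placeholder per-packet handling are assumptions of this illustration, not values mandated by DPDK or VPP.

#include <rte_ethdev.h>   /* DPDK ethdev API: rte_eth_rx_burst()   */
#include <rte_mbuf.h>     /* packet buffer (mbuf) definitions       */

#define BURST_SIZE 32     /* packets pulled per poll (illustrative) */

/* Poll-mode receive loop: the NIC DMAs packets into the ring on its own;
 * we fetch them in bursts instead of reacting to one interrupt per packet. */
static void rx_loop(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++) {
            /* application-specific processing would go here */
            rte_pktmbuf_free(pkts[i]);   /* return the buffer to its pool */
        }
    }
}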

Table 1: State of the art in KBnets frameworks: low-level building blocks, purpose-specific prototypes (e.g., name-based forwarding and caching, HCS [33]) and modular frameworks (e.g., Click [11], BESS/SoftNIC [23] and VPP, this work), compared in terms of the techniques of Sec. 2.1, modularity and supported functions.

Compute Batching (CB). When packets are retrieved from the NIC, the software can start the packet processing. This is typically performed on a per-packet basis, that is, all network functions in the processing path are applied to the same packet until the final forwarding decision. With compute batching, the notion of I/O batching is extended to the processing itself, meaning that the network functions are implemented to natively treat batches of packets rather than single packets. Similarly to I/O batching, CB helps mitigate the overhead of function accesses (e.g. context switch, stack initialization) as well as providing additional computational benefits: when a batch of packets enters the function, the code of the function will be loaded and a miss in the L1 instruction cache will occur only for the first packet, while for all other packets the code to be run will already be in the cache; furthermore, the code might be written to exploit low-level parallelism (i.e. when instructions on subsequent packets are independent) in order to increase the number of instructions issued at the same clock cycle. CB can provide a performance speed-up or not, depending on how systematic the implementation of the CB techniques is (cfr. Sec. 5.2).
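The two calling conventions can be sketched in plain C as follows; this is only an illustrative comparison (the struct and function names are invented for the example) and is not code from any of the cited frameworks.

struct pkt { unsigned char *data; unsigned len; };

/* Per-packet model: the framework invokes this once per packet, so the
 * call overhead and the instruction-cache warm-up recur for every packet. */
void process_one(struct pkt *p)
{
    /* parse headers, look up, rewrite ... (placeholder) */
    (void)p;
}

/* Compute-batching model: the framework invokes this once per batch.
 * The function's code is brought into the L1 instruction cache for the
 * first packet and stays hot for the remaining n-1 packets, while the
 * per-call overhead is amortized over the whole batch. */
void process_batch(struct pkt *batch, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        /* same per-packet logic as in process_one(), applied in-line */
        (void)batch[i];
    }
}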

Cache Coherence & Locality (CC&L). A major bottleneck for software architectures is nowadays represented by memory access [13]. At the hardware level, current COTS architectures counter this by offering 3 levels of cache memories with faster access times (for the sake of illustration, the cache hierarchy of an Intel Core i7 CPU [17] is reported on the right of Fig. 1). In general, the L1 cache (divided into instruction L1-i and data L1-d) is accessed on a per-core/processor basis, L2 caches are either per-core or shared among multiple cores, and L3 caches are shared within a NUMA node. The speed-up provided by the cache hierarchy is significant: the access time of an L1 cache is about 1 ns, whereas the access time of an L2 (L3) cache is about 10 ns (60 ns), and it is in the order of 100 ns for the main DDR memory. When the data is not present at a given level of the cache hierarchy, a cache miss occurs, forcing access to the higher levels and slowing down the overall application.

Low-level parallelism (LLP). Together with the userland parallelism, a lower level of parallelism can be achieved by exploiting the underlying CPU micro-architecture, which consists of a multiple-stage pipeline (instruction fetch or load/store register are two examples of such stages), one or more arithmetic-logical units (ALU) and branch predictors that detect "if" conditions (which may cause pipeline invalidation) and keep the pipeline fully running [26]. Efficient code leads to (i) an optimal utilization of the pipelines and (ii) a higher degree of parallelism (that is, executing multiple instructions per clock cycle). Furthermore, giving "hints" to the compiler (e.g. when the probability of some "if" condition is known to be very high) can also improve the throughput. As we shall see, vectorized processing, coupled with particular coding practices, can exploit the underlying architecture, thus resulting in a better throughput for user-space packet processing. To the best of our knowledge, VPP is among the first (if not the first) approaches to leverage systematic low-level parallelism through its design and coding practices.

2.2. KBnets frameworks

We identify three branches of software frameworks for high-speed packet processing based on kernel bypass, depending on whether they target low-level building blocks [38, 4, 18], very specific functions [24, 39, 36, 33] or full-blown modular frameworks [32, 28, 11, 37, 12].

Low-level building blocks. This class of work has received quite a lot of attention, with valuable frameworks such as netmap [38], DPDK [4] and PF_RING [18]. In terms of features, most of them support high-speed I/O through kernel-bypass, zero-copy, I/O batching and multi-queuing, though subtle differences may still arise among frameworks [11] and their performance [21]. A more detailed comparison of the features available in a larger number of low-level frameworks is available in [11], whereas an experimental comparison of DPDK, PF_RING and netmap (for relatively simple tasks) is available in [21]. Worth mentioning are also the eXpress Data Path (XDP) [41], which embraces similar principles but in a kernel-level approach, and the Open Data Plane (ODP) project [8], an open-source cross-platform set of APIs running on several hardware platforms such as x86 servers or networking System-on-Chip processors.

Purpose-specific prototypes. Another class of work is represented by prototypes that offer a very specific and restrained set of capabilities, such as IP routing [24], traffic classification [39], name-based forwarding [36] or transparent hierarchical caching [33]. In spite of the different goals, and the possible use of network processors [36] or GPUs [24], a number of commonalities arise. PacketShader [24] is a GPU-accelerated software IP router. In terms of low-level functions it provides kernel-bypass and I/O batching, but not zero-copy. MTclass [39] is a CPU-only traffic classification engine capable of working at line rate, employing a multi-thread lock-free programming paradigm; at low level, MTclass uses PacketShader, hence inheriting the aforementioned limitations. The prototypes in [36, 29] and [33] address high-speed solutions for two specific functions related to Information-Centric Networking (ICN) architectures, namely name-based forwarding and caching ([36] employs a network processor whereas [29, 33] use DPDK). In all these cases, multi-thread lock-free programming enabled by RSS queues is the key to scaling up operations in user-space. In contrast to these efforts, VPP aims for generality, feature richness and consistent performance irrespective of the specific purpose.(2)

(2) A limitation of netmap is that it does not allow direct access to the NIC's registers [19].
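Referring back to the low-level parallelism (LLP) paragraph above, the compiler "hints" are commonly expressed with the GCC/Clang builtin __builtin_expect, which data-plane code bases typically wrap in macros (VPP, for instance, provides PREDICT_TRUE/PREDICT_FALSE for the same purpose). The sketch below is illustrative; checksum_ok() is a hypothetical validation routine.

/* Branch hints in the style used by several data-plane code bases. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int checksum_ok(const void *hdr);   /* hypothetical validation routine */

int classify(const void *hdr)
{
    /* The drop path is rare: marking it UNLIKELY keeps the common path
     * on the fall-through side and helps the branch predictor. */
    if (UNLIKELY(!checksum_ok(hdr)))
        return -1;                  /* drop */
    return 0;                       /* continue processing */
}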

Full-blown modular frameworks. Full-blown modular frameworks are closer in scope to VPP. Leaving aside relevant but proprietary stacks [2], work such as [37, 12, 32, 28, 11] is worth citing. In more detail, Arrakis [37] and IX [12] are complete environments for building network prototypes, including the I/O stack and the software processing model. Arrakis' main goal is to push kernel-bypass further, beyond network stack functionalities, whereas IX additionally separates some functions of the kernel (control plane) from network processing (data plane), and is as such well suited to building SDN applications.

The closest work to VPP is represented by Click [30], which shares the goal of building a flexible and fully programmable software router. Whereas the original Click cannot be listed among KBnets applications (as it requires a custom kernel and runs in kernel-mode, thus being not suited for high-speed processing), a number of extensions have over the years brought elements of KBnets into Click. In particular, RouteBricks [32], DoubleClick [28] and FastClick [11] all support the kernel version of Click, introducing support for HW multi-queue [32], batching [28], and high-speed processing [11], possibly obtained through a dedicated network processor [31]. Click has also inspired work such as the Berkeley Extensible Software Switch (BESS, formerly known as SoftNIC) [23], which presents a similar programming model and implements the most important software acceleration techniques. Important differences between Click (and its variants) and VPP arise in the scheduling of packets in the processing graph, and are discussed further in Sec. 3.

The evolution of full-blown modular frameworks is the key to enabling NFV composition and deployment. Most of the recent NFV frameworks are in fact built on top of one of the aforementioned tools: we cite, for instance, ClickNF [22], a modular stack for the composition of L2-L7 network functions built on FastClick; E2 [34], a framework for creating and managing virtual functions built on top of BESS; and NetBricks [35], a clean-slate approach that leverages pure DPDK and the Rust [1] programming language to provide high-speed NFV capabilities.

3. VPP Architecture

Initially proposed in [16], the VPP technology was recently released as open-source software, in the context of the FD.io Linux Foundation project [20]. In a nutshell, VPP is a framework for high-speed packet processing in user-space, designed to take advantage of general-purpose CPU architectures. In contrast with frameworks whose first aim is performance on a limited set of functionalities, VPP is feature-rich (it implements a full network stack, including functionalities at layer 2, 3 and above), and is designed to be easily customizable. As illustrated in Fig. 2, VPP aims at leveraging recent advances in the KBnets low-level building blocks illustrated earlier: as such, VPP runs on top of DPDK, netmap, etc. (and ODP, binding in progress), which are used as input/output nodes for the VPP processing. It is to be noted that non-KBnets interfaces such as AF_PACKET sockets or tap interfaces are also supported.

Figure 2: VPP scope and processing tree. The nodes we use later in the experiments (as well as their neighbors) are highlighted in red (IP) and green (Eth). Process nodes are depicted in blue.

At a glance. VPP's processing paradigm follows a "run-to-completion" model. First a batch of packets is polled using a KBnets interface (like DPDK), after which the full batch is processed. Poll-mode is quite common, as it increases the processing throughput under high traffic (but requires 100% CPU usage regardless of the traffic load), whereas native compute batching is a novel ingredient.

VPP is written in C, and its sources comprise a set of low-level libraries for realizing custom packet processing applications as well as a set of high-level libraries implementing specific processing tasks (e.g. l2-input, ip4-lookup) representing the main core of the framework.

User-defined extensions, called plugins, may define additional functionalities or replace existing ones (e.g., flowperpkt-plugin, dpdk-plugin). The main core and the plugins together form a forwarding graph, which describes the possible paths a packet can follow during its processing. In more detail, VPP allows three sets of nodes: namely process, input, and internal (which can be terminating leaves, i.e., output nodes). Process nodes (blue nodes in Fig. 2) do not participate in packet forwarding, being simply software functions running on the main core(3) and reacting to timers or user-defined events. Input nodes abstract a NIC interface, and manage the initial vector of packets. Internal nodes are traversed after an explicit call by an input node or another internal node. For some nodes, a set of fine-grained processing tasks (aka features(4) in VPP's terminology) can be activated/deactivated on demand at runtime.

(3) VPP features its own internal implementation of cooperative multitasking [14], which allows running multiple process nodes on the main core.
(4) Which are not functions and thus do not incur function call overhead.

The VPP architecture adopts all the well-known KBnets techniques discussed in Sec. 2, to which it adds a design (Sec. 3.1) and coding practices (Sec. 3.2) that are explicitly tailored to (i) minimize data cache misses by using data prefetching, (ii) minimize instruction cache misses, and (iii) increase the number of instructions per cycle that the CPU front-end can fetch; we describe these in what follows.

3.1. Vectorized processing

The main novelty of VPP is to offer a systematic way to efficiently process packets in a "vectorized" fashion: instead of letting each packet traverse the whole forwarding graph, each node processes all packets in a batch, which provides sizable performance benefits (cfr. Sec. 5.2).

Input nodes produce a vector of work to process: the graph node dispatcher pushes the vector through the directed graph, subdividing it as needed, until the original vector has been completely processed. At that point, the process recurs. Notice that not all packets follow the same path in the forwarding graph (i.e., vectors may be different from node to node). While it is outside our scope to provide a full account of all the available nodes [40], Fig. 3 compactly depicts a subset of the full VPP graph (comprising 253 nodes and 1479 edges) and illustrates the vectorized processing. We consider the case of a vector consisting of a mixture of traffic, and then focus on classical IPv4 processing for the sake of the example. Notice how the overall processing is decoupled into different components, each of them implemented by a separate node. VPP's workflow begins with a node devoted to packet reception (dpdk-input), after which the full vector is passed to the next node, dealing with packet parsing (l2-input). Here the vector can be split in case there are multiple protocols to process. After this step, we enter the IPv4 routing procedure (split into ip4-input, ip4-lookup, ip4-rewrite). The workflow finally ends in a forwarding decision (l2-forward) for the re-assembled vector. A drop decision may be taken at every step (error-drop).

Figure 3: Illustration of the vectorized packet processing model (Step 1: RX from NIC; Step 2: parse vector; Steps 3 to n-1: per-protocol batch processing; Step n: TX to NIC).
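A highly simplified sketch of the per-node, per-vector dispatch just described is given below. It is meant only to convey the idea behind Fig. 3 and does not reproduce VPP's actual vlib API: the node structure, the fixed fan-out, the vector size and the recursive dispatcher are all inventions of this example.

#define VEC_MAX  256
#define MAX_NEXT 4

typedef struct node node_t;
struct node {
    const char *name;                      /* e.g. "l2-input", "ip4-lookup" */
    void (*process)(node_t *self, const unsigned *pkt, unsigned n);
    node_t  *next[MAX_NEXT];               /* successor nodes                */
    unsigned out[MAX_NEXT][VEC_MAX];       /* per-successor output vectors   */
    unsigned out_n[MAX_NEXT];
};

/* Called by a node's process() to steer one packet index towards the
 * vector of successor s (e.g. IPv4 traffic towards an ip4-input node). */
static void enqueue_to(node_t *self, unsigned s, unsigned pkt_index)
{
    self->out[s][self->out_n[s]++] = pkt_index;
}

/* Dispatcher: run one node over its whole input vector, then recurse on
 * every successor that received packets, so each node touches all packets
 * of the vector before the next node is entered. */
static void dispatch(node_t *n, const unsigned *pkt, unsigned cnt)
{
    if (cnt == 0)
        return;
    n->process(n, pkt, cnt);               /* per-node batch processing */
    for (unsigned s = 0; s < MAX_NEXT; s++) {
        if (n->next[s] && n->out_n[s]) {
            unsigned m = n->out_n[s];
            n->out_n[s] = 0;               /* reset before recursing */
            dispatch(n->next[s], n->out[s], m);
        }
    }
}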

Advantages of vectorized processing. In a classic "run-to-completion" [30, 11] approach, the different functions of the graph are applied to the same packet, generating a significant performance penalty. This penalty is due to several factors. (i) The instruction cache miss rate increases when a different function has to be loaded and the instruction cache is already full. (ii) There is a non-negligible framework overhead tied to the selection of the next node to access in the forwarding graph and to its function call. (iii) It is difficult to define a prefetching strategy that can be applied to all nodes, since the next execution node is unknown and since each node may require access to a different portion of the packet data.

VPP exploits a "per-node batch processing" to minimize these effects. In fact, since a node function is applied to all the packets in the batch, instruction misses can occur only for the first packet of the batch (for a reasonable code size of a node). Moreover, the framework overhead is shared among all the packets of the batch, so the per-packet overhead becomes negligible when the batch size is of hundreds of packets. Finally, this processing enables an efficient data prefetching strategy. When the node is called, it is known which packet data (e.g. which headers) are necessary to perform the processing task. This allows prefetching the data for the (i+1)-th packet while the node processes the data of the i-th packet.

Vector size. Notice that while the maximum number of packets per vector can be controlled, the actual number of packets processed depends on the duration of the processing tasks, and on the number of new packets arriving during this time. Intuitively, in case of a sudden increase of the arrival rate, the next vector will be longer. However, the processing efficiency (measured in terms of clock cycles per packet) increases by amortizing fixed costs over a larger number of elements. Thus, when a batch is larger, it is expected that the processing time of each packet in the batch decreases, and so does the size of the subsequent input vector. In practice, this sort of feedback loop helps maintaining a stable equilibrium in the vector size.

Vector processing vs I/O Batching. In some sense, VPP extends I/O batching to the upper layers of KBnets processing. However, if the goal of batching I/O operations is to reduce the interrupt frequency, the goal of vectorized processing is to decrease the overall number of clock cycles needed to process a packet, amortizing the overhead of the framework over the batch. These two goals are complementary.

Vector processing vs Compute Batching. It is worth pointing out that tools such as G-opt [27], FastClick [11] and the pipelines of the DPDK Packet Framework [4] do offer some form of "Compute Batching", which however resembles batching in VPP only from a very high-level view, as several fundamental differences arise on a closer look. In G-opt, batching serves only the purpose of avoiding CPU stalls due to memory latency. The pipeline model of the DPDK Packet Framework is used to share the processing among different CPUs and is not focused on improving the performance of a single core. Instead, FastClick's "Compute Batching" (see Sec. 5.7 in [11]) and the BESS [23] implementation are close in spirit to VPP.

However, the functions implemented in any VPP node are designed to systematically process vectors of packets. This natively improves performance and allows code optimization (data prefetching, multi-loop, etc.). In contrast, nodes in FastClick implicitly process packets individually, and only specific nodes have been augmented to also accept batched input. Indeed, per-vector processing is a fundamental primitive in VPP. Vectors are pre-allocated arrays residing in contiguous portions of memory, which are never freed, but efficiently managed in re-use lists. Additionally, vector elements are 32-bit integers that are mapped to the 64-bit pointers to the DMA region holding the packet with an affine operation (i.e., a multiplication and offset that are performed with a single PMADDWD x86 instruction).
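The affine mapping from a 32-bit vector element to the 64-bit pointer of the corresponding packet buffer amounts to a single multiply-add, as in the portable C sketch below; the base address, stride and offset are parameters of this illustration (VPP derives the equivalent quantities from its pre-allocated buffer pool, and, as noted above, the operation can be carried out with a SIMD integer instruction such as PMADDWD).

#include <stdint.h>

/* Vectors carry compact 32-bit buffer indices; the packet buffers live in a
 * pre-allocated, DMA-able region, and a pointer is recovered with an affine
 * map: pointer = base + index * stride + offset. */
static inline void *buffer_from_index(uint8_t *dma_base,
                                      uint32_t index,
                                      uint32_t stride,   /* bytes per buffer slot   */
                                      uint32_t offset)   /* e.g. skip buffer header */
{
    return dma_base + (uint64_t)index * stride + offset;
}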

In FastClick, batches are instead constructed by using the simple linked-list implementation available in Click, with significantly higher memory occupancy (inherently less cacheable) and higher overhead (adding further 64-bit pointers to manage the list).

Ultimately, these low-level differences translate into quite diverse performance benefits. VPP's vectorized processing is lightweight and systematic (as is BESS's compute batching): in turn, processing vectors of packets consistently increases the throughput, and our measurements confirm a significant speed-up with respect to the treatment of individual packets. In contrast, the overhead of opportunistic batching/splitting in FastClick, coupled with its linked-list management, yields limited achievable benefits in some cases and none

removing data dependency, but does not provide benefit when the performance bottleneck is due to the number of memory accesses.

Data prefetching. Once a node is called, it is possible to prefetch the data that the node will use for the (i+1)-th packet while processing the i-th packet. Prefetching can be combined with multi-loop, i.e. prefetching data for packets from i+1 to i+N while processing packets from i-N to i. This optimization does not work at the vector bounds: for the first N
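A minimal sketch of the multi-loop-with-prefetch pattern described in the data-prefetching paragraph, here with N = 2 and the standard GCC/Clang __builtin_prefetch intrinsic; the pkt structure and process_hdr() are placeholders of this illustration, and the epilogue loop shows the special handling needed at the vector bounds.

struct pkt { unsigned char *hdr; unsigned len; };

void process_hdr(struct pkt *p);    /* per-packet work (placeholder) */

/* Dual-loop: handle two packets per iteration while prefetching the headers
 * of the two packets that the next iteration will touch, hiding memory
 * latency behind useful work on the current pair. */
void node_dual_loop(struct pkt *v, unsigned n)
{
    unsigned i = 0;

    for (; i + 4 <= n; i += 2) {
        __builtin_prefetch(v[i + 2].hdr, 0, 3);   /* read, keep in cache */
        __builtin_prefetch(v[i + 3].hdr, 0, 3);
        process_hdr(&v[i]);
        process_hdr(&v[i + 1]);
    }
    for (; i < n; i++)              /* leftover packets at the vector bounds */
        process_hdr(&v[i]);
}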
