Nexus 9000 Architecture - Isp-tech.ru

Transcription

Nexus 9000 Architecture
Mike Herbert - Principal Engineer, Cisco

The Original Agenda
This intermediate level session will describe the Cisco Nexus 9000 architecture and innovations in terms of hardware, software, mechanical design, optical advantages in the 40 GE environment and power budget. The unique combination of merchant silicon and Cisco internally developed ASICs makes this platform a leader in the Data Centre switch market. This session will also approach the Data Centre design aspect and describe the advantages of the Spine-Leaf architecture.

The new Agenda – It is still the N9K Architecture Session but with details on next Generation as well
In the upcoming year, 2016, the industry will see a significant capacity, capability and cost point shift in Data Centre switching. The introduction of 25/100G, supplementing the previous standard of 10/40G at the same cost points and power efficiency, delivers 2.5 times the capacity for roughly the same capital cost and is just one example of the scope of the change. These changes are occurring due to the introduction of new generations of ASICs leveraging improvements in semiconductor fabrication combined with innovative developments in network algorithms, SerDes capabilities and ASIC design approaches. This session will take a deep dive look at the technology changes enabling this shift and the architecture of the next generation Nexus 9000 Data Centre switches enabled by these changes. Topics will include the introduction of 25/50/100G to complement existing 10/40G, why next generation fabrication techniques enable much larger forwarding scale, more intelligent buffering and queuing algorithms, and embedded telemetry enabling big data analytics based on network traffic.

Agenda
Existing and New Nexus 9000 & 3000
What’s New
  Moore’s Law and 25G SerDes
  The new building blocks (ASE-2, ASE-3, LSE)
  Examples of the Next Gen Capabilities
Nexus 9000 Switch Architecture
  Nexus 9200/9300 (Fixed)
  Nexus 9500 (Modular)
100G Optics

Cisco Data Centre Networking Strategy: Providing Choice in Automation and Programmability
Application Centric Infrastructure
  Turnkey integrated solution with security, centralised management, compliance and scale
  Automated application-centric policy model with embedded security
  Broad and deep ecosystem
Programmable Fabric
  VxLAN-BGP EVPN, standards-based
  3rd party controller support
  Cisco Controller for software overlay provisioning and management across N2K-N9K
Programmable Network
  Modern NX-OS with enhanced NX-APIs
  DevOps toolset used for Network Management (Puppet, Chef, Ansible etc.)
Platforms shown: Nexus 9400 (line cards), 9200, 3100, 3200; Nexus 9700EX, 9300EX

Nexus 9000 Portfolio: 10/25/40/50/100G on Merchant or Cisco Silicon
Over 6000 Nexus 9K Customers
Nexus 9300
  48p 10G & 4p 40G (VXLAN routing option)
  48p 10G & 6p 40G
  96p 10G & 6p 40G
  32p 40G
Nexus 9300EX (Industry Only 25G Native VXLAN)
  48p 10/25G SFP & 6p 40/50/100G
  48p 10GT & 6p 40/50/100G
Nexus 9500 (Nexus 9504, Nexus 9508, Nexus 9516)
  36p 40G (ACI)
  36p 25/40/50/100G (ACI & NX-OS)
  32p 40G (NX-OS)
  32p 25/40/50/100G (NX-OS)
  Existing Chassis: Delivering on Investment Protection Promise
  Continued Support for All Existing Line Cards, Merchant and Cisco Silicon
Nexus 9200 (Industry Only 25G Native VXLAN)
  36p wire rate 40/50/100G
  56p 40G + 8p 40/50/100G
  72p 40G
  48p 10/25G SFP & 4p 40/50/100G + 2p 40G

Continued Support of Broadcom Silicon
Nexus 3000: 10 Million Ports Shipped
Platforms: Nexus 3100, Nexus 3100V, Nexus 3200 (shipping for 3 months)
Silicon: T2, Tomahawk
Port configurations shown: 32p 40G; 32p 25/50/100G; 48p 10G & 6p 100G; 64p 40G Single Chip; 64p 40G; 32p 40G; 48p 10G & 6p 40G; 48p 1G & 4p 10G
Capabilities shown: VXLAN routing, 100G uplinks, no 25G; VXLAN bridging, 25/100G
Single NX-OS Image for Nexus 3000 & Nexus 9000


“The number of transistors incorporated into a chip will approximately double every 24 months”
“Moore’s Law” - 1975

Moore’s Law
[Figure: CMOS cross-section with in, out, VDD, VSS, p and n regions and the n-well, with the “feature size” dimension marked]
“Feature size”: this dimension is what Moore’s Law is all about!

Moore’s Law: It’s all about the Economics
Increased function, efficiency
Reduced costs, power
1.6x increase in gates between process nodes
The new generation of Nexus 9000 is leveraging 16nm FF (FinFET)
Process nodes by year:
  BCOM 40nm - 2013
  Cisco 28nm - 2014
  BCOM 28nm - 2016
  Cisco 16FF - 2016
  Intel 14nm - 2016
http://en.wikipedia.org/wiki/Semiconductor_device_fabrication

SerDes: Serializer Deserializer
SerDes clocking increases:
  10.3125G (10G, 40G)
  25.78125G (25G/50G/100G) - 2016
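The two clock rates on this slide can be reproduced with a little arithmetic. The sketch below assumes nominal 64b/66b line coding (66 bits on the wire for every 64 bits of data), which the slide does not state; the function names are illustrative only.

```python
from fractions import Fraction

# Assumption: nominal 64b/66b line coding, 66 wire bits per 64 data bits.
ENCODING_OVERHEAD = Fraction(66, 64)


def serdes_rate(lane_data_rate_gbps):
    """Return the lane signalling rate for a given per-lane data rate."""
    return float(Fraction(lane_data_rate_gbps) * ENCODING_OVERHEAD)


def lanes_needed(port_speed_gbps, lane_data_rate_gbps):
    """Number of parallel lanes needed to carry a port speed."""
    return -(-port_speed_gbps // lane_data_rate_gbps)  # ceiling division


print(serdes_rate(10))        # 10.3125
print(serdes_rate(25))        # 25.78125
print(lanes_needed(40, 10))   # 4
print(lanes_needed(100, 25))  # 4
```

The same four-lane structure carries 40G on 10G SerDes and 100G on 25G SerDes, which is why the move to 25G lanes raises capacity without adding lanes.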

Multi Lane Distribution (MLD)
40GE/100GE interfaces have multiple lanes (coax cables, fibres, wavelengths).
MLD provides a simple (common) way to map 40G/100G to physical interfaces of different lane widths.

Parallel Lanes
4 x 10 = 40G shifts to 4 x 25 = 100G
40G: backed by 10G SerDes
100-GbE: backed by 25G SerDes

Development Cycle Decreasing
Time to Leverage Moore’s Law is Reducing
[Figure: features and capabilities plotted over 2016 to 2021, comparing a classical ASIC 2 year dev cycle with a tick-tock 18 month dev cycle]

ASIC Used by Nexus 3000/9000
1st Gen Switches: 2013–2015
  Merchant: Trident T2 (40nm)
  Cisco: ASE & ALE (28nm)
2nd Gen Switches: 2016
  Merchant: Tomahawk, Trident 2 (28nm)
  Cisco: ASE2, ASE3 & LSE (16nm)
Capabilities
  Scale: route/host tables, sharding, encap normalisation, EPG/SGT/NSH
  Telemetry: analytics, Netflow, atomic counters
  Optimisation: smart buffers, DLB/flow prioritisation

ASIC Used by Nexus 3000/9000: Cisco 16nm, ASE-2 and ASE-3
ASE2 – ACI Spine Engine 2
  3.6 Tbps forwarding (line rate for all packet sizes)
  36x100GE, 72x40GE, 144x25GE, ...
ASE3 – ACI Spine Engine 3
  1.6 Tbps forwarding (line rate for all packet sizes)
  16x100GE, 36x40GE, 74x25GE, ...
  Flow Table (Netflow, ...)
ASE-2 and ASE-3
  Standalone leaf and spine, ACI spine
  16K VRF, 32 SPAN, 64K MCAST fan-outs, 4K NAT
  MPLS: Label Edge Router (LER), Label Switch Router (LSR), Fast Re-Route (FRR), Null-label, EXP QoS classification
  Push/Swap maximum of 5 VPN labels + 2 FRR labels
  8 unicast + 8 multicast queues
  Flexible DWRR scheduler across 16 queues
  Active Queue Management: AFD, WRED, ECN marking
  Flowlet Prioritisation & Elephant-Trap for trapping the 5-tuple of large flows
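The DWRR scheduler named above can be illustrated with the textbook deficit weighted round robin algorithm. This is a minimal sketch of that general algorithm, not the ASIC's implementation; the quanta, queue contents and function name are assumptions chosen for the example.

```python
from collections import deque


def dwrr(queues, quanta, rounds):
    """Serve packets (byte lengths) from queues using deficit weighted round robin.

    queues: list of deques holding packet sizes in bytes
    quanta: bytes of credit added to each queue per round (its weight)
    Returns the (queue_index, packet_size) pairs in service order.
    """
    deficits = [0] * len(queues)
    served = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0  # idle queues do not bank credit
                continue
            deficits[i] += quanta[i]
            # Send packets while the head packet fits in the remaining credit.
            while q and q[0] <= deficits[i]:
                size = q.popleft()
                deficits[i] -= size
                served.append((i, size))
            if not q:
                deficits[i] = 0
    return served


# Two queues of 1500-byte packets with a 2:1 weight ratio.
queues = [deque([1500] * 4), deque([1500] * 4)]
order = dwrr(queues, quanta=[3000, 1500], rounds=2)
bytes_q0 = sum(size for i, size in order if i == 0)
bytes_q1 = sum(size for i, size in order if i == 1)
print(bytes_q0, bytes_q1)  # 6000 3000
```

Over two rounds the first queue sends twice as many bytes as the second, matching the 2:1 quanta.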

ASIC Used by Nexus 3000/9000: Cisco 16nm, LSE
LSE – Leaf Spine Engine
  Standalone leaf & spine, ACI leaf and spine
  Flow Table (Netflow, ...)
  ACI feature, service and security enhancements
  40MB buffer
  32G Fibre Channel and 8 unified ports
  25G and 50G RS FEC (clause 91)
  Energy Efficient Ethernet, IEEE 802.3az
  Port TX SPAN support for multicast
  MPLS: Label Edge Router (LER), Label Switch Router (LSR), Fast Re-Route (FRR), Null-label, EXP QoS classification
  Push/Swap maximum of 5 VPN labels + 2 FRR labels
  16K VRF, 32 SPAN, 64K MCAST fan-outs, 50K NAT
  8 unicast + 8 multicast queues
  Flexible DWRR scheduler across 16 queues
  Active Queue Management: AFD, WRED, ECN marking
  Flowlet Prioritisation, Elephant-Trap for trapping the 5-tuple of large flows

ASIC Used by Nexus 3000/9000: Merchant, 28nm
Broadcom Tomahawk
  3.2 Tbps I/O & 2.0 Tbps core
  Tomahawk supports 3200 Gbps when the average packet size is greater than 250 bytes. When all ports are receiving 64-byte packets, throughput is 2000 Gbps
  32 x 100GE
  Standalone leaf and spine
  VXLAN bridging
Broadcom Trident 2
  1.28 Tbps I/O & 0.96 Tbps core (192B pkt)
  32 x 40GE (line rate for 24 x 40G)
  Standalone leaf and spine
  VXLAN bridging & routing (without recirculation)

Cisco Nexus 3000/9000 ASIC Mapping
ALE (ACI Leaf Engine)
  Fixed platform: GEM Module (ACI Leaf/NX-OS) N9K-M12PQ, N9K-M6PQ
  Modular platform: (NX-OS) N9K-X9564PX, N9K-X9564TX, N9K-X9536PQ
ALE2
  Fixed platform: (ACI Leaf/NX-OS) N9K-C9372PX, N9K-C9372TX, N9K-C93120TX, N9K-C9332PQ
  Modular platform: NA
ALE2
  Fixed platform: (ACI Leaf/NX-OS) N9K-C9372PX-E, N9K-C9372TX-E, GEM: N9K-M6PQ-E
  Modular platform: NA
ASE (ACI Spine Engine)
  Fixed platform: (ACI Spine) N9K-C9336PQ
  Modular platform: (ACI Spine) N9K-X9736PQ
ASE2
  Fixed platform: (NX-OS) N9K-C9236C, N9K-C92304QC, N9K-C9272Q
  Modular platform: (ACI Spine/NX-OS) N9K-C9504-FM-E, N9K-C9508-FM-E
ASE3
  Fixed platform: (NX-OS) N9K-C92160YC-X
  Modular platform: NA
LSE (Leaf Spine Engine)
  Fixed platform: (ACI Leaf/NX-OS) N9K-C93180YC-EX, N9K-C93108TC-EX; (ACI Spine) N9K-C9372C-EX
  Modular platform: (ACI Spine/NX-OS) N9K-X9736C-EX
NFE (Trident T2)
  Fixed platform: (ACI Leaf/NX-OS) N9K-C9372PX(E), N9K-C9372TX(E), N9K-C93120TX, N9K-C9332PQ, N9K-C9396PX, N9K-C9396TX, N9K-C93128TX; GEM Module (NX-OS) N9K-M4PC-CFP2
  Modular platform: (NX-OS) N9K-X9564PX, N9K-X9564TX, N9K-X9536PQ, N9K-X9464PX, N9K-X9464TX, N9K-X9432PQ, N9K-X9636PQ
NFE2 (Tomahawk)
  Modular platform: (NX-OS) N9K-X9432C-S


Hyper-Converged Fabrics
Containers and scale-out storage mixed with existing VM and bare metal
  Distributed IP storage for cloud apps and traditional storage (iSCSI/NAS, FC) for existing apps
  Distributed apps via container-based micro-services
  Inter-process communication across the fabric
Data Centre Implications
  Order(s) of magnitude increase in density of endpoints
  Increased I/O traffic drives larger bandwidth
  Mix of traffic types drives need for better queueing (not buffering)
  Security density is increasing as well

Hypervisors vs. Linux Containers
Containers share the OS kernel of the host and thus are lightweight. However, each container must have the same OS kernel.
Containers are isolated, but share the OS and, where appropriate, libs / bins.
[Figure: stack diagrams for a Type 1 hypervisor, a Type 2 hypervisor and Linux Containers (LXC), showing apps, bins / libs, guest operating systems, hypervisor, host operating system and hardware, alongside an “Adoption / plan to use” chart. Source: RightScale 2015 State of the Cloud Report]

Ex.: Applications & Software Development
Monolithic Apps versus Cloud-Native Apps with Distributed Data
[Figure: core enterprise workloads (CRM, email) compared with cloud-native at scale (gaming, mobile, IoT, eCommerce); labels include IaaS (if any), PaaS, ACI / many applications, many servers, high CPU, high storage or modular servers]

Bare Metal, Hypervisors, Containers & Unikernels: Changes in End Point Density
Growth in the number of endpoints: from servers as endpoints (multi-application bare metal) to micro-servers (processes) as endpoints
Unikernels are also known as “virtual library operating systems”

Why does the increase in endpoints matter?
Scale and flexibility of forwarding tables will be stressed.

IPv4/v6 Unicast FIB
VPN / Prefix / Mask / Paths / Offset
1 / 10.1.2.0 / 24 / 4 / 1
1 / 10.1.3.0 / 24 / 1 / 5
3 / 10.1.2.0 / 24 / 2 / 6
3 / 10.1.3.0 / 24 / 2 / 6

Hashing selects one of the paths, starting at the offset into the Path Table.

Path Table (Rewrite Information)
Path 1: ADJ 1 - Rewrite SRC A DST A MAC
Path 2: ADJ 2 - Rewrite SRC A DST B MAC
Path 3: ADJ 3 - Rewrite SRC A DST C MAC
Path 4: ADJ 4 - Rewrite SRC A DST D MAC
Path 1: ADJ 5 - Rewrite SRC A DST E MAC
Path 1: ADJ 6 - Rewrite SRC A DST F MAC
Path 2: ADJ 7 - Rewrite SRC A DST G MAC
ADJ 8 - Rewrite SRC A DST H MAC
ADJ 9 - Rewrite SRC A DST I MAC
ADJ 10 - Rewrite SRC A DST J MAC
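The FIB and path table on this slide can be read as a two-step lookup: a longest-prefix match returns a path count and an offset, and a flow hash picks one adjacency in that range. The sketch below models this with the slide's rows; the use of CRC32 as the hash and the function name `lookup` are assumptions for illustration, not the hardware's hash.

```python
import ipaddress
import zlib

# FIB rows from the slide: (VPN, prefix) -> (number of paths, offset into path table)
FIB = {
    (1, ipaddress.ip_network("10.1.2.0/24")): (4, 1),
    (1, ipaddress.ip_network("10.1.3.0/24")): (1, 5),
    (3, ipaddress.ip_network("10.1.2.0/24")): (2, 6),
    (3, ipaddress.ip_network("10.1.3.0/24")): (2, 6),
}

# Path table: index -> adjacency (ADJ 1 to ADJ 10 on the slide)
PATH_TABLE = {i: f"ADJ {i}" for i in range(1, 11)}


def lookup(vpn, dst_ip, flow_key):
    """Longest-prefix match in the VPN, then hash the flow onto one ECMP path."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [
        (net, entry)
        for (v, net), entry in FIB.items()
        if v == vpn and dst in net
    ]
    if not matches:
        return None
    _, (paths, offset) = max(matches, key=lambda m: m[0].prefixlen)
    # Assumption: CRC32 stands in for the ASIC's flow hash.
    index = offset + zlib.crc32(flow_key.encode()) % paths
    return PATH_TABLE[index]


print(lookup(1, "10.1.3.7", "flow-a"))  # ADJ 5 (single path)
print(lookup(1, "10.1.2.9", "flow-a"))  # one of ADJ 1 to ADJ 4
```

More endpoints mean more host routes, more ECMP groups and more adjacencies, so every table in this chain has to scale together.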

NFE (Trident 2) Unified Forwarding Table
NFE has a 16K traditional LPM TCAM table.
Additionally NFE has the following Unified Forwarding Table for ALPM (Algorithm LPM) Mode.
NFE has a dedicated adjacency table (48K).

Dedicated L2 MAC entries: 32k x 105 bits
  bank-0: 4k x 420 bits
  bank-1: 4k x 420 bits
Shared entries: 256k x 105 bits
  bank-2: 16k x 420 bits
  bank-3: 16k x 420 bits
  bank-4: 16k x 420 bits
  bank-5: 16k x 420 bits
Dedicated L3 host entries: 16k x 105 bits
  bank-6: 1k x 420 bits
  bank-7: 1k x 420 bits
  bank-8: 1k x 420 bits
  bank-9: 1k x 420 bits

SUPPORTED COMBINATIONS
Mode   L2     L3 Hosts   LPM
0      288K   16K        0
1      224K   56K        0
2      160K   88K        0
3      96K    120K       0
4      32K    16K        128K
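The bank sizes and the entry counts on this slide are consistent with each 420-bit row holding four 105-bit entries. The short check below works through that arithmetic; the dictionary layout and names are only for illustration.

```python
ENTRY_BITS = 105
ROW_BITS = 420
ENTRIES_PER_ROW = ROW_BITS // ENTRY_BITS  # 4 entries fit in one 420-bit row

# (number of banks, rows per bank) for each region of the table
BANKS = {
    "dedicated_l2": (2, 4 * 1024),    # bank-0, bank-1
    "shared": (4, 16 * 1024),         # bank-2 to bank-5
    "dedicated_l3": (4, 1 * 1024),    # bank-6 to bank-9
}

# Capacity of each region in K (1024) entries of 105 bits
capacity_k = {
    name: count * rows * ENTRIES_PER_ROW // 1024
    for name, (count, rows) in BANKS.items()
}

print(capacity_k)
# {'dedicated_l2': 32, 'shared': 256, 'dedicated_l3': 16}

# Mode 0 gives all shared banks to L2: 32K dedicated + 256K shared = 288K
mode0_l2_k = capacity_k["dedicated_l2"] + capacity_k["shared"]
print(mode0_l2_k)  # 288
```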

ASE2, ASE3 & LSE: Tile Based Forwarding Tables
[Figure: a grid of tiles, each 8K x 104 bits, with an initial lookup for LPM, a chained lookup for ECMP entries for the route, and chained lookups for adjacency & MAC]
Improve flexibility by breaking the lookup table into small re-usable portions, “tiles”
Chain lookups through the “tiles” allocated to the specific forwarding entry type
  IP LPM, IP Host, ECMP, Adjacency, MAC, Multicast, Policy Entry
  e.g. Network Prefix chained to ECMP lookup chained to Adjacency chained to MAC
Re-allocation of forwarding table allows maximised utilisation for each node in the network
Templates will be supported initially
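The idea of tiles plus chained lookups can be sketched as follows. This is a toy model under stated assumptions: the total of 16 tiles, the tile counts in the template, and all table contents and names are hypothetical; only the 8K entries per tile and the prefix to ECMP to adjacency to MAC chain come from the slide.

```python
TILE_ENTRIES = 8 * 1024  # each tile holds 8K entries (104 bits wide on the slide)


def allocate(template, total_tiles=16):
    """Turn a template (entry type -> tile count) into per-type entry capacity."""
    if sum(template.values()) > total_tiles:
        raise ValueError("template uses more tiles than are available")
    return {kind: tiles * TILE_ENTRIES for kind, tiles in template.items()}


# Hypothetical template: tile counts are illustrative, not a shipping profile.
template = {"lpm": 4, "host": 4, "ecmp": 2, "adjacency": 2, "mac": 4}
capacity = allocate(template)

# Chained lookup: the result of each table is the key into the next one.
tables = {
    "lpm": {"10.1.2.0/24": "ecmp-group-7"},
    "ecmp": {"ecmp-group-7": ["adj-1", "adj-2"]},
    "adjacency": {"adj-1": "mac-a", "adj-2": "mac-b"},
    "mac": {"mac-a": "Eth1/1", "mac-b": "Eth1/2"},
}


def chained_lookup(prefix, flow_hash):
    """Network prefix -> ECMP group -> adjacency -> MAC -> egress port."""
    group = tables["lpm"][prefix]
    members = tables["ecmp"][group]
    adjacency = members[flow_hash % len(members)]
    return tables["mac"][tables["adjacency"][adjacency]]


print(capacity["lpm"])                     # 32768
print(chained_lookup("10.1.2.0/24", 0))    # Eth1/1
print(chained_lookup("10.1.2.0/24", 1))    # Eth1/2
```

Changing the template moves tiles between entry types, so the same silicon can serve as a host-heavy leaf or an LPM-heavy aggregation node.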

Example Templates – Nexus 9200 (ASE-2)
Host Heavy Template, e.g. aggregation for smaller ...
[Figure: tile allocation for the template; sizes shown include 16K, 8K, 32K, 8K, 8K, 8K]

Example Templates – Nexus 9200 (ASE-2)
Balanced Host and Routes, e.g. aggregation for classical L2/L3
[Figure: tile allocation for the template; sizes shown include 16K, 8K, 8K, 32K, 8K, 8K]

N9200 (ASE-2) Initial TCAM Templates (FCS)
Forwarding Table     Host / MAC balanced   Host Heavy   LPM Heavy
IPv4 Prefix (LPM)    16K                   16K          256K
IPv6 /64 Prefix      16K                   16K          256K
IPv6 /128 Prefix     8K                    8K           128K
IPv4 host routes     112K                  256K         32K
IPv6 host routes     48K                   192K         16K
MAC                  96K                   16K          16K

ASE2, ASE3 & LSE Optimisation
Different Lookup Table Layouts and ...
Column labels on the slide: ASE2 (LPM), ASE3 (Host), N9300EX/X9700EX, 1.28T / 1 slice (N3100), 3.2T / 4 slices (N3200), On Chip
Values per row, in slide order:
IPv4 Prefix (LPM):        256K*, 256K*, 750K*, 192K*, 128K*, 192K
IPv6 /64 Prefix (LPM):    256K*, 256K*, 750K*, 84K*, 84K*, 64K
IPv6 /128 Prefix (LPM):   128K*, 128K*, 384K*, 20K*, 20K*, 64K
IPv4 host routes:         256K*, 256K*, 750K*, 120K*, 104K*, 750K
IPv6 host routes:         ..., 288K*, 136K*, 750K

Hyper-Converged Fabrics Introduce the Same Scaling Problem for Segmentation and S...
Level of Segmentation/Isolation/Visibility, in increasing order:
  Basic DC network segmentation (network-centric segmentation)
  Segment by application lifecycle
  Per application-tier / service level micro-segmentation
  Intra-EPG container security micro-segmentation
[Figure: example segments labelled WEB, APP, DB, PROD, VM, VLAN 1, VXLAN 2, VLAN 3 and OVS/OpFlex]

Fabric Wide Segmentation: Multi-tenancy at Scale
Macro Segmentation: 2K VRF + 6K TCAM
Macro Segmentation at Scale: 16K VRF per switch (ASE-2, ASE-3, LSE)
Micro-Segmentation at Scale: 140K security policies per switch (LSE)
[Figure: progression from basic segmentation by subnet through application lifecycle segmentation to micro-segmentation at scale]

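Consistent Campus Security – DC Policy Enforcement
[Figure: the ISE policy domain and the APIC policy domain joined by policy federation. TrustSec traffic (CMD/SGT, SXP, IPSec) reaches a border router (ASR1K, ASR9K), which connects to the ACI spine using GBP VXLAN (GBP ID = SGT), with SGT to EPG mapping]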

Real-time Flow Sensors (ASE-3 & LSE)
Hardware Sensors in ASICs
  Hash table to store flows (5-tuple) and the related stats
  Flow table is exported from the ASIC periodically through UDP tunnels
  Capable of capturing all flows; selective capture possible
  Stats have two modes:
    Concise (64K flows/slice): byte and packet count, start and end timestamps
    Detailed (32K flows/slice): concise and analytics information
Sensor Data (Examples)
  Capture predefined anomalies:
    TTL changed
    Anomalous TCP flags seen (xmas flags, syn & rst, syn & fin, ...)
    Header field inconsistencies
    Anomalous seq/ack numbers
  Capture standard errors: IP length error, tiny fragment etc.
  Measure burstiness in the flow:
    Capture the size and time window when the max burst was seen in the flow
    Correlate burst arrival time across multiple flows in s/w to infer microburst
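The concise mode described above (a hash table keyed by 5-tuple that holds byte and packet counts plus start and end timestamps, exported periodically) can be modelled in a few lines. This is a software sketch only; the class and method names are hypothetical and the behaviour when the table is full is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FlowStats:
    packets: int = 0
    bytes: int = 0
    first_seen: float = 0.0
    last_seen: float = 0.0


class FlowTable:
    """Concise-mode flow table: 5-tuple -> counters and timestamps."""

    def __init__(self, max_flows=64 * 1024):
        self.max_flows = max_flows
        self.flows = {}

    def record(self, five_tuple, length, timestamp):
        stats = self.flows.get(five_tuple)
        if stats is None:
            if len(self.flows) >= self.max_flows:
                return False  # assumption: new flows are skipped when full
            stats = self.flows[five_tuple] = FlowStats(first_seen=timestamp)
        stats.packets += 1
        stats.bytes += length
        stats.last_seen = timestamp
        return True

    def export(self):
        """Return all records and clear the table, like a periodic export."""
        records, self.flows = self.flows, {}
        return records


table = FlowTable()
flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 443)  # src, dst, protocol, sport, dport
table.record(flow, 1500, timestamp=1.0)
table.record(flow, 64, timestamp=1.5)
exported = table.export()
print(exported[flow])
# FlowStats(packets=2, bytes=1564, first_seen=1.0, last_seen=1.5)
```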

Fabric Wide Troubleshooting: Real Time Monitoring, Debugging and Analysis
Granular fabric-wide flow monitoring delivering diagnostic correlation
  Monitor: track latency (avg/min/max), buffer utilisation, network events
  Debug: understand ‘what’ and ‘where’ for drops and determine application impact
  Analyse: specific events and suggest potential solution (e.g. trigger automatic rollback)
[Figure: flow monitoring distinguishing application traffic operating within SLA from impacted application traffic]

Pervasive NetFlow at Scale: ‘If you can’t see it, you can’t secure it’
Customer Asks
  Top talker analysis
  Business critical vs. best effort
  Security telemetry
  Fabric-wide troubleshooting
  On demand & full history
  Capacity planning
  Hotspot detection, trending
Cisco Solution
  Collect all data everywhere in the network: every packet, every flow, every switch
  Protects customers’ NetFlow investment
  10/25/40/100G line rate
Industry First: Built-In NetFlow Capability across Leaf & Spine

VXLAN & Fabric Design Requirements: Host-based Forwarding
VXLAN overlay: EVPN MP-BGP or ACI
Spine: no VTEP required
Collapsed border spine: VTEP required
Border leaf: VXLAN to VXLAN
Multi-protocol border leaf: VXLAN, MPLS, dot1q
Leaf: VTEP with anycast gateway, VXLAN to VLAN
[Figure: spine and leaf fabric with a VTEP on each leaf and border node]

VXLAN Support: Gateway, Bridging, Routing*
VXLAN to VLAN Bridging (L2 Gateway)
  Ingress VXLAN packet on the Orange segment (VXLAN ORANGE)
  VXLAN L2 gateway: egress interface chosen (bridge may .1Q tag the packet) towards VLAN ORANGE
VXLAN to VLAN Routing (L3 Gateway)
  Ingress VXLAN packet on the Orange segment (VXLAN ORANGE)
  VXLAN router: egress is a tagged interface. Packet is routed to the new VLAN (VLAN BLUE)
VXLAN to VXLAN Routing (L3 Gateway)
  Ingress VXLAN packet on the Orange segment (VXLAN ORANGE)
  VXLAN router: destination is in another segment. Packet is routed to the new segment (VXLAN BLUE)

VxLAN Routing – Trident 2
VxLAN routing (forwarding into, out of or between overlays) is not supported in the native pipeline on Trident 2.
During phase 1 of the pipeline lookup, the lookup leverages the same station table to identify whether the packet is destined to the default GW MAC (switch MAC) or is an encapsulated packet with the local TEP as the terminating tunnel (an either/or operation).
If the packet is encapsulated and the tunnel terminates on the switch, then in the phase 2 portion of the lookup the inner packet header cannot be resolved via the FIB but only via the L2 station table (a limitation of the T2 implementation).
The inner packet cannot be routed after de-encapsulation. A similar pipeline limitation prevents a packet that is routed from then being encapsulated and having that encapsulated packet ...
[Figure: VXLAN packet layout (Outer UDP, VXLAN, Inner Ethernet, Inner IP, Payload, FCS / New FCS). The initial pipeline can only resolve whether this packet is destined to the default GW MAC or destined to this tunnel endpoint. The second phase lookup can operate only against the L2 station table if the tunnel terminates on this switch]

VxLAN to VLAN Routing – Trident 2
VxLAN routed mode via loopback is possible: the packet is de-encapsulated and forwarded out through a loopback (either a Tx/Rx loopback or via an external component); on the second pass the match for the ‘my router’ MAC results in an L3 lookup and subsequent forwarding via the L2 VLAN.
[Figure: first pass through the Trident BCM: match against this TEP and de-encapsulate the VxLAN packet (Inner IP, Payload, New FCS). Second pass: perform a FIB lookup when DMAC = this router, then forward to the VLAN subnet 10.20.20.0/24]

VxLAN Routing – Trident 2: Considerations
Leveraging loopback through Trident 2 will consume twice the I/O and core BW, as packets are forwarded through the ASIC twice.
VXLAN to VXLAN routing will consume 3x the I/O and core BW.
Need to understand the ratio of I/O to lookups in cases where recirculation is required.
[Figure: 10G of I/O passing through encap/decap, recirculation and route-packet stages; recirculation consumes I/O capacity equivalent to the initial packet (2x I/O BW is required)]
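The cost of recirculation is simple arithmetic: each extra pass through the ASIC consumes the same I/O again. The sketch below estimates the I/O needed for a traffic mix and the usable throughput if everything is recirculated. The 1280 Gbps figure is the Trident 2 I/O capacity quoted earlier in the deck; the function names and the traffic-mix model are assumptions.

```python
def io_required_gbps(offered_gbps, fraction_recirculated, passes):
    """I/O consumed when a fraction of traffic makes `passes` trips through the ASIC."""
    single_pass = offered_gbps * (1 - fraction_recirculated)
    multi_pass = offered_gbps * fraction_recirculated * passes
    return single_pass + multi_pass


def usable_throughput_gbps(io_capacity_gbps, passes):
    """Throughput left if every packet makes `passes` trips through the ASIC."""
    return io_capacity_gbps / passes


TRIDENT2_IO_GBPS = 1280  # 1.28 Tbps I/O

# 10G of VXLAN to VLAN routed traffic needs 20G of I/O (2 passes).
print(io_required_gbps(10, 1.0, passes=2))  # 20.0
# 10G of VXLAN to VXLAN routed traffic needs 30G of I/O (3 passes).
print(io_required_gbps(10, 1.0, passes=3))  # 30.0
# All traffic VXLAN to VLAN routed: half the I/O capacity remains usable.
print(usable_throughput_gbps(TRIDENT2_IO_GBPS, passes=2))  # 640.0
```

In practice only part of the traffic is routed between segments, so the ratio of bridged to routed traffic decides how much headroom recirculation actually costs.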
