Intel Optane Data Center Persistent Memory

Transcription

Intel Optane Data Center Persistent Memory
Architecture (Jane) and Performance (Lily)
Presenters: Lily Looi, Jianping Jane Xu
Co-Authors: Asher Altman, Mohamed Arafa, Kaushik Balasubramanian, Kai Cheng, Prashant Damle, Sham Datta, Chet Douglas, Kenneth Gibson, Benjamin Graniello, John Grooms, Naga Gurumoorthy, Ivan Cuevas Escareno, Tiffany Kasanicky, Kunal Khochare, Zhiming Li, Sreenivas Mandava, Rick Mangold, Sai Muralidhara, Shamima Najnin, Bill Nale, Jay Pickett, Shekoufeh Qawami, Tuan Quach, Bruce Querbach, Camille Raad, Andy Rudoff, Ryan Saffores, Ian Steiner, Muthukumar Swaminathan, Shachi Thakkar, Vish Viswanathan, Dennis Wu, Cheng Xu
08/19/2019

Notices & Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks. Configurations on slides 18 and 20.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.
Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
Performance results are based on testing as of Feb. 22, 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely secure.
Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
*Other names and brands may be claimed as property of others.
Intel, the Intel logo, Xeon, the Xeon logo, Optane, and the Optane logo are trademarks of Intel Corporation in the United States and other countries. Intel Corporation.

1. Intel Optane DC Persistent Memory Architecture
A Breakthrough with a New Interface Protocol, Memory Controller, Media, and Software Stack

Memory-Storage Gap
Data access frequency: hot data is accessed more often, cooler data less often.
- CPU: Core, L1, L2, LLC (pico-secs to nano-secs)
- Memory sub-system, HOT TIER: DRAM, 10s GB, 100 nanosecs
- Memory-storage gap: ?
- Storage, WARM TIER: SSD (Intel 3D NAND SSD), 10s TB, 100 microsecs
- Storage, COLD TIER: HDD / tape, network storage, 10s TB, 100 millisecs

Close Memory – Storage Gap
Goals:
- Optimize performance given cost and power budget
- Move data closer to compute
- Maintain persistency
Data access frequency: hot data is accessed more often, cooler data less often.
- CPU: Core, L1, L2, LLC (pico-secs to nano-secs)
- Memory sub-system, HOT TIER: DRAM, 10s GB, 100 nanosecs
- Tiers filling the gap: 100s GB at 1 microsec; 1s TB at 10 microsecs
- Storage, WARM TIER: SSD (Intel 3D NAND SSD), 10s TB, 100 microsecs
- Storage, COLD TIER: HDD / tape, network storage, 10s TB, 100 millisecs

Intel Optane Media Technology
Cross-Point Structure: selectors allow dense packing and individual access to bits
- High resistivity: ‘0’
- Low resistivity: ‘1’
Breakthrough Material Advances: compatible switch and memory cell materials
Scalable: memory layers can be stacked in a 3D manner
High Performance: cell and array architecture that can switch fast
Attributes:
- Non-volatile
- Potentially fast write
- High density
- Non-destructive fast read
- Low voltage
- Integrate-able w/ logic
- Bit alterable
First Generation Capacities: 128 GB, 256 GB, 512 GB

Intel Optane DC Persistent Memory Module Architecture
[Block diagram: DDR4 connector, CMD and address bus, data bus with DQ buffers, Optane Controller, NVM bus to the Optane NVM media devices, DRAM, flash, PMIC and PMIC bus, power rails (DQ buffer and logic rail, NVM rails, memory controller rails), SMBus/I2C, SPD, flush interface]
1. DQ buffers present a single load to the host
2. Host SMBus: SPD visible to the CPU; the Optane Controller provides thermal sensing (TSOD) functionality
3. Address Indirection Table
4. Integrated PMIC controlled by the Optane Controller
5. On-DIMM firmware storage
6. On-DIMM power fail safe with auto-detection

Intel Optane DC Persistent Memory Controller Architecture
[Block diagram, from host to media: DDR4 slot on host CPU; interface to host CPU (DCPMM memory interface); address mapping logic with address mapping cache; DRNG, key management, encrypt/decrypt; scheduler with read queue and write queue; microcontroller (Uctrl); media management; power logic; caps for flushes; Optane media interface and media channel; Optane media devices]

Intel Optane DC Persistent Memory SW Enabling Stack
[Stack diagram]
USER SPACE:
- Management UI and Management Library
- Application using Standard Raw Device Access
- Application using Standard File API (file access)
- Application using Load/Store (memory access, “DAX”)
KERNEL SPACE:
- Generic NVDIMM Driver
- File System
- pmem-Aware File System
- MMU Mappings
Persistent Memory

Key Feature Deep Dive - ADR
Asynchronous DRAM Refresh (ADR) - Power Fail Protected Feature
[Diagram: platform power supply (PWROK), platform logic (ADR trigger, ADR GPIO), PCH (ADR timer, ADR SYNC, ADR Complete), CPU (cores, PCU, MC with WPQ), Intel Optane DC PMM (NVM)]
1. AC power loss de-asserts PWROK
2. Platform logic then asserts the ADR Trigger
3. PCH starts the ADR programmable timer
4. PCH asserts the SYNC message
5. PCU in processor detects SYNC message bit and sends AsyncSR to MC
6. MC flushes write pending queue (WPQ)
7. After ADR timer expires, PCH asserts ADR COMPLETE pin
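The seven ADR steps can be traced with a small toy model. This is only an illustrative sketch: the class name, field names, and event strings are hypothetical, and real ADR is implemented in platform hardware, not software. It shows the one property the sequence is meant to guarantee, namely that writes already accepted into the WPQ reach the persistent media before ADR completes.

```python
from dataclasses import dataclass, field


@dataclass
class AdrPlatform:
    """Toy model of the ADR power-fail sequence (all names are illustrative)."""
    wpq: list = field(default_factory=list)         # writes pending in the memory controller
    persistent: list = field(default_factory=list)  # writes that reached the DIMM
    events: list = field(default_factory=list)      # ordered trace of the sequence

    def power_fail(self):
        self.events.append("PWROK de-asserted")         # step 1
        self.events.append("ADR trigger asserted")      # step 2
        self.events.append("PCH ADR timer started")     # step 3
        self.events.append("PCH asserts SYNC message")  # step 4
        self.events.append("PCU sends AsyncSR to MC")   # step 5
        while self.wpq:                                 # step 6: drain the WPQ in order
            self.persistent.append(self.wpq.pop(0))
        self.events.append("MC flushed WPQ")
        self.events.append("ADR COMPLETE asserted")     # step 7
        return self.events


platform = AdrPlatform(wpq=["write A", "write B"])
trace = platform.power_fail()
for step, event in enumerate(trace, start=1):
    print(step, event)
```

Data still sitting in CPU caches is not part of this model, which matches the slide: the WPQ is the part of the path that ADR flushes.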

Memory Mode
- Large memory capacity
- No software/application changes required
- To mimic traditional memory, data is “volatile”
  - Volatile mode key cleared and regenerated every power cycle
- DRAM is ‘near memory’
  - Used as a write-back cache
  - Managed by host memory controller
  - Within the same host memory controller, not across
  - Ratio of far/near memory (PMEM/DRAM) can vary
- Overall latency
  - Same as DRAM for cache hit
  - DC persistent memory + DRAM for cache miss
[Diagram: cores with L1/L2/L3 caches issue a MOV through the memory controllers; DRAM is near memory, DC persistent memory is far memory]

App Direct Mode – Persistent Memory
- PMEM-aware software/application required
- Adds a new tier between DRAM and block storage (SSD/HDD)
- Industry open standard programming model and Intel PMDK
- In-place persistence
- No paging, context switching, interrupts, nor kernel code executes
- Byte addressable like memory
- Load/store access, no page caching
- Cache coherent
- Ability to do DMA & RDMA
SW makes sure that data is flushed to the durability domain using CLFLUSHOPT or CLWB.
Minimum required power fail protected domain: memory subsystem.
[Diagram: a MOV from the core passes through the CPU caches to the memory controller and its WPQ, then to DRAM memory or persistent memory]
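The load/store-then-flush pattern can be approximated in a few lines. This is a minimal sketch under stated assumptions: an ordinary temporary file stands in for a file on a DAX-mounted pmem-aware file system, the file name is hypothetical, and Python's `mmap.flush()` (an msync-style call) stands in for the CLFLUSHOPT/CLWB step, which Python cannot issue directly. PMDK-based code would typically use a library call for the flush instead.

```python
import mmap
import os
import tempfile

# Assumption: a temp file stands in for a file on a DAX-mounted pmem-aware file system.
path = os.path.join(tempfile.mkdtemp(), "pmem_demo.bin")
with open(path, "wb") as f:
    f.truncate(4096)  # size the region before mapping it

with open(path, "r+b") as f:
    region = mmap.mmap(f.fileno(), 4096)  # map the file into the address space
    region[0:5] = b"hello"                # plain store into the mapping, no write() call
    region.flush()                        # stand-in for flushing to the durability domain
    region.close()

# Re-open the file to confirm the stored bytes are there.
with open(path, "rb") as f:
    recovered = f.read(5)
print(recovered)
```

The point of the sketch is the ordering: the store alone is not treated as durable; the explicit flush is what the software relies on.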

2. PERFORMANCE
Intel Optane DC Persistent Memory for larger data, better performance/ , and new paradigms

Intel Optane DC Persistent Memory Latency
[Chart: read idle latency, lower is better. Devices compared: Intel DC P4610 NVMe SSD, Intel Optane DC SSD P4800X, Intel Optane DC persistent memory. Annotations: 1000x lower latency; smaller granularity (vs. 4K)]
Note: 4K granularity gives about the same performance as 256B.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

Intel Optane DC Persistent Memory Latency
Performance can vary based on:
- 64B random vs. 256B granularity
- Read/write mix
- Power level (programmable 12-18W, graph is 18W)
Read idle latency ranges from 180ns to 340ns (vs. DRAM 70ns).
[Chart: read idle latency for Intel DC P4610 NVMe SSD, Intel Optane DC SSD P4800X, and Intel Optane DC persistent memory at smaller granularity (vs. 4K). Series: Read (64B), Read (256B), Read/Write (64B), Read/Write (256B)]

Memory Mode Transaction Flow
- Cache hit: latency same as DRAM
- Cache miss: latency DRAM + Intel Optane DC persistent memory
- Performance varies by workload
  - Good locality means near-DRAM performance
- Best workloads have the following traits:
  - Good locality for high DRAM cache hit rate
  - Low memory bandwidth demand
- Other factors:
  - Number of reads and writes
  - Configuration vs. workload size
[Diagram: DRAM cache hit and DRAM cache miss paths through cores, CPU caches (L1, L2, L3), and memory controller; DRAM is near memory, persistent memory is far memory]
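The hit/miss rule can be turned into a rough expected-latency estimate. The sketch below is a simplified model and not Intel's methodology: it assumes the 70 ns DRAM figure and the 180 to 340 ns read idle latency range quoted on the latency slide, treats a miss as DRAM latency plus persistent memory latency, and ignores bandwidth and queuing effects. The function name and the chosen miss rates are illustrative.

```python
DRAM_NS = 70                   # DRAM latency quoted on the latency slide
PMEM_READ_IDLE_NS = (180, 340) # read idle latency range quoted on the latency slide


def memory_mode_latency_ns(miss_rate, pmem_ns, dram_ns=DRAM_NS):
    """Expected latency when hits cost DRAM and misses cost DRAM + persistent memory."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    hit_ns = dram_ns
    miss_ns = dram_ns + pmem_ns
    return (1.0 - miss_rate) * hit_ns + miss_rate * miss_ns


# Illustrative miss rates; estimates are (best case, worst case) in nanoseconds.
estimates = {
    miss: tuple(memory_mode_latency_ns(miss, pmem) for pmem in PMEM_READ_IDLE_NS)
    for miss in (0.0, 0.10, 0.25, 0.40)
}
for miss, (low, high) in estimates.items():
    print(f"miss rate {miss:.0%}: {low:.0f} to {high:.0f} ns")
```

Under these assumptions a workload with good locality stays close to DRAM latency, which is the slide's point about near-DRAM performance.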

Memory Mode Performance vs. Locality & Load
- Synthetic traffic generator represents different types of workloads
- Vary size of buffers to emulate more or less locality
- Very large data size (much larger than DRAM cache) causes higher miss rate
[Chart: MLC bandwidth with varying bandwidth demand (100% reads). Axes: bandwidth (GB/s) and % DRAM cache miss rate. Series: Light Demand (13 GB/s), Med (33 GB/s), Heavy Demand (max). Annotations: high BW delivered for high-demand workloads; medium/low-demand workloads still meet requirement; high demand with poor locality (45% marker)]
Slide 17

Memory Mode Performance/Load/Locality
System configuration:
- Platform: Neon city
- CPU: CLX-B0
- CPU per node: 28 core/socket, 1 socket, 2 threads per core
- Memory: 6x 16GB DDR, 6x 128GB AEP QS
- SUT OS: Fedora 4.20.6-200.fc29.x86_64
- BKC: WW08
- BIOS: PLYXCRB1.86B.0576.D20.1902150028 (mbf50656 0400001c)
- FW: 01.00.00.5355
- Security: Variants 1, 2, & 3 patched
- Test date: 4/5/2019
MLC parameters: --loaded_latency -d <varies> -t200
Buffer size (GB) per thread, 2-2-2 configuration:
Miss rate (%) | 0   | 10  | 25  | 40
R             | 0.1 | 1.0 | 4.5 | 9.0
W2            | 0.1 | 0.7 | 1.8 | 4.5

Enable More Redis VM Instances with Sub-ms SLA
[Chart: 1 VM per core and 2 VMs per core. Throughput: higher is better. Latency: lower is better (must be 1ms or less)]
1. One Redis Memtier instance per VM
2. Max throughput scenario, will scale better at lower operating point
VM size | DRAM baseline | MM capacity | VMs      | Summary
45GB    | 768GB         | 1TB         | 14 to 20 | 42% more VMs @ lower cost
90GB    | 768GB         | 1TB         | 7 to 10  | 42% more VMs @ lower cost
Throughput vs. DRAM: 147%, meets SLA; 111%, meets SLA
Slide 19
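The "42% more VMs" figure can be checked against the VM counts on the slide. This is plain arithmetic on the numbers given (14 to 20 and 7 to 10); the helper name is hypothetical.

```python
def percent_more(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


gain_45gb = percent_more(14, 20)  # VM counts for the 45GB VM size
gain_90gb = percent_more(7, 10)   # VM counts for the 90GB VM size
print(f"{gain_45gb:.1f}% and {gain_90gb:.1f}% more VMs")
```

Both cases come out slightly under 43%, consistent with the slide's figure if it is truncated to a whole percent.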

Redis Configuration
Field: Configuration 1 - 1LM | Configuration 2 – Memory Mode (2LM)
Test by: Intel | Intel
Test date: 02/22/2019 | 02/22/2019
Platform: Neoncity | Neoncity
# Nodes: 1 | 1
# Sockets: 2 | 2
CPU: Intel Xeon Platinum 8276, 165W | Intel Xeon Platinum 8276, 165W
Cores/socket, Threads/socket: 28/56 | 28/56
HT: On | On
BIOS version: PLYXCRB1.86B.0573.D10.1901300453 | PLYXCRB1.86B.0573.D10.1901300453
BKC version (e.g. ww47): WW06 | WW06
AEP FW version (e.g. 5336): 5346 (QS AEP) | 5346 (QS AEP)
System DDR Mem Config (slots / cap / run-speed): 12 slots / 32GB / 2666 | 12 slots / 16GB / 2666
System DCPMM Config (slots / cap / run-speed): 12 slots / 32GB / 2666 | 12 slots / 128GB, 256GB, 512GB / 2666
Total Memory/Node (DDR, DCPMM): 768, 0 | 192, 1TB, 1.5TB, 3TB
OS kernel: …4-200.fc29.x86_64 | 4.20.4-200.fc29.x86_64
AEP mode (e.g. MM, AD-volatile (replace DDR), or AD-persistent (replace NVMe)): 1LM | Memory Mode (2LM)
Workload & version: Redis 4.0.11 | Redis 4.0.11
Other SW (Frameworks, Topologies): memtier benchmark-1.2.12 (80/20 read/write); 1K record size | memtier benchmark-1.2.12 (80/20 read/write); 1K record size
VMs (Type, vcpu/VM, VM OS): KVM, 1/VM, centos-7.0 | KVM, 1/VM, centos-7.0

App Direct Mode Transaction Flow
Traditional read to page fault (disk):
1. Software
2. 4K transfer from disk
3. Request returned
App Direct accesses memory directly:
- Avoids software and 4K transfer overhead
- Cores can still access DRAM normally, even on same channel
[Diagram: two read paths through cores, CPU caches (L1, L2, L3), and memory controller. Traditional read to page fault: DRAM (near memory) backed by SSD. App Direct read: one cache line from far memory]

Redis Example (with Software Change)
- Reduce TCO by moving large portion of data from DRAM to Intel Optane DC persistent memory
- Optimize performance by using the values stored in persistent memory instead of creating a separate copy of the log in SSD (only pointer written to log)
- Direct access vs. disk protocol
[Diagram, original flow: write request, store key/value, append operation into AOF log file. Modified flow: write request, store key, store value in PMEM, append pointer into AOF log file; the AOF holds pointers]
Moving Value to App Direct reduces DRAM and optimizes logging by 2.27x (Open Source Redis Set, AOF always update, 1K datasize, 28 instances)
Slide 22
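The pointer-in-the-log idea can be illustrated with a toy store. This sketch is not the Redis implementation: the class, the record layout (key length, key, offset, length), and the use of a `bytearray` as a stand-in for an App Direct region are all assumptions made for illustration. It shows why the log shrinks: each SET appends a small fixed-size pointer record instead of a copy of the 1K value.

```python
import struct


class PointerAof:
    """Toy model: values live in a 'persistent' region, the log holds pointers only."""

    def __init__(self):
        self.pmem = bytearray()  # stand-in for an App Direct persistent region
        self.index = {}          # key -> (offset, length), kept in DRAM
        self.aof = bytearray()   # append-only log of pointer records

    def set(self, key, value):
        offset = len(self.pmem)
        self.pmem += value                        # store the value once, in place
        self.index[key] = (offset, len(value))    # store the key with its location
        encoded_key = key.encode()
        # Hypothetical record layout: key length, key bytes, value offset, value length.
        self.aof += struct.pack("<H", len(encoded_key)) + encoded_key
        self.aof += struct.pack("<QI", offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        return bytes(self.pmem[offset:offset + length])


store = PointerAof()
store.set("user:1", b"x" * 1024)  # 1K value, as in the slide's test setup
pointer_log_bytes = len(store.aof)
print(pointer_log_bytes, "bytes appended to the log for a 1024-byte value")
```

The 2.27x figure on the slide comes from Intel's measurement, not from this model; the sketch only shows the mechanism.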

Spark SQL OAP cache
- Intel Optane DC persistent memory as cache
- More affordable than similar capacity DRAM
- Significantly lower overhead for I/O intensive workloads
Configuration               | Query time
768GB DRAM                  | 1417s
192GB DRAM, 1TB App Direct  | 171s
8X improvement in Apache Spark* SQL IO intensive queries for analytics (3TB scale factor)
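The 8X claim follows from the two query times on the slide. A short check, using only those two numbers (variable names are illustrative):

```python
dram_only_seconds = 1417   # 768GB DRAM configuration
app_direct_seconds = 171   # 192GB DRAM with 1TB App Direct configuration

speedup = dram_only_seconds / app_direct_seconds
print(f"speedup: {speedup:.2f}x")
```

The ratio is a little above 8, which the slide rounds to 8X.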

Summary
Intel Optane DC Persistent Memory closes the DDR memory and storage gap
- Architected for persistence
- Provided large capacity scales workloads to new heights
- Offered a new way to manage data flows with unprecedented integration into system and platform
- Optimized for performance and orders of magnitude faster than NAND
- Memory mode for large affordable volatile memory
- App Direct mode for persistent memory
