AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION

Transcription

AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION
PRESENTED BY KENNETH MITCHELL
LET’S BUILD 2020
MAY 15, 2020

ABSTRACT
Join AMD Game Engineering team members for an introduction to the AMD Ryzen family of processors followed by advanced optimization topics. Learn about the high-performance AMD "Zen 2" microarchitecture and profiling tools. Gain insight into code optimization opportunities and lessons learned. Examples may include C/C++, assembly, and hardware performance-monitoring counters.
AMD PUBLIC LET’S BUILD 2020 AMD RYZEN PROCESSOR SOFTWARE OPTIMIZATION MAY 15, 2020

SPEAKER BIOGRAPHY
Ken Mitchell is a Principal Member of Technical Staff in the Radeon Technologies Group/AMD ISV Game Engineering team where he focuses on helping game developers utilize AMD processors efficiently. His previous work includes automating and analyzing PC applications for performance projections of future AMD products as well as developing benchmarks. Ken studied computer science at the University of Texas at Austin.

AGENDA
Success Stories
“Zen 2” Architecture Processors
AMD uProf Profiler
Optimizations and Lessons Learned
Contacts

SUCCESS STORIES

BORDERLANDS 3: DirectX 12
GEARS 5: DirectX 12
WORLD WAR Z: Vulkan

“ZEN 2” ARCHITECTURE PROCESSORS

“ZEN 2” PRODUCT EXAMPLES
NOTEBOOK: “Renoir”, AMD Ryzen 7 4800U 8-Core Processor
DESKTOP: “Matisse”, AMD Ryzen 9 3950X 16-Core Processor
HIGH END DESKTOP: “Castle Peak”, AMD Ryzen Threadripper 3990X 64-Core Processor

MICROARCHITECTURE

ADVANCES IN “ZEN 2” MICROARCHITECTURE
15% IPC improvement from “Zen” to “Zen 2”
2x op cache capacity
Reoptimized L1I cache
3rd address generation unit
2x FP data path width
2x L3 capacity
Improved branch prediction accuracy
Hardware optimized security mitigations
Secure Virtualization with Guest Mode Execute Trap (GMET)
Improved SMT fairness for ALU and AGU schedulers
Improved Write Combining Buffer
[Block diagram labels: Branch Prediction; 32K I-cache, 8 way; Op Cache; Decode, 4 instructions/cycle; Micro-op Queue; 8 ops/cycle; 6 ops dispatched; Integer Rename; Schedulers; Integer Physical Register File; 4 ALU, 3 AGU; Floating Point Rename; Scheduler; FP Register File; MUL, ADD, MUL, ADD; Load/Store Queues, 2 loads + 1 store per cycle; 32K D-cache, 8 way; 512K L2 (I+D) cache, 8 way.]

DATA FLOW

“RENOIR” 8 CORE PROCESSOR
[Data flow diagram labels: 32B fetch, 2*32B load, 1*32B store; 512K L2 I+D cache, 8-way; 4M L3 I+D cache; 32B/cycle links; 64B/cycle link; clock domains l3clk, fclk, uclk, lclk, memclk; GFX9; Media; IO Hub Controller; 16B/cycle DRAM channel.]
AMD Ryzen 7 4800U, 15W TDP, 8 Cores, 16 Threads, 4.2 GHz max boost clock, 1.8 GHz base clock, integrated GPU.
* Monolithic die. Each 4M L3 cache has its own 32B/cycle link to the data fabric. 64b DDR4 channel shown.

“MATISSE” 16 CORES PROCESSOR
[Data flow diagram labels: CCD, CCD, IOD; 32B fetch, 2*32B load, 1*32B store; 512K L2 I+D cache, 8-way; 16M L3 I+D cache, 16-way; 32B/cycle links; 32B/cycle read and 16B/cycle write; Data Fabric; Unified Memory Controller; 64B/cycle link; clock domains l3clk, fclk, uclk, lclk, memclk; IO Hub Controller; 16B/cycle DRAM channel.]
AMD Ryzen 9 3950X, 105W TDP, 16 Cores, 32 Threads, 4.7 GHz max boost clock, 3.5 GHz base clock.
* Two Core Complex Die (CCD). Each CCD has two 16M L3 Cache Complexes.
* The L3 Cache Complexes within a CCD share a single link to the Data Fabric.

“CASTLE PEAK” 64 CORE PROCESSOR
[Data flow diagram labels: CCDs 0 to 7 around the IOD; 32B fetch, 2*32B load, 1*32B store; 512K L2 I+D cache, 8-way; 16M L3 I+D cache, 16-way; 32B/cycle links; 32B/cycle read and 16B/cycle write; Data Fabric Quadrant; Unified Memory Controller; 64B/cycle link; clock domains l3clk, fclk, uclk, lclk, memclk; IO Hub Controller; 16B/cycle DRAM channel.]
AMD Ryzen Threadripper 3990X, 280W TDP, 64 Cores, 128 Threads, 4.3 GHz max boost clock, 2.9 GHz base clock.
* Two CCDs per Data Fabric Quadrant.
* Two Data Fabric Quadrants have Unified Memory Controllers and two have IO Hubs.

INSTRUCTION SET

YEAR  FAMILY  PRODUCT FAMILY                    ARCHITECTURE     EXAMPLE PRODUCT
2019  17h     “Matisse”                         “Zen 2”          Ryzen 9 3950X
2017  17h     “Summit Ridge”, “Pinnacle Ridge”  “Zen”, “Zen+”    Ryzen 7 2700X
2015  15h     “Carrizo”, “Bristol Ridge”        “Excavator”      A12-9800
2014  15h     “Kaveri”, “Godavari”              “Steamroller”    A10-7890K
2012  15h     “Vishera”                         “Piledriver”     FX-8370
2011  15h     “Zambezi”                         “Bulldozer”      FX-8150
2013  16h     “Kabini”                          “Jaguar”         A6-1450
2011  14h     “Ontario”                         “Bobcat”         E-450
[Table residue: the per-product instruction set support matrix is not recoverable from the transcription.]
“Zen 2” added CLWB and the AMD vendor specific instruction WBNOINVD.

SOFTWARE PREFETCH LEVEL INSTRUCTIONS
[Diagram labels: Prefetch T0, T1, T2, NTA; L1; L2; L3; memory; "Aggressively Evict" on PrefetchNTA.]
Loads a cache line from the specified memory address into the data cache level specified by the locality reference T0, T1, T2, or NTA.
If a memory fault is detected, a bus cycle is not initiated and the instruction is treated as a NOP.
Prefetch levels T0/T1/T2 are treated identically in “Zen” and “Zen 2” microarchitectures.
The non-temporal cache fill hint, indicated with PREFETCHNTA, reduces cache pollution for data that will only be used once. It is not suitable for cache blocking of small data sets. Lines filled into the L2 cache with PREFETCHNTA are marked for quicker eviction from the L2 and, when evicted from the L2, are not inserted into the L3.
The operation of this instruction is implementation-dependent. Prefetch fill and evict policies may differ for other processor vendors or microarchitecture generations.
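As a rough illustration of the non-temporal hint described on this slide, the C++ sketch below reads an array once and issues PREFETCHNTA a fixed distance ahead of the current element. The function name sum_streaming and the distance of 8 cache lines are assumptions made for illustration, not values from the presentation; a suitable distance should be found by profiling.

```cpp
#include <cstddef>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
#define PREFETCH_NTA(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)0)  // no-op on targets without the intrinsic
#endif

// Hypothetical helper: sums data that is read exactly once.
inline float sum_streaming(const std::vector<float>& v) {
    constexpr std::size_t kFloatsPerLine = 64 / sizeof(float);  // 64-byte cache line
    constexpr std::size_t kDistance = 8 * kFloatsPerLine;       // assumed prefetch distance
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) {
        // One prefetch per cache line, never past the end of the array.
        if (i % kFloatsPerLine == 0 && i + kDistance < v.size()) {
            PREFETCH_NTA(&v[i + kDistance]);
        }
        sum += v[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the cache behavior may change.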

“MATISSE” CACHE AND MEMORY

CACHE LATENCY
Level  Count/CCD  Capacity  Sets   Ways  Line Size  Latency
uop    8          4 K uops  64     8     8 uops     NA
L1I    8          32 KB     64     8     64 B       4 clocks
L1D    8          32 KB     64     8     64 B       4 clocks
L2U    8          512 KB    1024   8     64 B       12 clocks
L3U    2          16 MB     16384  16    64 B       39 clocks
Cache line size is 64 Bytes. 2 CPU clock cycles to move a single cache line.
L2 is inclusive of L1. Lines filled into L1 are also filled into L2.
L3 is filled from L2 victims of all 4 cores within its CCX.
L2 tags are duplicated in its L3 for fast cache transfers within a CCX.
L2 capacity evictions may cause L3 capacity evictions.
“Matisse” products may have 1 or 2 CCDs.
Each Core Complex Die (CCD) may have two CCX.
CCX: Core Complex (4 Cores, 8 Logical Processors, 16MB).
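The Sets column of the cache table on this slide follows from the other columns: sets = capacity / (line size x ways). The short C++ sketch below checks the L1, L2, and L3 rows at compile time; the helper name cache_sets is an illustrative assumption.

```cpp
#include <cstdint>

// Hypothetical helper: number of sets in a set-associative cache.
constexpr std::uint64_t cache_sets(std::uint64_t capacity_bytes,
                                   std::uint64_t line_bytes,
                                   std::uint64_t ways) {
    return capacity_bytes / (line_bytes * ways);
}

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;

// Capacity, line size, and ways taken from the "Matisse" cache table.
static_assert(cache_sets(32 * KiB, 64, 8) == 64, "L1I and L1D");
static_assert(cache_sets(512 * KiB, 64, 8) == 1024, "L2");
static_assert(cache_sets(16 * MiB, 64, 16) == 16384, "L3");
```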

REFILL WITHIN SAME CCX
[Diagram labels: CCX0 to CCX3; CCM0 to CCM3; SDF Transport Layer; CS0, CS1; UMC, UMC; DDR4, DDR4.]
Refills within the same CCX may be relatively low cost!
Some operating system schedulers are CCX aware.
CCX: Core Complex (4 Cores, 8 Logical Processors, 16MB).
IFOP: Infinity Fabric On-Package.
CCM: Cache-Coherent Master has the memory map.
SDF Transport Layer: Scalable Data Fabric Transport Layer.
CS: Coherent Slave responsible for cache coherency.
Electrical interface between chiplets not shown.
UMC: Unified Memory Controller.

REFILL FROM LOCAL DRAM
[Diagram labels: CCX0 to CCX3; CCM0 to CCM3; SDF Transport Layer; CS0, CS1; UMC, UMC; DDR4, DDR4.]
Minimize refills from local DRAM.


REFILL FROM ANY OTHER CCX
[Diagram labels: CCX0 to CCX3; CCM0 to CCM3; SDF Transport Layer; CS0, CS1; UMC, UMC; DDR4, DDR4.]
The cost of a refill from any other CCX may be similar to memory latency.


AMD UPROF PROFILER

NEW IN V3.2
THREAD CONCURRENCY: Scaled chart for Threadripper
FLAME GRAPH: Sorted call stacks
SYMBOLS: Improved symbol path support

7-zip 19.00 x64 benchmark “7z.exe b” shown.
Testing done by AMD technology labs, February 9, 2019 on the following system. Test configuration: AMD Ryzen Threadripper 3970X Processor, AMD Wraith Ripper Cooler, 64GB (4 x 16GB DDR4-3200 at 22-22-22-52) memory, Radeon RX 580 GPU with driver 20.1.3 (January 17, 2020), 2TB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows 10 x64 build 1909, 1920x1080 resolution. Actual results may vary.
You may need to run AMDuProf as administrator to see this advanced option.

Horizontal: normalized inclusive samples.
Vertical: call stack.
Color: module name.


I recommend this profile.

I recommend enabling Call Stack Sampling (CSS) with Frame Pointer Omission (FPO) for Flame Graph analysis.

A 5 second delay may allow you to change the foreground window before profiling starts.
I often collect 30 or 60 seconds of samples.

Enable loading from the Microsoft Symbol Server, especially if you have not defined _NT_SYMBOL_PATH.

OPTIMIZATIONS AND LESSONS LEARNED

TOPICS
General Guidance
Use Best Practices with Scalability
Verify Parallel DX12 Pipeline State Creation
Verify Parallel DX12 Command List Generation
Use Best Practices with Locks
Reorder Hot Struct Members
Use Prefetch Level while iterating std::vector<T*>

GENERAL GUIDANCE

USE THE LATEST COMPILER & SDK
“Zen 2” recommended compiler flags: /GL /arch:AVX2 /MT /fp:fast /favor:blend
Guidance for “Zen 2” and subject to change.
Use /favor:blend and NOT /favor:amd64.
JeMalloc may benefit some applications. See http://jemalloc.net/

Visual Studio Changes by Year, with AMD Products (implicit):
2019 (“Pinnacle Ridge”): Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations.
2017 (“Summit Ridge”): Update v15.9.14 and later may improve AMD Ryzen memcpy/memset performance. Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option.
2015 (“Kaveri”): Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2.

USE A SUPPORTED INSTRUCTION SET
[Chart: Instruction Set Supported. Vertical axis: % Radeon users, 0% to 100%. Horizontal axis: instruction set (SSE2, AVX, AVX2, AVX512). Series: AMD processor, Intel processor.]
Using /arch:AVX or /arch:AVX2 may improve code generation of inline code.
memcpy & memset may be inlined if the length is known at compile time.
AVX is supported on many systems and growing over time.
AVX512 is not supported by AMD processors and was present on less than 1% of users with Intel processors.
Source: AMD User Experience Program Users Survey including 4 million systems sampled from January 2019 to October 2019.

USE ALL PHYSICAL CORES
This advice is specific to AMD processors and is not general guidance for all processor vendors.
Generally, applications show SMT benefits and use of all logical processors is recommended.
However, games often suffer from SMT contention on the main or render threads during gameplay.
One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
Profile your application/game to determine the ideal thread count.
Recommend game options to: Set Max Thread Pool Size, Force Thread Pool Size, Force SMT, Force Single NUMA Node (implicitly Group).
Avoid setting thread pool size as a constant.
See ws/

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors
    DWORD get_default_thread_count() {
        DWORD cores, logical;
        get_processor_count(cores, logical);
        DWORD count = logical;
        char vendor[13];
        get_cpuid_vendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == get_cpuid_family()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
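One possible way to combine the default count with the recommended game options is sketched below. The ThreadOptions struct, its field names, and resolve_thread_count are hypothetical; the slide lists the options but does not show how to implement them, so this is only one reading of them.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical user-facing options; 0 means "not set".
struct ThreadOptions {
    std::uint32_t max_pool_size = 0;     // "Set Max Thread Pool Size"
    std::uint32_t forced_pool_size = 0;  // "Force Thread Pool Size"
    bool force_smt = false;              // "Force SMT": use all logical processors
};

// Picks a worker count from detected hardware and user overrides.
// default_count is the value a function such as get_default_thread_count()
// would return; logical is the logical processor count.
inline std::uint32_t resolve_thread_count(std::uint32_t default_count,
                                          std::uint32_t logical,
                                          const ThreadOptions& opt) {
    if (opt.forced_pool_size != 0) {
        return opt.forced_pool_size;  // explicit override wins
    }
    std::uint32_t count = opt.force_smt ? logical : default_count;
    if (opt.max_pool_size != 0) {
        count = std::min(count, opt.max_pool_size);  // apply the cap
    }
    return std::max<std::uint32_t>(count, 1u);  // never return zero threads
}
```

Keeping the overrides in options rather than constants lets testers compare physical-core and SMT configurations without a rebuild.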

DISABLE DEBUG FEATURES BEFORE YOU SHIP
While investigating open issues, developers may submit change requests which enable debug features on Test and Shipping configurations. These debug features may greatly reduce performance due to disabling multi-threading, cache pollution from STATS, and increased serialization from logging.
Some Unreal Engine settings to verify include:
Build.h: #define FORCE_USE_STATS and #define STATS. See h
Parallel Rendering CVARS. See ist.cpp. See ms
[Table: CVar and Recommended Value. Among the CVar names only r.rhithread.enable is legible in the transcription; the recommended values listed are 0, 1, 1, 1.]

USE BEST PRACTICES WITH SCALABILITY

OPTIMIZE SCALABILITY FOR INTEGRATED GRAPHICS
Goal: 60 FPS Average at 720p 100% Very Low.
Try:
Use DXGI_FORMAT_R11G11B10_FLOAT rather than DXGI_FORMAT_R16G16B16A16_FLOAT.
Reduce shadow map quality.
Reduce volumetric fog quality.
Disable Ambient Occlusion.
For Unreal Engine:
r.SceneColorFormat
r.AmbientOcclusionLevels 0

USE PROPER VIDEO MEMORY BUDGET FOR APU
AGS SDK 5.4 added the isAPU flag.
If true, set the video memory budget to sharedMemoryInBytes for APU (AMD Accelerated Processing Unit with integrated graphics).
If false, set the video memory budget to localMemoryInBytes for discrete GPU.
Example: unsigned long long memory_budget = (device.isAPU) ? es;
See ideo-memory-reporting-apus/
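The selection rule on this slide can be sketched as follows. DeviceInfo is a hypothetical stand-in that models only the three fields named on the slide, and video_memory_budget is an illustrative name; consult the AGS SDK headers for the actual device structure.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-device information reported by the AGS SDK.
struct DeviceInfo {
    bool isAPU;
    std::uint64_t localMemoryInBytes;   // dedicated video memory
    std::uint64_t sharedMemoryInBytes;  // system memory shared with the GPU
};

// APU: budget from shared memory. Discrete GPU: budget from local memory.
inline std::uint64_t video_memory_budget(const DeviceInfo& device) {
    return device.isAPU ? device.sharedMemoryInBytes
                        : device.localMemoryInBytes;
}
```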

VERIFY PARALLEL DX12 PIPELINE STATE CREATION

VERIFY PARALLEL DX12 PIPELINE STATE CREATION
Game shows parallel DX12 Pipeline State Creation.
Performance of binary compiled with:
Microsoft Visual Studio 2019 v16.4.5.
UnrealEngine-4.24.2-release from https://github.com/EpicGames/UnrealEngine
Windows (64-bit) Packaged Project “Infiltrator Demo” from Epic Games Store; uct/infiltrator-demo
Testing done by AMD technology labs, February 18, 2020 on the following system. Test configuration: AMD Ryzen 9 3950X Processor, AMD Wraith Prism Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, Radeon VII GPU with driver 20.1.4 (January 24, 2020), 2TB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows 10 x64 build 1909, 1920x1080 resolution. Actual results may vary.

UE4.24 PARALLELIZED DX12 PIPELINE STATE CREATION
me/D3D12RHI/Private/D3D12Pipelinestate.cpp#L488
Hello Parallelism!

TEST COLD SHADER CACHE
Using a cold shader cache may simplify verifying if D3D12.dll!CDevice::CreatePipelineState was called in parallel.
Install the Windows SDK Windows Performance Toolkit. Add the GPUView folder to the PATH.
Applications and games may vary configurations of shader caches on disk yielding different results.
Results may vary based on GPU vendor & driver versions used.

    rmdir /s /q "%LOCALAPPDATA%\D3DSCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\DxCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\VkCache"
    rmdir /s /q "%LOCALAPPDATA%\AMD\GLCache"
    call log.cmd
    pushd "C:\WindowsNoEditor"
    start InfiltratorDemo.exe -dx12
    popd
    timeout.exe /t 10
    call log.cmd

Cold shader cache shown.
Add CPU Usage (Precise).
Add Flame Graph, find all D3D12.dll!CDevice::CreatePipelineState.
See parallelism highlighted in CPU Usage (Precise). This is easiest to find using a cold shader cache.

Warm shader cache shown.
See parallelism highlighted in CPU Usage (Precise).

VERIFY PARALLEL DX12 COMMAND LIST GENERATION

VERIFY PARALLEL DX12 COMMAND LIST GENERATION
Game shows parallel DX12 Command List Generation.
Performance of binary compiled with:
Microsoft Visual Studio 2019 v16.4.5.
UnrealEngine-4.24.2-release from https://github.com/EpicGames/UnrealEngine
Windows (64-bit) Packaged Project “Infiltrator Demo” from Epic Games Store; uct/infiltrator-demo
Testing done by AMD technology labs, February 13, 2020 on the following system. Test configuration: AMD Ryzen 9 3950X Processor, AMD Wraith Prism Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, Radeon VII GPU with driver 20.1.4 (January 24, 2020), 2TB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows 10 x64 build 1909, 1920x1080 resolution. Actual results may vary.

Run:
    InfiltratorDemo.exe -dx12
Run as admin:
    timeout.exe /t 5
    call log.cmd
    timeout.exe /t 3
    call log.cmd
Open merged.etl using the Windows Performance Analyzer.
Zoom to a single frame using Present markers.
Move the CPU column next to Task Name, then filter and expand CommandList.

USE BEST PRACTICES WITH LOCKS

SUMMARY
Use modern OS synchronization APIs.
Recommended: std::mutex, std::shared_mutex, SRWLock, EnterCriticalSection.
May allow more efficient scheduling and longer battery life.
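A minimal sketch of the first recommendation, using std::mutex so that waiting threads block in the OS instead of spinning. The names Counter and run_counter are illustrative assumptions and do not come from the presentation's benchmark.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter guarded by std::mutex.
class Counter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);  // blocks in the OS if contended
        ++value_;
    }
    long read() {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }
private:
    std::mutex mutex_;
    long value_ = 0;
};

// Runs `threads` workers, each incrementing the counter `iterations` times.
inline long run_counter(unsigned threads, unsigned iterations) {
    Counter counter;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (unsigned i = 0; i < iterations; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter.read();
}
```

For read-mostly data, std::shared_mutex with std::shared_lock for readers follows the same pattern.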

SUMMARY
Otherwise, for user spin locks:
Use the pause instruction.
alignas(64) lock variable.
Test and test-and-set.
Avoid lock prefix instructions.
The OS may be unaware that threads are spinning; scheduling efficiency and battery life may be lost.
Use spin locks only if held for a very short time.
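The bullets on this slide can be sketched with std::atomic as follows. This is a portable variant written for illustration, not the presenter's Windows Interlocked code; the names SpinLock, cpu_relax, and spin_count are assumptions, and the std::this_thread::yield fallback is used only where the pause intrinsic is unavailable.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical test and test-and-set spin lock on its own cache line.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: plain loads while the lock is held, no locked instruction.
            while (taken_.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
            // Test-and-set: one atomic exchange once the lock looks free.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    static void cpu_relax() {
#if defined(__x86_64__) || defined(_M_X64)
        _mm_pause();
#else
        std::this_thread::yield();  // fallback where pause is unavailable
#endif
    }
    std::atomic<bool> taken_{false};
};

// Each worker increments a shared total under the spin lock.
inline long spin_count(unsigned threads, unsigned iterations) {
    SpinLock lock;
    long total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &total, iterations] {
            for (unsigned i = 0; i < iterations; ++i) {
                lock.lock();
                ++total;  // very short critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return total;
}
```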

“ZEN 1” PERFORMANCE
[Chart: Locks Benchmark, Ryzen 7 2700X. Vertical axis: Elapsed Time (ms), less is better. Horizontal axis: binary. Bar labels: 6921%, 105%, 100%, 100%, 100%, 101%, 108%.]
Binaries compiled using best practices show improved performance.
Performance of binary compiled with Microsoft Visual Studio 2019 v16.4.5.
Testing done by AMD technology labs, February 13, 2020 on the following system. Test configuration: AMD Ryzen 7 2700X Processor, AMD Wraith Prism Cooler, 16GB (2 x 8GB DDR4-2667 at 20-19-19-43) memory, Radeon RX 5700 XT GPU with driver 20.1.4 (January 24, 2020), 512GB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows 10 x64 build 1909, 1920x1080 resolution. Actual results may vary.

“ZEN 2” PERFORMANCE
[Chart: Locks Benchmark, Ryzen 7 3700X. Vertical axis: Elapsed Time (ms), less is better. Horizontal axis: binary. Bar labels as transcribed: “…119%” (leading digits unclear), 110%, 100%, 100%, 100%, 100%, 101%.]
“Zen 2” improved SMT fairness for ALU schedulers. This helps mitigate bad user spin lock code.
Binaries compiled using best practices show improved performance.
Performance of binary compiled with Microsoft Visual Studio 2019 v16.4.5.
Testing done by AMD technology labs, February 13, 2020 on the following system. Test configuration: AMD Ryzen 7 3700X Processor, AMD Wraith Prism Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, Radeon RX 5700 XT GPU with driver 20.1.4 (January 24, 2020), 512GB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows 10 x64 build 1909, 1920x1080 resolution. Actual results may vary.

“ZEN 2” PERFORMANCE
[Chart: Locks Benchmark, Ryzen 7 3700X. Vertical axis: % idle, higher is better. Horizontal axis: binary. Legible bar labels: 91%, 91%, 93%.]
Binaries compiled using best practices show improved idle.
Performance of binary compiled with Microsoft Visual Studio 2019 v16.4.5.
Testing done by AMD technology labs, February 13, 2020 on the following system. Test configuration: AMD Ryzen 7 3700X Processor, AMD Wraith Prism Cooler, 16GB (2 x 8GB DDR4-3200 at 22-22-22-52) memory, Radeon RX 5700 XT GPU with driver 20.1.4 (January 24, 2020), 512GB M.2 NVME SSD, AMD Ryzen Reference Motherboard, Windows 10 x64 build 1909, 1920x1080 resolution. Actual results may vary.

PROFILING
Use AMD uProf to find possible user spin locks.
AMD uProf v3.2 "Assess Performance (Extended)" Event Based Sampling Profile.
ALUTokenStall PTI > 3K (Per Thousand Instructions) is bad for top functions.
Replace user spin locks with modern OS synchronization APIs when possible. Otherwise, use best practices.
Use Microsoft Windows Performance Analyzer to find call stacks using OS synchronization APIs.

    rem Recommend using public Microsoft symbol server
    rem _NT_SYMBOL_PATH=srv*http://msdl.microsoft.com/download/symbols
    rem "-start gpu -start video" wpr profiles are useful for game analysis for short durations
    wpr.exe -setprofint 1221 -start power -filemode
    test.exe
    wpr.exe -stop log.etl
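PTI normalizes an event count by the number of retired instructions, which makes functions of different sizes comparable. A minimal sketch of that arithmetic follows; the function names are hypothetical, and treating 3K as the warning level is an assumption based on the wording of this slide, not a documented threshold.

```cpp
// Hypothetical helper: events per thousand retired instructions (PTI).
constexpr double per_thousand_instructions(unsigned long long events,
                                           unsigned long long retired_instructions) {
    return retired_instructions == 0
               ? 0.0
               : 1000.0 * static_cast<double>(events) /
                     static_cast<double>(retired_instructions);
}

// Assumed warning level taken from the slide: more than 3K stalls PTI.
constexpr bool looks_like_spin_lock(double alu_token_stall_pti) {
    return alu_token_stall_pti > 3000.0;
}
```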

EXAMPLES SHARED CODE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <mutex>
    #include <shared_mutex>

    #define LEN 512
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    // Defined by each example.
    DWORD WINAPI ThreadProcCallback(LPVOID data);

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }

EXAMPLE 1 BAD USER SPIN LOCK
Warning! Not best practices for spin lock.

    namespace MyLock {
        typedef unsigned LOCK, * PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
        void Lock(PLOCK pl) {
            // Spins on a locked compare-exchange with no pause instruction.
            while (LOCK_IS_TAKEN ==
                   InterlockedCompareExchange(pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            }
        }
        void Unlock(PLOCK pl) {
            InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            MyLock::Lock(&gLock);
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
            MyLock::Unlock(&gLock);
        }
        printf("result: %f\n", r);
        return 0;
    }

Warning! 3K ALUToken Stalls PTI!

Warning! 26K ALUToken Stalls PTI!

Warning! No pause instruction in spin loop!

EXAMPLE 2 IMPROVED USER SPIN LOCK
Good! Applied best practices for spin loc

    namespace MyLock {
        typedef unsigned LOCK, * PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
        void Lock(PLOCK pl) {
            // Test first with a plain read, then test-and-set.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
