Optimizing NVIDIA Virtual GPU for the Best VDI User Experience

Transcription

OPTIMIZING NVIDIA VIRTUAL GPU FOR THE BEST VDI USER EXPERIENCE
Erik Bohnhorst, NVIDIA

Agenda
- vGPU Introduction
- Virtual GPU October 2018 (vGPU 7.0)
- Architecting for Best User Experience
- NVIDIA Recommended CPUs
- What is the right GPU for your use case
- NVIDIA vGPU Benchmarking

THE EVOLUTION OF MODERN WORKFLOWS
[Diagram labels: Visual Workspace, Mobility, Collaboration, Large Data, Interactive, HPC, AI, Photorealism, VR, Visual Computing Spectrum]

HOW IT WORKS
NVIDIA virtual GPU technology delivers a GPU experience to every desktop
CPU Only VDI: Limiting User Experience
- Apps and VMs
- Hypervisor
- Server
With NVIDIA Virtual GPU: Driving the Best User Experience across simple to the most powerful Apps
- Apps and VMs, with NVIDIA Graphics Drivers
- NVIDIA Virtual GPU
- Hypervisor, with NVIDIA Virtualization Software
- Server, with NVIDIA Tesla GPU

VIRTUAL GPU OCTOBER 2018 (vGPU 7.0)
Unprecedented Performance & Manageability
- Multi-vGPU Support: World’s Most Powerful Quadro vDWS
- vMotion Support for vGPU: Live Migration of vGPU enabled VMs; Quadro vDWS & GRID
- NGC with vGPU: Available with vGPU; Quadro vDWS
- Tesla T4 GPU Support*: Latest Generation Turing; Quadro vDWS
* Tesla T4 support coming with vGPU software 7.1 release

GIANT LEAP
Greatest Leap Since 2006 CUDA GPU
Pascal: 11.8 billion transistors, 471 mm², 24 GB, 10 GHz
Turing: 18.6 billion transistors, 754 mm², GDDR6, 14 GHz

GIANT LEAP
Greatest Leap Since 2006 CUDA GPU
Pascal: Shader Compute (FP or INT)
Turing: Shader Compute (FP, INT); Tensor Core (FP16, INT8, INT4); RT Core (Giga Rays/Sec)

VIRTUAL GPU

TESLA T4 IS EXTREMELY VERSATILE
Enablement in Virtual GPU 7.1
Great solution for: Quadro vDWS, GRID vPC, Deep Learning Inference
6 boards in high volume 2U rack servers
TESLA T4
- GPU: 1x TU104
- Cores: 2,560 CUDA Cores, 320 Turing Tensor Cores, RT Cores
- Memory: 16 GB GDDR6
- Form Factor: PCIe 3.0 Single Slot (half height & length)
- Thermal: Passive
- Power: 70 W, no external power
- Max Users: 16 (1GB FB)
- Compute: 65 FP16 TFLOPS, 130 INT8 TOPS, 240 INT4 TOPS
- Memory Bandwidth: 320 GB/s
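The "Max Users: 16 (1GB FB)" figure follows from dividing the board's 16 GB of memory by a 1 GB framebuffer per user. A minimal sketch of that arithmetic, assuming framebuffer is the only limit; the function names are hypothetical, and real deployments are also bounded by the vGPU profiles the software offers, CPU resources, and the workload:

```python
def max_users_per_board(board_fb_gb: int, profile_fb_gb: int) -> int:
    """Upper bound on users for one board, limited by framebuffer only."""
    if board_fb_gb <= 0 or profile_fb_gb <= 0:
        raise ValueError("framebuffer sizes must be positive")
    # Each user gets a fixed framebuffer slice; leftover memory is unused.
    return board_fb_gb // profile_fb_gb


def max_users_per_server(boards: int, board_fb_gb: int, profile_fb_gb: int) -> int:
    """Same bound scaled to a server holding several identical boards."""
    if boards < 0:
        raise ValueError("board count cannot be negative")
    return boards * max_users_per_board(board_fb_gb, profile_fb_gb)


if __name__ == "__main__":
    # Tesla T4 figures from the slide: 16 GB memory, 1 GB framebuffer per user.
    print(max_users_per_board(16, 1))      # 16
    # Six boards in a 2U server, as mentioned on the slide.
    print(max_users_per_server(6, 16, 1))  # 96
```

This is a ceiling, not a sizing recommendation: larger framebuffer slices lower the count proportionally (for example, 2 GB per user halves it).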

94% FASTER RENDERING USING MULTI-GPU
SOLIDWORKS Visualize (Iray) Render Time
[Bar chart: 1x Tesla V100 vs. 2x Tesla V100, labeled "94% Faster"]
Up to 94% Faster Render Time Using Multi-GPU SOLIDWORKS Visualize (Iray)
“The flexibility of the new multi-GPU feature available with NVIDIA Quadro vDWS opens up powerful new rendering workflows to SOLIDWORKS Visualize users. The near linear performance scaling means they can iterate on their designs at lightning speed on professional virtual workstations, allowing our customers to arrive at their best design in the shortest amount of time.”
Brian Hillner, SOLIDWORKS Product Portfolio Manager
Tests were run on a server with 2x Intel Xeon Gold (6154 3.0 GHz) CPUs, 512GB RAM, RHEL 7.5, NVIDIA Quadro vDWS software, Tesla V100-32Q, Driver - 410.39, 256 GB RAM, Windows 10 x64 RS3
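The "near linear performance scaling" in the quote can be checked against the headline number. A minimal sketch, assuming "94% faster" means the 2x V100 configuration delivers 1.94 times the rendering throughput of 1x V100; the slide does not spell out that definition, and the function names are hypothetical:

```python
def speedup_from_percent_faster(percent_faster: float) -> float:
    """Convert 'N% faster' into a speedup factor (assumed throughput-based)."""
    return 1.0 + percent_faster / 100.0


def scaling_efficiency(speedup: float, gpu_count: int) -> float:
    """Fraction of ideal linear scaling achieved with gpu_count GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return speedup / gpu_count


if __name__ == "__main__":
    speedup = speedup_from_percent_faster(94)  # figure from the slide
    efficiency = scaling_efficiency(speedup, gpu_count=2)
    print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

Under that assumption, two GPUs reach about 97% of ideal linear scaling.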

UP TO 4.95X FASTER THAN CPU-ONLY
Abaqus/Standard 2018 Elastomeric Bearing Model
Abaqus with NVIDIA Quadro vDWS & Tesla V100-32Q
[Bar chart: relative performance on a 0 to 5X axis for the following configurations]
- 16 vCPU Quadro vDWS with 2x V100, 16 License Tokens
- 16 vCPU Quadro vDWS with 1x V100, 16 License Tokens
- 32 vCPU, 21 License Tokens
- 16 vCPU, 16 License Tokens
Tests run on a server with 2x Intel Xeon Skylake CPUs (Xeon 6148 2.4 GHz 32-core), NVIDIA Quadro vDWS software, Tesla V100 GPUs with 32Q profile, Driver - 410.53, 256 GB vRAM, CentOS 7.4 64-bit. Benchmark Model: 450-550 TFLOPs, 5.9M DOF, Highly Nonlinear Static, Axisymmetric model with non-axisymmetric loading and twist, Direct Sparse Solver (Model courtesy: SIMULIA)

NVIDIA VIRTUAL GPU SOFTWARE LINEUP
Quadro Virtual Data Center Workstation (Quadro vDWS)
- For professional graphics applications; includes an NVIDIA Quadro driver.
- Recommended GPU: Tesla P4*
GRID Virtual PC (GRID vPC)
- For virtual desktops delivering standard PC applications, browser, and multimedia.
- Recommended GPU: Tesla M10
GRID Virtual Applications (GRID vApps)
- Use with VMware Horizon Apps.
- Recommended GPU: Tesla M10
* P40 & V100 for High End & Ultra High-End Use Cases
* P6 for blade form factor deployments

NVIDIA RECOMMENDED CPU OPTIONS
Different workflows require different CPUs
GRID vPC
- Intel Xeon Gold 6148: 24 Cores @ 2.4 GHz
- AMD EPYC 7501: 32 Cores @ 2.0 GHz
- Both CPUs provide a similar user experience*, while the AMD CPU can host 25-33% more users.
- The AMD EPYC CPU’s higher number of physical cores at lower frequency provides a similar user experience to the Intel Xeon Gold 6148 at higher scale.
Quadro vDWS
- Intel Xeon Gold 6154: 18 Cores @ 3.0 GHz
- 3.0 GHz is required for many professional applications for optimal performance. Lower frequency can result in degraded performance.
- Provides the required frequency per physical core and allows good scale (18 cores/CPU).
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames
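The upper end of the "25-33% more users" range matches the ratio of the physical core counts quoted on the slide. A minimal sketch of that comparison, assuming user density scales with physical cores per CPU; the slide's figures come from benchmarking rather than this formula, and the function name is hypothetical:

```python
def extra_capacity_pct(baseline_cores: int, candidate_cores: int) -> float:
    """Percent more users the candidate CPU could host if density scaled with cores."""
    if baseline_cores <= 0 or candidate_cores <= 0:
        raise ValueError("core counts must be positive")
    return (candidate_cores / baseline_cores - 1.0) * 100.0


if __name__ == "__main__":
    # Core counts as quoted on the slide: 24 (Xeon Gold 6148) and 32 (EPYC 7501).
    print(f"{extra_capacity_pct(24, 32):.1f}% more users")
```

The measured range starting at 25% suggests density does not scale perfectly with core count, which fits the lower per-core frequency of the 32-core part.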

RECOMMENDED NVIDIA TESLA GPU OPTIONS
Different workflows require different GPUs
Recommended pairings:
- Quadro vDWS: Intel Xeon Gold 6154 with NVIDIA Tesla P4*
- GRID vPC: Intel Xeon Gold 6148 or AMD EPYC 7501 with NVIDIA Tesla M10**
- GRID vApps: Intel Xeon Gold 6148 with NVIDIA Tesla M10**
Quadro vDWS:
- Tesla P4 GPUs with Quadro vDWS provide the most flexible and cost-effective solution for entry to mid-end users.
- Tesla P40/V100 GPUs with Quadro vDWS provide graphics acceleration for a few ultra-high-end users.
GRID vPC:
- Tesla M10 GPUs with GRID vPC enhance the user experience while being the most cost-effective solution.
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs
** Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames

QUADRO vDWS GUIDANCE
GPU tiers, top to bottom: Tesla V100; Tesla P40 (High-End Quadro vDWS); Tesla P4 / T4 (Entry to Mid Range Quadro vDWS)
Workloads, from most to least demanding:
- Deep learning, rendering, immersive visualization, and GPGPU compute applications
- Largest CAD models, CAE, photorealistic rendering, seismic exploration, GPGPU compute
- Large/complex CAD models, seismic exploration, complex DCC effects, 3D medical imaging recon
- Large/complex CAD models, advanced DCC, medical imaging
- Medium size/complexity CAD models, basic DCC, medical imaging, PLM
- Small/simple CAD models, video, entry PLM
- PACS/Diagnostics
Example applications:
- Schlumberger, Halliburton, DeltaGen, Catia Live Rendering
- Ansys, Abaqus, Simulia
- Solidworks, Siemens NX, Creo, Catia
- AutoCAD, Revit, Inventor
- Adobe CC Premiere Pro, After Effects, Autodesk Maya, 3ds Max, Mari, Nuke
- Adobe CC Photoshop, Illustrator
- Office, Sketchup
[Additional diagram label: 4GB Memory]

UP TO 6X TESLA P4
Sufficient CPU resources to host 6x Tesla P4 with Quadro vDWS
NVIDIA recommends the Intel Xeon Gold 6154 (18 cores, 3.0 GHz), which provides enough CPU resources to host 6x Tesla P4 GPUs with Quadro vDWS.
Tesla P4 benefits over Tesla M60:
- Performance*
- Price/Performance
- Smaller Form Factor
- Lower Power Consumption
- NVIDIA Pascal GPU Architecture Benefits
[Chart: Best Density with 6x Tesla P4 on a Dual-Socket Server. Relative performance for 3dsMax, Catia, Maya, Siemens NX, and Solidworks; series 1x Tesla P4 (bars labeled 2 VMs or 4 VMs) and 6x Tesla P4 (bars labeled 12 VMs or 24 VMs)]
[Chart: Tesla P4 and Tesla M60 Performance*. Relative performance of Tesla M60 GPU vs. Tesla P4 GPU for 3dsMax, Catia, Maya, Siemens NX, and Solidworks]
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs

TESLA P40/V100 FOR ULTRA HIGH END USERS
Tesla P40 and Tesla V100 power the most demanding workflows
- Tesla P40 with Quadro vDWS for few high to ultra high end users.
- Tesla V100 with Quadro vDWS for few high to ultra high end users and/or Deep Learning workflows.
- Multiple Tesla P4 GPUs are the most cost effective and flexible solution for many entry to mid range end users.
When to choose Tesla P40 over P4:
- Maximum Performance*
- High Framebuffer profiles (12GB/24GB)
TESLA P4: Many Low-Mid End Users
- Price/Performance
- Form Factor
- Power Consumption
- Different Profiles (Many P4s)
TESLA P40: Few Mid-High End Users
- Performance
- High Framebuffer Profiles (12GB and 24GB)
REMEMBER: Large framebuffer GPUs don’t guarantee a high number of Quadro vDWS users.
[Chart: Tesla P4 and Tesla P40 Performance*. Relative performance of Tesla P4 vs. Tesla P40]
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using SPECviewperf 12.1 on a Dell PowerEdge R740 with 2x Intel Xeon Gold 6154 CPUs

SIMPLE vGPU LICENSE DECISION
1. Do you use CUDA or professional workstation apps?
   Yes: You need Quadro vDWS (includes vPC and vApps entitlement).
   No: Continue to question 2.
2. Do you use VDI? (Single user per OS)
   Yes: You need GRID vPC (includes vApps entitlement).
   No: Continue to question 3.
3. Do you have multiple users sharing a single OS through sessions? (RDSH, Horizon Apps, XenApp, etc.)
   Yes: You need vApps.
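The three questions on this slide form a simple decision chain, sketched below. The function and parameter names are hypothetical, and returning None when every answer is "no" is an assumption; the slide does not show that branch.

```python
from typing import Optional


def recommend_license(
    uses_cuda_or_pro_apps: bool,
    uses_vdi: bool,
    shares_os_sessions: bool,
) -> Optional[str]:
    """Walk the slide's questions in order and return the first matching license."""
    if uses_cuda_or_pro_apps:
        # Per the slide, Quadro vDWS includes vPC and vApps entitlement.
        return "Quadro vDWS"
    if uses_vdi:
        # Single user per OS; per the slide, includes vApps entitlement.
        return "GRID vPC"
    if shares_os_sessions:
        # Multiple users sharing one OS (RDSH, Horizon Apps, XenApp, etc.).
        return "GRID vApps"
    return None  # Assumption: no vGPU license indicated by these questions.


if __name__ == "__main__":
    print(recommend_license(False, True, False))  # GRID vPC
```

The order of the checks matters: a CUDA or workstation user on VDI lands on Quadro vDWS, because that edition already covers the lower tiers.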

HOW NVIDIA MEASURES USER EXPERIENCE
Applying Methodology of Physical PCs to Virtual PCs
UNIQUELY Quantifies Remote User Experience
- End User Latency: Measures interactivity, how remote your session feels
- Framerate: Measures the fluidity of your session
- Image Quality: Measures the impact of the remote protocol
- Functionality: Application and API compatibility
- Consistency: Measures how consistent the UX is over time
Monitors Resource Utilization
- Host Resources: CPU, GPU, Memory, etc.
- Virtual Machine Resources: vCPUs, vMemory, vGPU, IOPS, etc.
- Network Consumption: Bandwidth, etc.
Realistic Sizing Recommendations
[Diagram: End User Latency, Framerate, Image Quality, and Functionality combine into UX]

NVIDIA IMAGE QUALITY RECOMMENDATION
- GRID vPC: YUV 4:4:4 for PC Users
- Quadro vDWS: YUV 4:2:0 for Workstation Users
[Image comparison for each product: Reference Image, YUV 4:2:0, YUV 4:4:4]
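The two formats differ in how much color information each frame carries before encoding: YUV 4:4:4 keeps chroma at full resolution, while YUV 4:2:0 halves it both horizontally and vertically. A minimal sketch that counts raw samples per frame; the function name is hypothetical, and these are pre-compression counts, not the encoded bandwidth a remote protocol actually uses:

```python
# Horizontal and vertical chroma subsampling factors for each format.
SUBSAMPLING_FACTORS = {
    "4:4:4": (1, 1),  # chroma at full resolution
    "4:2:0": (2, 2),  # chroma halved in both directions
}


def samples_per_frame(width: int, height: int, subsampling: str) -> int:
    """Count luma plus chroma samples in one uncompressed frame."""
    if width <= 0 or height <= 0:
        raise ValueError("frame dimensions must be positive")
    h_factor, v_factor = SUBSAMPLING_FACTORS[subsampling]
    luma = width * height
    # Round up so odd dimensions still get a chroma sample at the edge.
    chroma_w = (width + h_factor - 1) // h_factor
    chroma_h = (height + v_factor - 1) // v_factor
    return luma + 2 * chroma_w * chroma_h  # two chroma planes (U and V)


if __name__ == "__main__":
    print(samples_per_frame(1920, 1080, "4:4:4"))  # 6220800
    print(samples_per_frame(1920, 1080, "4:2:0"))  # 3110400
```

For a 1080p frame, 4:4:4 carries twice the raw samples of 4:2:0, all of the difference being color detail, which is what sharpens fine colored text and lines on a desktop.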

YUV 4:4:4 IMPLICATIONS
[Chart: Image Quality, YUV 4:2:0 vs. YUV 4:4:4] Improved Image Quality: SSIM increase to 0.989**
[Chart: 64 VM Bandwidth (Mbit/sec), YUV 4:2:0 vs. YUV 4:4:4] Similar Bandwidth Utilization*: YUV 4:4:4 uses 2% less bandwidth
[Chart: Total Remoted Frames/User, YUV 4:2:0 vs. YUV 4:4:4] Lower Remoted Frames*: YUV 4:4:4 has 9% fewer Remoted Frames
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload running 64 VMs with Tesla M10-1B
** Tested with NVIDIA Cirrus VDI Benchmarking tool and predefined reference images to represent multiple workflows

GRID vPC for Multiple Screens
[Charts: End User Latency (ms), Server CPU Utilization (%), Remoted Frames / User. Series shown: 1x 1080p Screen (CPU-Only)]
GRID vPC for High Screen Resolutions
[Charts: End User Latency (ms), Server CPU Utilization (%), Remoted Frames / User. Series shown: 1x 1080p Screen (CPU-Only)]
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames

GRID vPC for Multiple Screens
[Charts: End User Latency (ms), Server CPU Utilization (%), Remoted Frames / User. Series shown: 1x 1080p Screen (CPU-Only), 2x 1080p Screens (CPU-Only)]
GRID vPC for High Screen Resolutions
[Charts: End User Latency (ms), Server CPU Utilization (%), Remoted Frames / User. Series shown: 1x 1080p Screen (CPU-Only), 1x 4K Screen (CPU-Only)]
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames

GRID vPC for Multiple Screens
[Charts: End User Latency (ms), Server CPU Utilization (%), Remoted Frames / User. Series shown: 1x 1080p Screen (CPU-Only), 2x 1080p Screens (CPU-Only), 2x 1080p Screens (GRID vPC)]
GRID vPC for High Screen Resolutions
[Charts: End User Latency (ms), Server CPU Utilization (%), Remoted Frames / User. Series shown: 1x 1080p Screen (CPU-Only), 1x 4K Screen (CPU-Only), 1x 4K Screen (GRID vPC)]
* Tested with NVIDIA’s Cirrus VDI Benchmarking tool using the Knowledge Worker workload and comparing End-User Latency and Remoted Frames

THANK YOU
