Fire And Ice: How Temperature Affects GPU Performance

Transcription

Fire and Ice: How Temperature Affects GPU Performance
Danny Price
Harvard-Smithsonian Center for Astrophysics
dprice@cfa.harvard.edu
CfA: Danny Price, Ben Barsdell, Lincoln Greenhill
NVIDIA: Mike Clark, Ron Babich
GTC 2014 MARCH 24-27 SAN JOSE, CA

Introduction
Watts, leakage & performance
GPUs in radio astronomy
Experimental setup
Results & conclusions

The Green 500
Source: http://www.green500.org/lists/green201311

The Green 500
#1 TSUBAME-KFC: Tokyo Institute of Technology, oil-cooled
#2 Wilkes Cluster: University of Cambridge, air-cooled
Source: http://www.green500.org/lists/green201311

The Square Kilometer Array
Image: www.skatelescope.org

Radio Interferometry - lots of computations

Radio Interferometry - lots of computations
More antennas means lots more baselines: for example, NA = 512 gives NB = 130816
Computations done on a frequency channel basis
Trivially parallelizable: over frequency and baseline
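The baseline count quoted above is consistent with the usual pair-counting formula NB = NA (NA - 1) / 2. The sketch below assumes the slide counts cross-correlations only (no autocorrelations); the function name and the `include_autos` flag are illustrative and do not come from the talk or from xGPU.

```python
def n_baselines(n_antennas: int, include_autos: bool = False) -> int:
    """Count antenna pairs (baselines) for an array of n_antennas.

    Cross-correlations only by default; include_autos adds one
    autocorrelation per antenna.
    """
    cross = n_antennas * (n_antennas - 1) // 2
    return cross + n_antennas if include_autos else cross


# The count grows roughly as the square of the antenna count.
for n in (32, 256, 512):
    print(n, n_baselines(n))
```

For NA = 512 this reproduces the 130816 baselines quoted on the slide.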

xGPU - a GPU correlator for radio astronomy
xGPU is a GPU-based cross correlator for radio astronomy.
It powers LEDA (the world’s largest N_ant radio telescope), PAPER, and the MWA (among others).
M. A. Clark, P. C. La Plante, and L. J. Greenhill, "Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units", arXiv:1107.4264 [astro-ph].
Authors:
Michael Clark (NVIDIA)
Paul La Plante (Loyola University Maryland)
Lincoln Greenhill (Harvard-Smithsonian Center for Astrophysics)
David MacMahon (University of California, Berkeley)
Ben Barsdell (Harvard-Smithsonian Center for Astrophysics)
Image: The LEDA-512 correlator

Watts, leakage & performance
Leakage power (bad)

Watts, leakage & performance
Switching frequency
Transistor capacitance
Voltage
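The three quantities named above are the ingredients of the standard CMOS power model. The slide's own equation is not in the transcript, so the following is the textbook form, offered as an assumption; the activity factor \alpha and the symbol I_leak are not taken from the talk.

```latex
P_{\text{total}} = P_{\text{dynamic}} + P_{\text{leakage}}, \qquad
P_{\text{dynamic}} \approx \alpha \, C \, V^{2} f, \qquad
P_{\text{leakage}} = V \, I_{\text{leak}}
```

Here C is the switched transistor capacitance, V the supply voltage, and f the switching frequency. Dynamic power scales with the square of the voltage, which is why undervolting is attractive.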

Watts, leakage & performance
Leakage mechanisms:
Reverse-biased junction leakage
Gate-induced drain leakage
Gate direct-tunneling leakage
Subthreshold (weak inversion) leakage

How temperature affects performance

How temperature affects performance
Device specifics:
  A: Technology-specific constant
  L, W: device channel length & width
Voltage terms:
  Vgs: Gate-to-source voltage
  Vth: Switching threshold voltage
  n: transistor subthreshold swing coefficient
Thermal voltage (kT/q):
  k = 8.62 × 10^-5 eV/K
  26 mV at room temperature
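To make the temperature dependence concrete, the sketch below computes the thermal voltage from the constant quoted above, then evaluates a subthreshold leakage model of the common textbook form I = A (W/L) vT^2 exp((Vgs - Vth) / (n vT)). That form is an assumption chosen to match the listed symbols, because the slide's own equation is not in the transcript. The drain-source factor is omitted, and Vth is treated as temperature-independent. The default parameter values are placeholders, not measurements.

```python
import math

K_BOLTZMANN_EV_PER_K = 8.62e-5  # Boltzmann constant in eV/K, as quoted on the slide


def thermal_voltage(temp_kelvin):
    """Thermal voltage kT/q in volts (k in eV/K gives volts directly)."""
    return K_BOLTZMANN_EV_PER_K * temp_kelvin


def subthreshold_leakage(temp_kelvin, v_gs=0.0, v_th=0.3, n=1.5,
                         a=1.0, w_over_l=1.0):
    """Assumed textbook model, arbitrary units:
    I = A * (W/L) * vT^2 * exp((Vgs - Vth) / (n * vT)).
    All default values are illustrative placeholders.
    """
    v_t = thermal_voltage(temp_kelvin)
    return a * w_over_l * v_t ** 2 * math.exp((v_gs - v_th) / (n * v_t))


# Roughly 26 mV at 300 K, matching the slide.
print(f"vT at 300 K: {thermal_voltage(300.0) * 1e3:.2f} mV")

# Leakage of a hotter die relative to a cooler one, under the assumed model.
ratio = subthreshold_leakage(360.0) / subthreshold_leakage(300.0)
print(f"leakage ratio, 360 K vs 300 K: {ratio:.2f}")
```

Under this model the leakage current rises steeply with temperature because vT appears inside the exponential, which is the effect the experiment sets out to measure.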

Results Part 0: Green GPU: A water-cooled test rig

Green GPU: A water-cooled test rig
GreenGPU specifications:
Gigabyte GA-Z68MX motherboard
Intel Core i7-2600
16GB RAM
NVIDIA Tesla K20m
EK-FCTK20 water block

Warranty void: K20m from a HP server

Warranty void: K20m from a HP server
GK110
GK110 die image: news.softpedia.com

Green GPU

Green GPU
Tuna Can

The experiment
Run xGPU in benchmark mode
Run SMI tool in loop
Monitor GPU die temp and power
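The monitoring loop described above can be sketched as follows. This is a minimal illustration, not the script used in the talk: it assumes a recent `nvidia-smi` that supports `--query-gpu` with the `temperature.gpu` and `power.draw` fields, and the helper names `parse_sample` and `poll` are hypothetical.

```python
import subprocess
import time

# Ask nvidia-smi for die temperature (deg C) and board power (W) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one 'temperature, power' CSV line into (deg C, watts)."""
    temp_str, power_str = (field.strip() for field in line.split(","))
    return float(temp_str), float(power_str)


def poll(n_samples, interval_s=1.0):
    """Sample the first GPU n_samples times while a benchmark runs elsewhere."""
    samples = []
    for _ in range(n_samples):
        out = subprocess.check_output(QUERY, text=True)
        samples.append(parse_sample(out.splitlines()[0]))
        time.sleep(interval_s)
    return samples


# Usage (requires an NVIDIA GPU and driver):
#     for temp_c, power_w in poll(10):
#         print(temp_c, power_w)
```

Running xGPU in benchmark mode in one terminal and this loop in another gives paired temperature and power samples of the kind plotted in the results slides.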

NVIDIA System Management Interface

NVIDIA SMI: GPU Boost on K20m
Can control “boost clock” through nvidia-smi:
  nvidia-smi -q -d SUPPORTED_CLOCKS
  nvidia-smi -ac <MEM clock>,<Graphics clock>
Changing boost clock changes voltage levels (more on this later)
Undervolting uses less power, but becomes unstable at higher temperatures
Overvolting uses more power, but is more stable at higher temperatures. There is a big “overclocker” community that exploits this.
V4 is the K20m default. Note max power (TDP) is 225W.
Voltage ranges shown: 900-1000 mV, 925-1050 mV, 987.5-1112.5 mV, 950-1062.5 mV, 912-1025 mV
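The two nvidia-smi invocations above can be wrapped for scripted sweeps over clock settings. The sketch below only builds the argument lists; the function names are hypothetical, and the clock pair in the usage comment is a placeholder that should be replaced by a pair reported by the SUPPORTED_CLOCKS query on the actual card.

```python
def supported_clocks_cmd():
    """Argument list that lists the (memory, graphics) clock pairs a GPU accepts."""
    return ["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS"]


def set_application_clocks_cmd(mem_mhz, graphics_mhz):
    """Argument list that requests application clocks as 'MEM,GRAPHICS' in MHz."""
    return ["nvidia-smi", "-ac", f"{mem_mhz},{graphics_mhz}"]


# Usage (setting clocks usually needs administrator rights):
#     import subprocess
#     subprocess.run(supported_clocks_cmd(), check=True)
#     subprocess.run(set_application_clocks_cmd(2600, 758), check=True)  # placeholder pair
print(" ".join(set_application_clocks_cmd(2600, 758)))
```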

How to really void your warranty
read.php?284014-KGB-Kepler-BIOS-Editor-Unlocker

Don’t try this at home
GPU-Z (www.techpowerup.com/gpuz/)
Kepler Bios Tweaker v1.27 (overclock.net forums)

Overclocking and xGPU performance

Drifting temperature, drifting power
Plot annotations: turned on water pump; turned off water pump

Results: clock & voltage control

Measured power vs temperature
V = V4

Measured power vs temperature
-14 W, +14 W
V = V4

How temperature affects performance
V = V4

How voltage affects performance
f = 800 MHz

Undervolted, different clock frequencies
V = V1

Heavily undervolted
DATA INVALID

Green GPU and the Test Equity 1000 Chamber of Doom
or
Thermal shock: how to really really void your warranty
Greengpu, case torn apart to fit
Temperature range: -100F to 350F (-73C to 175C)

Overclocked & Undervolted
926 MHz clock, voltage V1

Part III: Comparisons & Conclusions

Kepler vs. Fermi
Plot: power (watts) vs. temperature (°C); 926 MHz clock, voltage V1
Legend: Tesla K20m (GK110, 28nm); GeForce GTX 580 (GF110, 40nm)
Fits: P = 243 + 0.148 T + 0.00397 T^2; P = 235 + 0.548 T
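The two fitted curves on this slide can be evaluated numerically. The sketch assumes the coefficients read as P = 243 + 0.148 T + 0.00397 T^2 and P = 235 + 0.548 T, with T in °C and P in watts; the plus signs and the squared term are inferred from the garbled text. The transcript does not say which fit belongs to which GPU, so the function names are deliberately neutral.

```python
def quadratic_fit_watts(temp_c):
    """Assumed reading of the first fit: P = 243 + 0.148 T + 0.00397 T^2."""
    return 243.0 + 0.148 * temp_c + 0.00397 * temp_c ** 2


def linear_fit_watts(temp_c):
    """Assumed reading of the second fit: P = 235 + 0.548 T."""
    return 235.0 + 0.548 * temp_c


# Tabulate both fits across the temperature range shown on the plot axis.
for temp_c in range(20, 100, 10):
    print(f"{temp_c:3d} C  quadratic {quadratic_fit_watts(temp_c):6.1f} W"
          f"  linear {linear_fit_watts(temp_c):6.1f} W")
```

Under this reading both curves stay within the 240 to 290 W span of the plot's power axis between 20 and 90 °C, which supports the inferred signs.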

Maxwell results (preliminary)
GTX 750 Ti
640 CUDA cores
1020 MHz base clock
1085 MHz boost clock
60W TDP
xGPU: 1062 GFLOP/s performance
76% efficiency of code
expect 80% efficiency with code optimization
already 16 GFLOPS/W
can’t control clock / voltage (yet)
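The 76% figure can be checked from the other numbers on the slide. The sketch assumes peak single-precision throughput is cores × 2 FLOP per cycle (one fused multiply-add) × clock, and that the boost clock is the relevant one; neither assumption is stated in the talk. The implied power is simply the quoted throughput divided by the quoted GFLOPS/W.

```python
def peak_gflops(cuda_cores, clock_mhz, flops_per_core_per_cycle=2):
    """Theoretical peak in GFLOP/s, assuming one FMA (2 FLOP) per core per cycle."""
    return cuda_cores * flops_per_core_per_cycle * clock_mhz / 1000.0


MEASURED_GFLOPS = 1062.0   # xGPU result quoted on the slide
GFLOPS_PER_WATT = 16.0     # efficiency quoted on the slide

peak = peak_gflops(640, 1085)              # boost clock
efficiency = MEASURED_GFLOPS / peak
implied_power_w = MEASURED_GFLOPS / GFLOPS_PER_WATT

print(f"peak {peak:.1f} GFLOP/s, efficiency {efficiency:.1%}, "
      f"implied power {implied_power_w:.1f} W")
```

With the boost clock the ratio comes out at about 76%, matching the slide. The implied power draw is a little above the 60 W TDP.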

Conclusions
12.96 to 18.34 GFLOPS/W, i.e. a 41% increase, by overclocking and undervolting the K20m for xGPU.
Maxwell is 16 GFLOPS/W for xGPU out of the box!
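As a quick check of the quoted improvement, using only the two efficiency figures above:

```latex
\frac{18.34 - 12.96}{12.96} = \frac{5.38}{12.96} \approx 0.415
```

That is a gain of roughly 41.5%, in line with the 41% stated on the slide.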

Conclusions
xGPU: a useful CUDA benchmark and “burn-in” tool
Lower temperature means lower power consumption
Overclocking w/o overvolting: good
Overclocking + undervolting: better
Tesla K20m is pretty hard to brick

Acknowledgements
Robert Kimberk (CfA lab tech)
Overclocking community
NVIDIA
CfA: Danny Price, Ben Barsdell, Lincoln Greenhill
NVIDIA: Mike Clark, Ron Babich
dprice@cfa.harvard.edu

Fire and Ice: How Temperature Affects GPU Performance
Author: Danny Price
Subject: Is it worth cooling your GPUs, or should you run them hot? In this session, we discuss how operating temperature affects the computational performance of GPUs. Temperature-dependent leakage current effects ...