How To Lie With Statistics*

Transcription

Benchmarking: How to Lie with Statistics*
*Darrell Huff, How to Lie with Statistics, Norton, New York, 1954

The only reliable way to measure performance is by running actual applications on real hardware. If we want to compare performance across different contexts, this implies use of a benchmark.

Standard Performance Evaluation Corporation
http://www.spec.org/
Many benchmarks, most commonly CPU
CPU2006 / 2000 / 95 / 92 [published results]
Choice of integer or floating point
Each is a suite (12 integer, 17 floating point)
C, C++, Fortran, statically compiled & linked

SPEC CINT 2006
Benchmark: Brief Description
400.perlbench: Based on Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff
401.bzip2: Julian Seward's bzip2 version 1.0.3, modified to work in memory
403.gcc: gcc V3.2 targeting an AMD Opteron
429.mcf: Network simplex public transport scheduler
445.gobmk: Plays the game of Go
456.hmmer: Protein sequence analysis using profile hidden Markov models
458.sjeng: Chess program that also plays several variants
462.libquantum: Simulates a quantum computer
464.h264ref: H.264/AVC video compression
471.omnetpp: OMNeT++ discrete event simulator modeling an Ethernet network
473.astar: Pathfinding library for 2D maps, including A* search
483.xalancbmk: A modified version of Xalan-C++, for transforming XML

SPEC CFP2006 Part 1
Benchmark: Brief Description
410.bwaves: 3D transonic viscous flow
416.gamess: Quantum chemistry
433.milc: Lattice gauge field generator
434.zeusmp: Astrophysics CFD (computational fluid dynamics)
435.gromacs: Molecular dynamics
436.cactusADM: Einstein equation solver
437.leslie3d: Large eddy CFD
444.namd: Biology molecular dynamics

SPEC CFP2006 Part 2
Benchmark: Brief Description
447.dealII: Finite element analysis
450.soplex: Simplex linear algorithm
453.povray: Ray tracing
454.calculix: Structural analysis
459.GemsFDTD: Solves 3D Maxwell equations
465.tonto: Quantum chemistry w/ OO Fortran
470.lbm: Lattice Boltzmann fluid flow simulation
481.wrf: Weather model
482.sphinx3: Speech recognition

SPEC History
(H&P Fig. 1.16)
Note how few benchmarks persist for multiple generations

Typical CINT Summary
Company and model
Dates

Typical CINT Summary
What they quote in marketing material

Typical CINT Summary
What naive people think is more realistic
What’s the difference?

Base Rules
1. No naming benchmarks or routines
2. No library substitution
3. No feedback-directed optimizations
4. Only safe optimizations
5. Same optimizations for all
6. No assertions to guide optimization

Base vs Peak
Base sounds more realistic
Peak is “no holds barred, anything goes”
So why is it naive to think base is more meaningful?
Need to look deeper

Individual Results
Run each benchmark three times, divide each run by a reference time (so higher score is better), use median values to compute a summary average of ratios. Sounds reasonable.
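To make that scoring procedure concrete, here is a minimal Python sketch of a SPEC-style summary score. The benchmark names are real CINT2006 benchmarks, but the reference and measured times are made up for illustration, and SPEC's actual tooling differs in detail.

```python
import statistics
from math import prod

# Hypothetical reference times (seconds) and three measured runs per benchmark.
reference = {"400.perlbench": 9770, "401.bzip2": 9650, "403.gcc": 8050}
measured = {
    "400.perlbench": [512.0, 515.3, 510.8],
    "401.bzip2": [690.2, 688.9, 701.4],
    "403.gcc": [402.7, 399.5, 405.1],
}

# Ratio = reference time / median measured time, so higher is better.
ratios = {
    name: reference[name] / statistics.median(runs)
    for name, runs in measured.items()
}

# Summarize with the geometric mean of the per-benchmark ratios (the SPEC way).
geo_mean = prod(ratios.values()) ** (1 / len(ratios))

for name, r in ratios.items():
    print(f"{name}: {r:.2f}")
print(f"Summary (geometric mean): {geo_mean:.2f}")
```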

Graphically
What’s up with this?
Note how many are below the “average”

How to Average?
The usual way (arithmetic mean)
The SPEC way (geometric mean)
Both are sensitive to outliers
A little effort to improve one benchmark yields a much better average overall

Another Average
When averaging ratios, harmonic mean yields a value proportional to the total
Short-running applications have less influence on total time
Harmonic mean is less sensitive to outliers
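As a rough illustration of how the choice of mean changes the summary, the sketch below computes the arithmetic, geometric, and harmonic means of a made-up set of ratios in which one benchmark has been heavily tuned; the particular numbers are not from the slides.

```python
from math import prod
from statistics import mean, harmonic_mean

# Hypothetical per-benchmark ratios; the last benchmark is an outlier.
ratios = [10.0, 11.0, 12.0, 13.0, 60.0]

arith = mean(ratios)
geo = prod(ratios) ** (1 / len(ratios))
harm = harmonic_mean(ratios)

print(f"arithmetic mean: {arith:.1f}")  # pulled up strongly by the outlier
print(f"geometric mean:  {geo:.1f}")    # pulled up, but less strongly
print(f"harmonic mean:   {harm:.1f}")   # dominated by the slower benchmarks
```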

Example

Using Harmonic Mean
Now half are above the mean
(Chart shows summary values 21.2 and 24.6)

Omitting the Outlier
About 12% difference
(Chart shows summary values 20.0 and 23.0)

How Common is This?

Are any Different?

How About SPEC FP?

So?
If they all do it, aren’t the numbers meaningful in a relative sense?

So?
Consider this example:
(Chart shows summary values 23.6 and 25.0)

So?
How does deleting the outlier and using the harmonic mean change the results?
(Chart: 23.6 becomes 20.3, and 25.0 becomes 19.8)
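The apparent point of these numbers is that the ranking can flip: the system that looked faster under the geometric mean of all benchmarks looks slower once the outlier is dropped and the harmonic mean is used. The sketch below reproduces that effect with made-up ratios for two systems A and B (not the slide's actual data).

```python
from math import prod
from statistics import harmonic_mean

# Hypothetical SPEC-style ratios for two systems over the same four benchmarks.
# System A has one benchmark (the last) with an unusually high, tuned ratio.
system_a = [10.0, 12.0, 11.0, 60.0]
system_b = [14.0, 15.0, 13.0, 16.0]

def geo_mean(xs):
    return prod(xs) ** (1 / len(xs))

# Full suite, geometric mean: A appears faster.
print(f"geomean A = {geo_mean(system_a):.1f}, geomean B = {geo_mean(system_b):.1f}")

# Drop the outlier benchmark from both systems and use the harmonic mean:
# now B appears faster, so the ranking has flipped.
print(f"harmonic A = {harmonic_mean(system_a[:-1]):.1f}, "
      f"harmonic B = {harmonic_mean(system_b[:-1]):.1f}")
```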

“Benchmark Engineering”
There are obvious ways to enhance performance using the SPEC CPU peak rules:
Profile-directed feedback
Special libraries
Unsafe optimizations
Different optimization options
Assertions to guide optimization
What else can you think of?

“Benchmark Engineering”
Single user/diagnostic mode
Strip down the kernel to minimum services
Disable network interface, user I/O
Lengthen OS quantum
Hand-pick processor board and memory
Use fastest disk (15K RPM or SSD)
Reformat disk with longer sectors
Make compiler recognize benchmarks
Turn off multithreading
Specially cool the processor chip

“Benchmark Engineering”
Commercial benchmarks report results that you are guaranteed never to exceed (or even match)

Amdahl’s Law
Gene Amdahl
Architect for IBM 709, Stretch, 360
Left IBM to form his own company, building IBM mainframe “clones”
Observed that speeding up one aspect of an architecture has limited value

Amdahl’s Law
Overall speedup = 1 / ((1 - fraction affected) + (fraction affected / speedup of the affected part))
Even if X% of a processor’s execution time is improved infinitely, only that X% is removed from the total
The remaining (100 - X)% dominates
If 99% disappears, 1% remains, so at most a 100x speedup
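A small Python helper makes the arithmetic on this slide easy to check; the function name and example inputs are illustrative, but the formula is a direct transcription of the one above, with the fraction affected expressed as a value between 0 and 1.

```python
def amdahl_speedup(fraction_affected: float, local_speedup: float) -> float:
    """Overall speedup when only part of the execution time is improved.

    fraction_affected: portion of original execution time that benefits (0..1).
    local_speedup: how much faster that portion becomes.
    """
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)

# 99% of the time sped up "infinitely" (approximated by a huge factor):
# the remaining 1% limits the overall speedup to at most 100x.
print(amdahl_speedup(0.99, 1e12))   # ~100.0
# A more modest case: 50% of the time made 2x faster gives only ~1.33x overall.
print(amdahl_speedup(0.50, 2.0))    # ~1.33
```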

Desikan
Validation of software simulation of architecture
Compares a real Alpha to simulations
Identifies sources of error with microbenchmarks
Shows results with macrobenchmarks

Simulator Error

Simulator Error

Discussion

Hill CAECW 2002
Commercial workloads are different
Big memory and disk
Nondeterminism
Benchmarks run for hours

OLTP
Database benchmark
Reduce size
Zero think time
Super-fast disk
10K transaction warm-up (real machine), 1K run (sim)

SPECjbb
Transaction processing in Java
1.8 GB heap to minimize GC
500 MB data per warehouse
100K warmup, 100K run

Apache
10 SURGE clients per processor
Zero think time
2K file repository with 50 MB
80K warmup, 2.5K run

Slashcode
Dynamic web page generation
3K messages, 5 MB total
240 transactions warmup, 50 run

Barnes-Hut
N-body simulation
Numerical benchmark for comparison
64K bodies

The Workload

Variation

Discussion
