DMon: Efficient Detection And Correction Of Data Locality . - USENIX

Transcription

DMon: Efficient Detection and Correction of Data Locality Problems Using Selective Profiling
Tanvir Ahmed Khan, Ian Neal, Gilles Pokam, Barzan Mozafari, Baris Kasikci

- Millions of dollars in management and energy cost
- Planet-scale carbon footprint
- 700 ms
- Running that single search query requires 8 processor cores [1]!
- Active usage of millions of processor cores to serve the planet!
[1] Memory Hierarchy for Web Search, HPCA 2018

CPU Performance of Google Web Search [1]
[Chart: breakdown of CPU cycles. Labels: Doing useful work, Fetching instructions, Decoding instructions, Bad speculation, Execution units unavailable, Memory stalls. Values as extracted: 21%, 32%, 8%, 15%, 14%, 10%. Callout: memory stalls due to poor data locality.]
[1] AsmDB: understanding and mitigating front-end stalls in warehouse-scale computers, ISCA 2019

Existing Techniques & Why They Fall Short
Compiler Optimizations
- Automatically improve data locality via program transformation
- No run-time overhead
- Rely on static heuristics
- Can sometimes even hurt performance
Dynamic Profilers
- Help developers identify and resolve poor data locality
- Accurate execution information
- Mostly manual repair
- High profiling overhead when used to detect data locality issues

DMon’s Contributions
- Selective profiling to detect data locality problems accurately and efficiently
- Apply specific compiler optimizations based on profiling results
- Evaluation showing the efficiency of selective profiling and effectiveness of targeted optimizations
  - Negligible (less than 2%) overhead
  - 17% average speedup for popular benchmarks from PARSEC, SPLASH, NPB
  - 7% average speedup for PostgreSQL

DMon’s Design
- Continuous in-production monitoring to identify data locality problems
- In-house static analysis to identify memory access pattern
- In-house static transformations to optimize locality
[Diagram: design overview; legible label: Static Memory Access Pattern Analysis]

- Leverage the hierarchical Topdown approach from Intel
- Not all problems are related to data locality
- Only focus on a small subtree related to data locality
[Diagram: Data Locality Tree, Layer 1 to Layer 4. Legible node labels: Bad Speculation; L1 Bound, L2 Bound, L3 Bound, DRAM Bound; L1 cache misses, L2 cache misses, L3 cache misses.]
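The idea of profiling only a small subtree can be pictured as a lookup structure. The following is a minimal Python sketch, not DMon's implementation: the names `DATA_LOCALITY_TREE` and `samples_to_collect` are hypothetical, and the pairing of each bound category with a cache-miss kind is an assumption read off the order of the labels on the slide.

```python
# Hypothetical encoding of the data-locality subtree named on the slide.
# Assumption: each "bound" category is paired with the cache-miss kind
# that follows it in the slide's label order.
DATA_LOCALITY_TREE = {
    "L2 Bound": "L1 cache misses",
    "L3 Bound": "L2 cache misses",
    "DRAM Bound": "L3 cache misses",
}


def samples_to_collect(bound_categories):
    """Return the cache-miss sample kinds for categories inside the subtree.

    Categories that are not part of the subtree (for example
    "Bad Speculation") are skipped, since the slide notes that not all
    problems are related to data locality.
    """
    return [
        DATA_LOCALITY_TREE[category]
        for category in bound_categories
        if category in DATA_LOCALITY_TREE
    ]
```

For example, `samples_to_collect(["Bad Speculation", "L3 Bound"])` keeps only the entry for `"L3 Bound"`, because `"Bad Speculation"` lies outside the subtree.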

Incremental Monitoring
- Monitor the program execution in short time slices
- Incrementally enable more detailed profiling
- Can identify even different locality problems at various program phases
[Timeline figure: time axis marked 0, p (100 ms), 2p, 3p. Layer 1: Back-end Bound, 10%. Layer 2: Memory Bound, 10%. Layer 3: L2 or L3 or DRAM Bound, 10%. Layer 4: collect cache miss samples.]
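The incremental scheme on this slide can be sketched as a small state machine that moves one layer deeper per time slice. This is an illustrative Python sketch under stated assumptions, not the authors' code: `read_slice` is a hypothetical callback standing in for a real performance-counter reader, the "10%" on the slide is assumed to be a trigger threshold, and returning to layer 1 after a sampling slice is also an assumption.

```python
from typing import Callable, Dict, List

# Assumption: the "10%" next to layers 1-3 on the slide is a trigger threshold.
THRESHOLD = 0.10

# Layers 1-3 as labelled on the slide; each entry lists the metrics
# checked while monitoring at that layer.
LAYERS: List[List[str]] = [
    ["Back-end Bound"],
    ["Memory Bound"],
    ["L2 Bound", "L3 Bound", "DRAM Bound"],
]


def monitor(read_slice: Callable[[List[str]], Dict[str, float]],
            num_slices: int) -> List[int]:
    """Return the indices of time slices in which layer 4 (cache-miss
    sampling) would run.

    read_slice is a hypothetical callback: given the metric names to
    measure during the next time slice, it returns their values as
    fractions between 0 and 1.
    """
    layer = 0
    sampled: List[int] = []
    for index in range(num_slices):
        if layer == len(LAYERS):
            # Layer 4: this slice would collect cache-miss samples.
            sampled.append(index)
            layer = 0  # assumption: restart from layer 1 afterwards
            continue
        values = read_slice(LAYERS[layer])
        if any(values.get(name, 0.0) > THRESHOLD for name in LAYERS[layer]):
            layer += 1  # enable more detailed profiling in the next slice
        else:
            layer = 0   # nothing suspicious: fall back to the cheapest layer
    return sampled


def fake_reader(metrics: List[str]) -> Dict[str, float]:
    """Stand-in reader that reports every requested metric at 25%."""
    return {name: 0.25 for name in metrics}


print(monitor(fake_reader, 8))  # prints [3, 7]
```

With the stand-in reader, every layer exceeds the threshold, so sampling happens in the fourth slice of each cycle, matching the 0, p, 2p, 3p progression on the timeline. A program phase that stays below the threshold never leaves layer 1, which is where the low overhead of this style of monitoring would come from.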

Offline Analysis and [...]tion
[Diagram; legible label: Determine Structure Access Pattern]

Evaluation Summary
- Efficiency
  - On average 1.36% overhead
  - 9x lower overhead than state-of-the-art data locality profiler
- Effectiveness
  - Accurately detect data locality problems for benchmarks from PARSEC, SPLASH-2X, and NPB suites
  - On average 16.83% and up to 53.14% speedup
  - 20% more speedup than state-of-the-art profile-guided data locality optimizer
- Real-world case studies
  - PostgreSQL, Apache-spark page-rank, and others

Performance Speedup on PostgreSQL
[Bar chart: x-axis TPC-H Queries (1 to 22), y-axis Speedup (%) from 0 to 20; higher is better.]
DMon speeds up PostgreSQL by 7% on average

DMon: Data Locality Optimizations via Selective Profiling
- Selective profiling to detect data locality problems accurately and efficiently
- Apply specific optimizations based on profiling results
- 17% speedup with negligible (less than 2%) overhead
github.com/efeslab/DMon-AE
takh@umich.edu
[Diagram: design overview; legible label: Static Memory Access Pattern Analysis]
