(cuSZ+) Optimizing Error-Bounded Lossy Compression for Scientific Data

Transcription

(cuSZ+) Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs

Jiannan Tian, Sheng Di†, Xiaodong Yu†, Cody Rivera§, Kai Zhao‡, Sian Jin, Yunhe Feng, Xin Liang, Dingwen Tao, Franck Cappello†

Washington State University
†Argonne National Laboratory
§The University of Alabama
‡University of California, Riverside
University of Washington
Missouri University of Science and Technology

Thursday, Sept. 9, 2021
CLUSTER ’21, Portland, Oregon, USA

Supercomputing Systems

Compute capability has grown much faster than storage capacity and bandwidth: there is a widening gap between compute units and storage bandwidth (PF/SB), and between main memory size and storage bandwidth (MS/SB).

supercomputer      year    class        PF            MS        SB         MS/SB   PF/SB
Cray Jaguar        2008    1 pflops     1.75 pflops   360 tb    240 gb/s   1.5k    7.3k
Cray Blue Waters   2012    10 pflops    13.3 pflops   1.5 pb    1.1 tb/s   1.3k    13k
Cray CORI          2017    10 pflops    30 pflops     1.4 pb    1.7 tb/s   0.8k    17k
IBM Summit         2018    100 pflops   200 pflops    10 pb     2.5 tb/s   4k      80k
Fujitsu Fugaku     2020    "ExaScale"   537 pflops    4.85 pb   1.5 tb/s   3.2k    358k
Intel Aurora       future  ExaScale     1 eflops      10 pb     25 tb/s    0.4k    40k

Table 3: Classes of supercomputers showing their performance, MS and SB.

PF: peak FLOPS. MS: memory size. SB: storage bandwidth.
Footnotes on the slide: when using burst buffer; counting only DDR4; Rpeak, TOP 500 for November 2020.
Sources: F. Cappello (ANL); DDN Newsroom.

Data-Heavy Scientific Applications

HACC (cosmology simulation)
  data scale: 20 PB per one-trillion-particle simulation
  passive solution (?): use up the FS (26 PB for Mira@ANL); 5h30m to store on NSF Blue Waters with I/O at 1 TBps
  in need: 10x reduction

CESM (climate simulation)
  data scale: storage takes 20% vs. 50% of the h/w budget (2013 vs. 2017)
  in need: 10x reduction

APS-U (High-Energy X-Ray Beams Experiments), brain initiatives
  data scale: hundreds of PB
  passive solution (?): a 100-PB buffer, or a connection at 100 GBps
  in need: 100x reduction

What does data reduction of scientific applications entail?

Under the context of a huge imbalance between compute capability and data management, three aspects matter:

quality: Strict and diverse error-control mechanisms are required for scientific discovery and accurate post-analysis.

reduction: Today’s scientific research is data driven at a large scale (simulations or instruments). A reduction rate of 10x is in demand.

throughput: “. . . , the rate of data that can be computed on the Summit supercomputer is five orders of magnitude greater than the bandwidth of its parallel file system. The I/O bottleneck is one driver of in situ analysis.”

Error-Bounded Lossy Compression

randomness: non-repeated, non-patterned

[Figure: bit patterns of std::rand() output (parity) compared with float data (CESM-CLDHGH). Annotations: randomness (noise), mantissa randomness.]

Error-Bounded Lossy Compression

CESM-CLDHGH, data range: 0.89401052, error bound: 1e-4, relative to the range.

[Figure: bit patterns at three stages, with bit randomness decreasing from left to right.]
  float (32 effective bits): mantissa randomness
  int (13 effective bits, incl. signum): after rounding and type-casting, integerized with regard to the error bound
  unsigned int (8 effective bits): after prediction, recording the errors of predicting each value from the previous one
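The three stages on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the cuSZ kernels: the function names are hypothetical, the error bound is taken as an absolute value `eb` (a range-relative bound would first be multiplied by the data range), and the predictor is the simplest 1D case (previous value, with 0 assumed before the first element).

```python
def prequantize(values, eb):
    """Integerize floats on a grid of width 2*eb (rounding, type-casting).

    Each value then sits within eb of its grid point.
    """
    return [round(v / (2 * eb)) for v in values]


def quantcodes_1d(ints):
    """Record the error of predicting each integer from the previous one.

    Assumption for this sketch: the value before the first element is 0.
    """
    codes, prev = [], 0
    for d in ints:
        codes.append(d - prev)
        prev = d
    return codes


def dequantize(ints, eb):
    """Map grid indices back to floating-point values."""
    return [d * 2 * eb for d in ints]


eb = 1e-4
data = [0.50003, 0.50021, 0.50044, 0.50038]
ints = prequantize(data, eb)          # large integers
codes = quantcodes_1d(ints)           # small integers after the first one
restored = dequantize(ints, eb)       # each within eb of the input
```

After the first element, the recorded codes are small integers clustered around zero, which is what makes the later lossless stage effective.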

SZ, cuSZ, and cuSZ+

(compared along three axes: throughput, reduction, quality)

CPU-SZ
  data-fidelity and compressibility oriented
  error-control framework [Di and Cappello 2016]
  quantization, unified N-D predictor, higher compressibility [Tao et al. 2017]
  linear regression, higher compressibility, esp. for data viz [Liang et al. 2018]

cuSZ
  performance-oriented CUDA implementation, featuring a changed data generation scheme that enables massive parallelism [Tian et al. 2020]

cuSZ+
  this work: expand usability by enhancing throughput, or CR w/o compromising throughput

Design: Overview of cuSZ

[Diagram: cuSZ pipeline]
  compress
    pred-quant (Lorenzo construct.): (3.1) error control, (3.2) quantization, (3.3) gather outlier
    lossless encode: (4) histogram, (5) multibyte Huff.
  decompress
    lossless decode: (d, 1.1) Huff. decode, (d, 1.2) scatter outlier
    reversed pred-quant: (d, 2) Lorenzo reconstruct.

framework
  prequantize, predict, postquantize: (1) record error-controlling quantcode, (2) create compressibility
  histogram
  lossless encode: make use of compressibility

problem statement (cuSZ)
  Huffman-only: reduction rate capped at 32 (float)
  extra pattern-finding stage: low performance
  Lorenzo reconstruction is coarse-grained.

Design: Overview of cuSZ+

[Diagram: cuSZ+ pipeline, shown next to the cuSZ pipeline]
  compress
    (3.1) error control, (3.2) quantization, (3.3) gather outlier
    (4.1) hist., (4.2) decide, then choose one of:
      (5.a) RLE (+VLE)
      (5.b) multibyte Huff.
  decompress
    (d, 1.1a) reversed RLE (+VLE), or (d, 1.1b) Huff. decode
    (d, 1.2) scatter outlier
    (d, 2) Lorenzo reconstruct.

solution (cuSZ+)
  workflow-optimized run-length encoding (RLE): RLE (+ optional VLE)
  Lorenzo reconstruction: N-D partial-sum

Design, Compressibility-Awareness (1/4): Context

Figure 1: Error-controlling quantcodes can be 1- or multi-byte, depending on the diversity of quantcodes.

dataset     scheme   1e-2            1e-3           1e-4
HACC        qg       22.72 (1.1)     7.58 (0.8)     3.89 (0.8)
            qh       20.33 (1.0)     9.51 (1.0)     4.82 (1.0)
            qhg      31.02 (1.5)     10.01 (1.1)    5.01 (1.0)
Hurricane   qg       43.67 (1.8)     18.41 (1.1)    10.31 (1.1)
            qh       24.80 (1.0)     17.04 (1.0)    9.76 (1.0)
            qhg      58.76 (2.4)     24.65 (1.4)    12.99 (1.3)
CESM        qg       61.21 (2.5)     20.78 (1.1)    9.98 (1.0)
            qh       24.24 (1.0)     18.38 (1.0)    10.29 (1.0)
            qhg      75.50 (3.1)     28.13 (1.5)    12.50 (1.2)
Nyx         qg       118.94 (3.9)    28.25 (1.2)    12.87 (0.8)
            qh       30.24 (1.0)     23.92 (1.0)    15.27 (1.0)
            qhg      164.39 (5.4)    40.17 (1.7)    17.95 (1.2)

Table 4: Averaged compression ratios (per dataset) of different compression schemes on 109 fields of 4 datasets with 3 error bounds of 10^-2, 10^-3, 10^-4 (relative to value range). q denotes quantcodes as the starting point, h denotes customized variable-length encoding (multibyte-symbol Huffman coding), g denotes the gzip-featured scheme. “ab” denotes that scheme a precedes scheme b. Values in parentheses are the ratio to qh at the same error bound.

Design, Compressibility-Awareness (2/4): Using RLE

What is RLE?
  a form of lossless compression [Robinson and Cherry 1967]: runs of data (sequences of consecutive same-value elements) are stored as value-count tuples
  example: aaaaaaaaaaaabbcaaaaaaaa becomes (a,12)(b,2)(c,1)(a,8)
  pro: performs well on “smooth” data
  con: overhead from storing the count (no prefix property like Huffman coding)

How to determine the “smoothness”?
  inspiration: variogram [Cressie and Hawkins 1980], measuring multidimensional data variance; spatial data, sampling based
    \[ 2\gamma(s_1, s_2) = \operatorname{var}\big(Z(s_1) - Z(s_2)\big) = E\big[\big(Z(s_1) - Z(s_2)\big)^2\big], \]
    where $Z(s)$ is a spatial random field
  variant: madogram, using the absolute difference $|Z(s_1) - Z(s_2)|$ in place of the squared difference
  further adjust to a binary difference,
    \[ \begin{cases} 0 & v_\text{this} = v_\text{next}, \\ 1 & v_\text{this} \neq v_\text{next}, \end{cases} \]
    whose averaged value (per distance) is the RLE roughness; smoothness = 1 - roughness
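A short Python sketch of the two ideas on this slide, run-length encoding and the binary-difference roughness, is given here for illustration. The function names are hypothetical and the code is sequential; it is not the GPU implementation, only a restatement of the definitions above.

```python
def rle_encode(seq):
    """Collapse runs of equal consecutive elements into (value, count) tuples."""
    runs = []
    for v in seq:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, count) tuples back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def roughness(seq, distance=1):
    """Average binary difference between elements `distance` apart.

    0 means every compared pair is equal; 1 means every pair differs.
    """
    pairs = len(seq) - distance
    if pairs <= 0:
        return 0.0
    differing = sum(1 for i in range(pairs) if seq[i] != seq[i + distance])
    return differing / pairs


text = "a" * 12 + "bb" + "c" + "a" * 8
runs = rle_encode(text)                 # [('a', 12), ('b', 2), ('c', 1), ('a', 8)]
smoothness = 1.0 - roughness(text)      # 3 of 22 neighboring pairs differ
```

Each run costs one value plus one count, so the encoding only pays off when runs are long, that is, when the roughness at distance 1 is low.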

Design, Compressibility-Awareness (3/4)

estimate Huffman CR from histogram
  starting with entropy: $H(X) = -\sum_i p_i \log_2 p_i$
  $\langle b \rangle$ denotes the average bit-length
  redundancy $R$ denotes the discrepancy between $\langle b \rangle_\text{Huff}$ and entropy $H(X)$, i.e., $R = \langle b \rangle_\text{Huff} - H(X)$
  $p_1$ denotes the probability of the most likely symbol

estimate CR of 1) RLE and 2) RLE+VLE
  RLE only: $CR_\text{RLE}$ [...] $32 / (\langle b \rangle_\text{Huff}$ [...] $R)$
  RLE+VLE: $CR_\text{RLE+VLE}$ [...] $CR_\text{RLE} \cdot 32 / (\langle b \rangle_\text{Huff}$ [...] $R)$, with $CR_\text{RLE}$ known

the upper and lower bounds of R
  lower bound [Johnsen 1980]: $R \ge 1 - H(p_1, 1 - p_1)$ for $p_1 \ge 0.4$ (sufficiency), where
    \[ H(p_1, 1 - p_1) = p_1 \log_2 \frac{1}{p_1} + (1 - p_1) \log_2 \frac{1}{1 - p_1} \]
  upper bound [Gallager 1978]: $R \le p_1 + 0.086$ (no restriction)
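To make the bounds concrete, the following Python sketch brackets the Huffman average bit-length (and therefore the Huffman compression ratio for 32-bit floats) using only the histogram, and compares it with the exact average length obtained by actually building the Huffman tree. All function names are hypothetical; the lower bound on R is applied only when the most likely symbol has probability of at least 0.4, and is taken as 0 otherwise, which is an assumption of this sketch.

```python
import heapq
import math


def entropy(hist):
    """Shannon entropy in bits per symbol from symbol counts."""
    total = sum(hist)
    return -sum(c / total * math.log2(c / total) for c in hist if c > 0)


def binary_entropy(p):
    """H(p, 1 - p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bitlen_bounds(hist):
    """Lower and upper estimates of the Huffman average bit-length."""
    p1 = max(hist) / sum(hist)
    h = entropy(hist)
    r_lower = 1.0 - binary_entropy(p1) if p1 >= 0.4 else 0.0
    r_upper = p1 + 0.086
    return h + r_lower, h + r_upper


def huffman_avg_bitlen(hist):
    """Exact Huffman average bit-length (needs at least two nonzero counts).

    The sum of all merged node weights equals the total encoded length.
    """
    heap = [c for c in hist if c > 0]
    heapq.heapify(heap)
    encoded_bits = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        encoded_bits += merged
        heapq.heappush(heap, merged)
    return encoded_bits / sum(hist)


hist = [90, 5, 3, 2]                      # a skewed quantcode histogram
lo, hi = bitlen_bounds(hist)
exact = huffman_avg_bitlen(hist)          # 1.15 bits per symbol
cr_range = (32 / hi, 32 / lo)             # bracket for the Huffman CR on float data
```

For this skewed histogram the lower bound is nearly tight, which matches the intuition that Huffman coding cannot go below one bit per symbol however dominant the most likely symbol is.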

Design, Compressibility-Awareness (4/4)

[Figure 2, panel (a): Smoothness against distance. Yellow line: linear regression of variances at distances.]
[Figure 2, panel (b): Relationship between smoothness and the probability of the most likely symbol.]

Figure 2: CESM FSDSC at 1e-2. Smoothness of error-controlling quantcodes, used to determine when to use RLE. For example, a threshold compression ratio can be set to 32 to find the desired smoothness, and the smoothness can be mapped to the probability of the most likely symbol. Empirical average bitwidth: 1.09 bit from Huffman coding (obtainable from histogram).

Design, Lorenzo Reconstruction

data reconstruction: $d = p + q'$
  $d$: reconstructed data
  $p$: predicted value
  $q'$: error-controlling “quantcode”
  All are integer values.

Prediction is given by the extrapolative {1,2,3}-D Lorenzo predictor:
\[ p[x] = d[x-1] \]
\[ p[y,x] = -d[y-1,x-1] + d[y-1,x] + d[y,x-1] \]
\[ p[z,y,x] = d[z-1,y-1,x-1] - d[z-1,y-1,x] - d[z-1,y,x-1] + d[z-1,y,x] - d[z,y-1,x-1] + d[z,y-1,x] + d[z,y,x-1] \]

Reconstruction as partial sums of quantcodes:
\[ d[x] = \sum_{i=0}^{x} q'[i] \quad \text{for 1D,} \]
\[ d[y,x] = \sum_{j=0}^{y} \sum_{i=0}^{x} q'[j,i] \quad \text{for 2D,} \]
\[ d[z,y,x] = \sum_{k=0}^{z} \sum_{j=0}^{y} \sum_{i=0}^{x} q'[k,j,i] \quad \text{for 3D.} \]

It is proven by (2D as the showcase)
\begin{align*}
d[y+1,x+1] &= d[y,x+1] + d[y+1,x] - d[y,x] + q'[y+1,x+1] \\
&= \sum_{j=0}^{y} \sum_{i=0}^{x+1} q'[j,i] + \sum_{j=0}^{y+1} \sum_{i=0}^{x} q'[j,i] - \sum_{j=0}^{y} \sum_{i=0}^{x} q'[j,i] + q'[y+1,x+1] \\
&= \sum_{j=0}^{y} q'[j,x+1] + q'[y+1,x+1] + \sum_{j=0}^{y+1} \sum_{i=0}^{x} q'[j,i] \\
&= \sum_{j=0}^{y+1} \sum_{i=0}^{x+1} q'[j,i] .
\end{align*}

Design, Lorenzo Reconstruction (cont’d)

[Figure 3, panel (a): Concept of 2D partial sum.]
[Figure 3, panel (b): Exemplary 2-pass partial-sum computation for 2D data reconstruction: a partial-sum along x over each row of quantcodes, followed by a partial-sum along y over the results.]

Figure 3: Example of 2D partial-sum computation for Lorenzo reconstruction in cuSZ+’s decompression.
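The two-pass scheme in Figure 3 can be checked with a small sequential Python sketch. It is only an illustration of the identity, not the CUDA kernel: the function names are hypothetical, neighbors outside the array are assumed to be 0, and each pass, written here as a plain loop, corresponds to a prefix sum that can be parallelized on a GPU.

```python
def lorenzo_quantcodes_2d(d):
    """q'[y][x] = d[y][x] - p[y][x] with the 2D Lorenzo predictor.

    Assumption for this sketch: out-of-range neighbors count as 0.
    """
    def at(y, x):
        return d[y][x] if y >= 0 and x >= 0 else 0

    return [
        [d[y][x] - (at(y - 1, x) + at(y, x - 1) - at(y - 1, x - 1))
         for x in range(len(d[0]))]
        for y in range(len(d))
    ]


def reconstruct_by_partial_sums(q):
    """Rebuild the data from quantcodes with two passes of partial sums."""
    out = [row[:] for row in q]
    for row in out:                          # pass 1: partial-sum along x
        for x in range(1, len(row)):
            row[x] += row[x - 1]
    for y in range(1, len(out)):             # pass 2: partial-sum along y
        for x in range(len(out[0])):
            out[y][x] += out[y - 1][x]
    return out


data = [[3, 4, 6],
        [5, 5, 7],
        [2, 8, 9]]
codes = lorenzo_quantcodes_2d(data)
restored = reconstruct_by_partial_sums(codes)   # equals data
```

Because neither pass depends on previously reconstructed neighbors in the other dimension, the element-by-element dependency of the direct formula $d = p + q'$ is replaced by independent row scans and column scans.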

Evaluation: Setup

platform
  V100 (SXM2, 16 GB) of TACC Longhorn
  A100 (SXM4, 40 GB) of ANL ThetaGPU
  (!) A100 of SXM4: 30% better than its PCIe variant

evaluation
  cuSZ vs. cuSZ+ on V100
    Lorenzo construction and reconstruction
    Huffman encoding (enhancement)
  scaling cuSZ+ on A100

datasets
domain                dataset     datum size    #fields       dimensions       example(s)
cosmology             HACC        1,071.75 MB   6 in total    280,953,867      x, vx
climate               CESM-ATM    24.72 MB      77 in total   1,800 x 3,600    CLDHGH, PHIS
climate               Hurricane   95.37 MB      20 in total   100 x 500 x 500  CLOUDf48, Uf48
cosmology             Nyx         512 MB        6 in total    512 x 512 x 512  baryon density
seismic wave          RTM         180.72 MB     10 in total   449 x 449 x 235  snapshot28{0.9}0
hydrodynamics         Miranda     144 MB        7 in total    256 x 384 x 384  density, pressure
Quantum Monte Carlo   QMCPACK     601.52 MB     2 in total    288x115x69x69    preconditioned

Table 5: Real-world datasets used in evaluation.

Evaluation: Workflow-RLE (1/2)

cuSZ gzipcuSZ(qhg) ref.(qh ) VLERLEgainRLE 58.9245.691.21 1.07 1.08 AEROD FRACOCNFRACODV bcar1ODV bcar2ODV dust1ODV dust2ODV dust3ODV dust4ODV ocar1ODV ocar2PHISPRECSCPRECSLoursours1.09 1.67 -1.44 1.19 -1.28 1.79 1.71 1.40 -2.99 4.57 1.26 1.49 1.43 1.99 1.71 1.36 4.28 3.48 2.59 2.69 3.76 5.34 5.04 4.09 1.15 2.28 1.79 cuSZ gzipcuSZ(qhg) ref.(qh ) VLERLEgainRLE 1.4030.6427.0724.691.49 1.06 1.80 OP PTROP TTROP ZTSMXoursours1.67 -2.48 4.57 1.33 1.44 1.02 1.11 1.27 1.23 1.10 1.03

Table 6: Data fields for which cuSZ+ with Workflow-RLE has a higher compression ratio than cuSZ with Workflow-Huffman under the 10^-2 error bound. “gain” is based on ours against (qh) VLE from cuSZ.

The additional VLE can steadily offer 3x in CR.

Evaluation: Workflow-RLE (2/2)

V100 (GB/s)A100 cuSZ142.4135.757.855.1212.6233.978.080.876.0 31.7 5.526.1 23.0 4.8122.7 31.0

Table 7: Throughputs (in GB/s) of cuSZ+ (based on RLE) and cuSZ (based on Huffman coding) on example RTM, CESM, and Nyx fields, for demonstration purposes.

Evaluation: Optimization, cuSZ vs. cuSZ+

V100                          HACC     CESM    Hurr    Nyx     QMC
size in MB                    1071.8   24.7    95.4    512.0   601.5

Lorenzo construct    cuSZ     207.7    252.1   175.8   200.2   189.6
                     cuSZ+    307.4    273.9   229.9   296.0   298.6
                     speedup  1.48     1.09    1.31    1.48    1.57

Huffman encode       cuSZ     54.1     57.2    55.2    58.8    61.0
                     cuSZ+    58.3     107.7   111.2   120.5   110.8
                     speedup  1.08     1.88    2.01    2.05    1.82

Lorenzo reconstruct  cuSZ     16.8     58.5    43.9    29.7    22.4
                     cuSZ+    313.1    254.2   218.4   238.1   255.5
                     speedup  18.64    4.35    4.97    8.02    11.41

Table 8: Performance comparison of Lorenzo and Huffman encoding kernels in cuSZ and cuSZ+ on V100. The unit is GB/s.

Evaluation: Optimization, Scaling cuSZ+ to A100

cuSZ+                         HACC     CESM    Hurr    Nyx     QMC

Lorenzo construct    V100     328.3    273.9   199.0   296.0   298.6
                     A100     501.1    466.8   429.0   481.3   492.9
                     speedup  1.53     1.70    2.16    1.63    1.65

Huffman encode       V100     58.3     107.7   111.2   120.5   110.8
                     A100     174.6    121.6   206.0   217.2   198.4
                     speedup  2.99     1.13    1.85    1.80    1.79

Lorenzo reconstruct  V100     308.7    267.0   200.1   251.7   255.5
                     A100     504.4    495.3   345.5   398.6   384.0
                     speedup  1.63     1.86    1.73    1.58    1.50

Table 9: Evaluation of cuSZ+ using the default compression workflow (Lorenzo and multibyte VLE) with a relative error bound of 10^-4 on V100 and A100: breakdown throughput (GB/s) of compression subprocedures.

reference only:
  A100 memory throughput: 1555 GB/s (1.73x V100)
  A100 compute capability: 19.49 FP32 TFLOPS (1.38x V100)

Evaluation: Overall Throughput (Default Workflow)

V100-ours, GB/s
dataset   size in MB   overall comp.   Huffman dec.   scatter outlier   ℓ reconstruct   overall decomp.
HACC      1071.8       42.1            42.1           225.0             308.7           31.8
CESM      24.7         44.8            37.9           334.8             267.0           30.2
Hurr      95.4         49.3            45.8           628.1             200.1           35.2
Nyx       512.0        53.9            66.8           359.7             251.7           46.0
RTM       180.7        52.5            48.9           440.2             201.3           36.1
Miranda   144.0        62.2            42.7           679.1             245.3           34.5
QMC       601.5        56.9            44.6           347.1             255.5           34.2

A100-ours, GB/s (and advantage over V100), compression
dataset   size in MB   ℓ construct    gather outlier   histogram      Huffman        overall comp.
HACC      1071.8       501.1 (1.53)   324.8 (1.47)     923.5 (1.63)   174.6 (2.99)   84.1 (2.00)
CESM      24.7         466.8 (1.70)   151.4 (0.94)     409.8 (1.15)   121.6 (1.13)   51.5 (1.15)
Hurr      95.4         429.0 (2.16)   284.2 (1.13)     681.2 (1.55)   206.0 (1.85)   82.2 (1.67)
Nyx       512.0        481.3 (1.63)   334.9 (1.41)     870.2 (2.34)   217.2 (1.80)   92.4 (1.72)
RTM       180.7        422.7 (2.19)   221.6 (0.89)     793.9 (1.38)   202.2 (1.64)   76.4 (1.46)
Miranda   144.0        480.7 (1.66)   336.0 (1.47)     714.9 (1.46)   201.6 (1.25)   87.6 (1.41)
QMC       601.5        492.9 (1.65)   266.2 (1.02)     569.7 (0.79)   198.4 (1.79)   79.5 (1.40)

A100-ours, GB/s (and advantage over V100), decompression
dataset   Huffman dec.   scatter outlier   ℓ reconstruct   overall decomp.
HACC      48.5 (1.15)    658.4 (2.93)      504.4 (1.63)    41.4 (1.30)
CESM      26.6 (0.70)    630.2 (1.88)      495.3 (1.86)    24.3 (0.80)
Hurr      51.8 (1.13)    918.3 (1.46)      345.5 (1.73)    43.0 (1.22)
Nyx       91.2 (1.37)    797.4 (2.22)      398.6 (1.58)    67.9 (1.47)
RTM       56.0 (1.15)    906.6 (2.06)      335.6 (1.67)    45.6 (1.26)
Miranda   50.1 (1.17)    1066.8 (1.57)     386.9 (1.58)    42.6 (1.23)
QMC       49.0 (1.10)    782.8 (2.26)      384.0 (1.50)    41.2 (1.20)

Table 10: ℓ stands for Lorenzo prediction. Evaluation of cuSZ+ using the default compression workflow (Lorenzo and multibyte VLE) with a relative error bound of 10^-4 on V100 and A100: breakdown throughput of compression subprocedures. Boldface on the slide marks underscaled entries (advantage below 1.20).

Acknowledgement (ECP)

This R&D was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations (the Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem. This repository was based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357, and also supported by the National Science Foundation under Grants SHF-1617488, SHF-1619253, OAC-2003709, OAC-1948447/2034169, and OAC-2003624.

THANK YOU

Question?

contact us
  github.com/szcompressor/cuSZ
  Jiannan Tian, jiannan.tian@wsu.edu
  Dr. Dingwen Tao, dingwen.tao@wsu.edu

SZ: A Lossy Compression Framework for Scientific Data (Argonne National Laboratory)
