ClimateSpark: A High-Performance Framework for Big Climate Data Analytics

…e Data Analytics
Fei Hu¹, Zhenlong …
NASA Goddard Space Flight Center

Outline
Ø Science driver
Ø Architecture
Ø Methodology
Ø Experiment
Ø Conclusion

Science Driver
…tructure
Challenges:
1) High-dimensional data
2) Big Data: volume, different data formats and contents
3) Time-…
…a using Spark and advanced GIS methodologies.

Why we need Spark
Ø Faster: about 10 to 100 times faster than the Hadoop-based MapReduce framework by leveraging in-memory processing [1].
Ø Iterative computations: consider the workflow of Map… …ng and graph analysis
Ø …l core of fine-grained, lightweight, composable …
Ø …r to use than MapReduce.
But pure Spark could not: read array-…
…rk/about

Architecture
Ø Ease the data analytics process
Ø Data analytics in a more powerful and easier fashion
Ø Fast spatiotemporal query
Ø Big data storage

…ile:
Every variable has a B-tree.
Each leaf node:
- Logical data info: variable name, geospatial range, temporal range, chunk corner, chunk shape
- Physical data pointer: … node list
… processing of big array-based climate data with MapReduce. …ce, pp. 1-19.
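The leaf-node layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `LeafNode` and `SpatiotemporalIndex`, the file paths, and the byte offset and length fields are hypothetical, and a per-variable list kept sorted by start time stands in for the per-variable B-tree.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_lat, min_lon, max_lat, max_lon)


@dataclass(frozen=True)
class LeafNode:
    # Logical data info (fields named on the slide)
    variable: str
    geo_range: BBox
    time_range: Tuple[int, int]      # (start, end); hours since an epoch is an assumption
    chunk_corner: Tuple[int, ...]
    chunk_shape: Tuple[int, ...]
    # Physical data pointer (hypothetical layout)
    file_path: str
    byte_offset: int
    byte_length: int
    node_list: Tuple[str, ...]       # cluster nodes that hold the chunk


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """True when two lat/lon boxes share at least one point."""
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


class SpatiotemporalIndex:
    """One sorted list per variable, standing in for one B-tree per variable."""

    def __init__(self) -> None:
        self._by_variable: Dict[str, List[LeafNode]] = {}

    def insert(self, leaf: LeafNode) -> None:
        leaves = self._by_variable.setdefault(leaf.variable, [])
        starts = [existing.time_range[0] for existing in leaves]
        leaves.insert(bisect_left(starts, leaf.time_range[0]), leaf)

    def query(self, variable: str, bbox: BBox, time_range: Tuple[int, int]) -> List[LeafNode]:
        t0, t1 = time_range
        hits = []
        for leaf in self._by_variable.get(variable, []):
            if leaf.time_range[0] > t1:
                break  # leaves are sorted by start time, so nothing later can match
            if leaf.time_range[1] >= t0 and boxes_overlap(leaf.geo_range, bbox):
                hits.append(leaf)
        return hits


GLOBAL_BOX = (-90.0, -180.0, 90.0, 180.0)
index = SpatiotemporalIndex()
index.insert(LeafNode("var_a", GLOBAL_BOX, (24, 47), (1, 0, 0), (1, 91, 144),
                      "day2.nc", 0, 4096, ("node1",)))
index.insert(LeafNode("var_a", GLOBAL_BOX, (0, 23), (0, 0, 0), (1, 91, 144),
                      "day1.nc", 0, 4096, ("node2",)))
index.insert(LeafNode("var_b", GLOBAL_BOX, (0, 23), (0, 0, 0), (1, 91, 144),
                      "day1.nc", 4096, 4096, ("node2",)))

hits = index.query("var_a", (0.0, 0.0, 10.0, 10.0), (30, 40))
```

The query returns only the leaf nodes whose logical ranges intersect the request, so a reader would follow the physical pointers of the hits and never open the other files.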

climateRDD
Ø climateRDD
Ø Chunk … …lename)
o SI/O variable name, geospatial boundary (polygon), temporal range, chunk corner, chunk shape
o Value: Array

How to implement this process? ClimateSpark
o Spatiotemporal index: retrieve the chunks
o climateRDD: describe chunks, and store their values
o climateRDD transformation: …on, interpolation, etc.
o climateRDD action: data analytics, data visualization
o ClimateSpark SQL: data analytics in the SQL style
RDD: Resilient Distributed Dataset
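The steps listed above (retrieve chunks, describe them, transform them, then run an action) can be illustrated with a small, Spark-free Python sketch. Everything here is an assumption made for illustration: `Chunk`, `retrieve_chunks`, `transform`, `scale_values`, and `mean_action` are hypothetical names, plain lists stand in for a distributed RDD, and the scaling step is just a placeholder transformation.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, List, Tuple


@dataclass
class Chunk:
    # Descriptive metadata plus the chunk's values (a small 2D array here)
    variable: str
    time_range: Tuple[int, int]
    corner: Tuple[int, int]
    values: List[List[float]]


def retrieve_chunks(catalog: List[Chunk], variable: str,
                    time_range: Tuple[int, int]) -> List[Chunk]:
    """Step 1: select the chunks whose variable and time range match the query."""
    t0, t1 = time_range
    return [c for c in catalog
            if c.variable == variable and c.time_range[0] <= t1 and c.time_range[1] >= t0]


def scale_values(factor: float) -> Callable[[Chunk], Chunk]:
    """Build a per-chunk transformation that multiplies every value by `factor`."""
    def apply(chunk: Chunk) -> Chunk:
        scaled = [[v * factor for v in row] for row in chunk.values]
        return Chunk(chunk.variable, chunk.time_range, chunk.corner, scaled)
    return apply


def transform(chunks: List[Chunk], fn: Callable[[Chunk], Chunk]) -> List[Chunk]:
    """Step 2: a transformation maps chunks to new chunks (eagerly in this sketch)."""
    return [fn(c) for c in chunks]


def mean_action(chunks: List[Chunk]) -> float:
    """Step 3: an action reduces per-chunk partial sums to a single result."""
    def partial(chunk: Chunk) -> Tuple[float, int]:
        flat = [v for row in chunk.values for v in row]
        return sum(flat), len(flat)

    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                          map(partial, chunks), (0.0, 0))
    if count == 0:
        raise ValueError("no values selected")
    return total / count


catalog = [
    Chunk("var_a", (0, 23), (0, 0), [[1.0, 2.0], [3.0, 4.0]]),
    Chunk("var_a", (24, 47), (0, 0), [[5.0, 6.0], [7.0, 8.0]]),
    Chunk("var_b", (0, 23), (0, 0), [[100.0]]),
]

selected = retrieve_chunks(catalog, "var_a", (0, 47))
doubled = transform(selected, scale_values(2.0))
result = mean_action(doubled)
```

Computing a (sum, count) pair per chunk and merging the pairs keeps the reduction associative, which is the property a distributed reduce relies on.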

Spatiotemporal query
ChunkRDD, PolygonRDD
(1) projection transformation, zoom in/out, interpolate
…DD, climateRDD
(3) overlay

A spatiotemporal query example
Global bounding box
1980/01/01 00:30 … 23:30
1980/01/01 00:30 … 23:30
… longwave flux at TOA)

Taylor-diagram Service

Taylor-diagram Service
[Figure labels: (90, 180), (45, 90)]
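For readers unfamiliar with Taylor diagrams: a diagram of this kind places each dataset by its standard deviation, its correlation with a reference, and the centered root-mean-square difference between the two. The Python sketch below computes those three statistics for two equal-length series. It is an assumption-laden illustration, not the service's code: the function name `taylor_statistics` and the sample values are hypothetical, and area weighting of grid cells is left out.

```python
import math
from typing import Dict, Sequence


def taylor_statistics(reference: Sequence[float], model: Sequence[float]) -> Dict[str, float]:
    """Return the statistics a Taylor diagram plots for `model` against `reference`."""
    n = len(reference)
    if n == 0 or n != len(model):
        raise ValueError("series must be non-empty and of equal length")

    mean_ref = sum(reference) / n
    mean_mod = sum(model) / n
    dev_ref = [r - mean_ref for r in reference]   # anomalies about the mean
    dev_mod = [m - mean_mod for m in model]

    std_ref = math.sqrt(sum(d * d for d in dev_ref) / n)
    std_mod = math.sqrt(sum(d * d for d in dev_mod) / n)
    if std_ref == 0.0 or std_mod == 0.0:
        raise ValueError("correlation is undefined for a constant series")

    correlation = sum(a * b for a, b in zip(dev_ref, dev_mod)) / (n * std_ref * std_mod)
    # Centered RMS difference: RMS of the difference between the two anomaly series
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_ref, dev_mod)) / n)

    return {"std_ref": std_ref, "std_model": std_mod,
            "correlation": correlation, "crmsd": crmsd}


# Hypothetical sample series, for illustration only
stats = taylor_statistics([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0])
```

The three quantities satisfy crmsd² = std_model² + std_ref² − 2 · std_model · std_ref · correlation, which is the law-of-cosines relation that lets all of them share one plot.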

Web Portal

Experiment Environment
Data: MERRA2 MAT1NXINT product (…
…n: 1/2 * 5/8
Chunk size: 91 * 144 (pixels)
Number of chunks: 23.94 billion
Spark-Yarn Cluster: 1 master node … …AM, 12.5 Gbps, Ubuntu 14.04
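To make the chunk size concrete, the sketch below works out how a global grid divides into 91 * 144 chunks. It rests on assumptions that the slide does not state outright: that "1/2 * 5/8" is the grid spacing in degrees of latitude by longitude, that the grid starts at latitude -90 and longitude -180, and that latitude includes both poles. The function names are hypothetical.

```python
import math
from typing import Tuple

LAT_STEP, LON_STEP = 0.5, 0.625      # assumed meaning of "1/2 * 5/8": degrees lat x lon
CHUNK_ROWS, CHUNK_COLS = 91, 144     # chunk size from the slide (pixels)


def grid_shape() -> Tuple[int, int]:
    """Rows and columns of one global 2D slice under the assumptions above."""
    rows = int(round(180 / LAT_STEP)) + 1   # both poles are grid rows
    cols = int(round(360 / LON_STEP))       # longitude wraps, so no extra column
    return rows, cols


def chunks_per_slice(rows: int, cols: int) -> Tuple[int, int]:
    """Number of chunks along each axis; edge chunks may be only partly filled."""
    return math.ceil(rows / CHUNK_ROWS), math.ceil(cols / CHUNK_COLS)


def chunk_corner(lat: float, lon: float) -> Tuple[int, int]:
    """Corner (first row, first column) of the chunk containing a lat/lon point."""
    _, cols = grid_shape()
    row = int(round((lat + 90.0) / LAT_STEP))
    col = int(round((lon + 180.0) / LON_STEP)) % cols
    return (row // CHUNK_ROWS) * CHUNK_ROWS, (col // CHUNK_COLS) * CHUNK_COLS


rows, cols = grid_shape()
layout = chunks_per_slice(rows, cols)
equator_corner = chunk_corner(0.0, 0.0)
```

Under these assumptions each 2D slice is a 361 by 576 grid that splits into 4 by 4 chunks, with the last row of chunks covering only 88 grid rows.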

Experiment
Figure 1. … monthly mean when varying the query time

Conclusion
Ø … Spark to support spatiotemporal query of the array-… …, and visualize them in an animated style.
Ø ClimateSpark designs climateRDD as the atomic data abstraction to represent the multiple-dimensional array-… …formations and actions on these array-based datasets.
Ø …l to compare different climate models (CMAP, GPCP, MERRA-1, MERRA-2, CFSR, ERA-Interim)

Acknowledgements
Ø This work is supported by NASA HEC, AIST and NCCS, and the NSF Spatiotemporal Innovation Center (IIP-1338925).
Ø Dr. Chaowei Yang commented on and advised this research.
Ø Colleagues from the NSF Spatiotemporal Innovation Center helped with data preparation, testing, and data analyses.
Ø Earlier research was done in collaboration with Dr. Zhenlong Li.

References
… …pril. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings … …d Implementation (pp. 2-2). USENIX Association.
… …ets. HotCloud, 10, pp. 10-10.
… , 2006, July. HDF5-… …ts using fast bitmap indices. In … …th International Conference on (pp. 149-158). IEEE.
… …015, October. SciSpark: Applying in-… …on and tracking. In Big Data (Big Data), 2015 IEEE International Conference on (pp. 2020-2026). IEEE.
… …uting Framework for Processing Large-Scale Spatial Data.
… for efficient processing of big array-based climate data with MapReduce. …ce, pp. 1-19.

Thank you! Any questions or comments?

Spatiotemporal Index
Li, Z., Hu, F., Schnase, J.L., Duffy, D.Q., Lee, T., Bowen, M.K. and Yang, C., 2016. A spatiotemporal indexing approach for efficient processing of ….