Transcription
e Data analytics
Fei Hu1, Zhenlong NASA Goddard Space Flight Center
Outline: Science driver, Architecture, Methodology, Experiment, Conclusion
Science Driver
tructure Challenges:
1) High-dimensional data
2) Big Data: volume, different data formats and contents
3) Time-
a using Spark and advanced GIS methodologies.
Why we need Spark
Faster: about 10 to 100 times faster than the Hadoop-based MapReduce framework by leveraging in-memory processing [1].
Iterative computations: consider the workflow of Map ng and graph analysis
l core of fine-grained, lightweight, composable
r to use than MapReduce.
But the pure Spark could not: read array-
rk/about
Architecture
- Ease the data analytics process
- Data analytics in a more powerful and easier fashion
- Fast spatiotemporal query
- Big data storage
ile:
Each leaf node:
- Logical data info: variable name, geospatial range, temporal range, chunk corner, chunk shape
Every variable has a B-tree
- Physical data pointer: node list
processing of big array-based climate data with MapReduce. ce, pp. 1-19.
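The leaf-node layout above can be sketched as a small data structure. This is only an illustration, not the ClimateSpark implementation: a list sorted by start time stands in for the per-variable B-tree, and the names `LeafNode`, `VariableIndex`, the variable name `T2M`, the file paths, and the hour-based time encoding are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[float, float]  # inclusive (low, high)


@dataclass(frozen=True)
class LeafNode:
    # Logical data info, following the fields listed on the slide.
    variable: str
    lat_range: Range
    lon_range: Range
    time_range: Range            # hours since an arbitrary epoch (assumption)
    corner: Tuple[int, ...]
    shape: Tuple[int, ...]
    # Physical data pointer: hypothetical file path plus the nodes holding it.
    path: str
    nodes: Tuple[str, ...]


def overlaps(a: Range, b: Range) -> bool:
    """True when two inclusive ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


class VariableIndex:
    """One index per variable; a sorted list stands in for the B-tree."""

    def __init__(self, leaves: List[LeafNode]):
        self.leaves = sorted(leaves, key=lambda leaf: leaf.time_range[0])
        self.starts = [leaf.time_range[0] for leaf in self.leaves]

    def query(self, lat: Range, lon: Range, time: Range) -> List[LeafNode]:
        # Leaves that start after the query window ends cannot match.
        stop = bisect_right(self.starts, time[1])
        return [leaf for leaf in self.leaves[:stop]
                if leaf.time_range[1] >= time[0]
                and overlaps(leaf.lat_range, lat)
                and overlaps(leaf.lon_range, lon)]


index = VariableIndex([
    LeafNode("T2M", (-90, 0), (-180, 0), (0, 23), (0, 0, 0), (24, 91, 144),
             "/data/a.nc4", ("node1",)),
    LeafNode("T2M", (0, 90), (0, 180), (0, 23), (0, 91, 144), (24, 91, 144),
             "/data/a.nc4", ("node2",)),
    LeafNode("T2M", (0, 90), (0, 180), (24, 47), (0, 91, 144), (24, 91, 144),
             "/data/b.nc4", ("node1",)),
])
hits = index.query(lat=(10, 20), lon=(30, 40), time=(0, 12))
print([(leaf.path, leaf.nodes) for leaf in hits])
```

Because each hit carries both the chunk corner/shape and the node list, a scheduler could in principle read only the matching chunks and place the work on a node that already stores them.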
climateRDD
- climateRDD
- Chunk lename)
  - SI/O variable name, geospatial boundary (polygon), temporal range, chunk corner, chunk shape
  - Value Array
How to implement this process? ClimateSpark
- Spatiotemporal index: retrieve the chunks
- climateRDD: describe chunks, and store their values
- climateRDD transformation: on, interpolation, etc.
- climateRDD action: data analytics, data visualization
- ClimateSpark SQL: data analytics in the SQL style
RDD: Resilient Distributed Dataset
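The split between transformations and actions can be illustrated without a cluster. The sketch below uses plain Python generators as a stand-in for an RDD: a transformation lazily describes new key-value records, and an action consumes them to produce a result. The names `ChunkKey`, `scale`, and `mean`, and the variable name `longwave_flux`, are hypothetical and are not part of the ClimateSpark API.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class ChunkKey:
    # A subset of the chunk description fields named on the slides.
    variable: str
    time_range: Tuple[str, str]
    corner: Tuple[int, int]
    shape: Tuple[int, int]


Record = Tuple[ChunkKey, List[float]]  # (key, flattened array values)


def scale(records: Iterable[Record], factor: float) -> Iterator[Record]:
    """Transformation: lazily yields new records; nothing runs until consumed."""
    for key, values in records:
        yield key, [v * factor for v in values]


def mean(records: Iterable[Record]) -> float:
    """Action: consumes the records and returns a single number."""
    partials = ((sum(values), len(values)) for _, values in records)
    total, count = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]), partials, (0.0, 0)
    )
    return total / count


day = ("1980-01-01T00:30", "1980-01-01T23:30")
records = [
    (ChunkKey("longwave_flux", day, (0, 0), (1, 2)), [1.0, 3.0]),
    (ChunkKey("longwave_flux", day, (0, 2), (1, 2)), [5.0, 7.0]),
]
result = mean(scale(records, 2.0))
print(result)
```

The per-chunk `(sum, count)` pairs are combined with an associative merge, which is the property that lets the same reduction run in parallel across partitions in Spark.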
Spatiotemporal query
ChunkRDD PolygonRDD
(1) projection transformation, zoom in/out, interpolate
DD climateRDD
(3) overlay
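The overlay step can be pictured as masking a chunk's grid cells with a query polygon. The following is a minimal sketch assuming a regular latitude/longitude grid and a simple ray-casting point-in-polygon test; it is not the ClimateSpark overlay, and `point_in_polygon`, `overlay`, and the grid parameters are illustrative names.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(lon: float, lat: float, polygon: List[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed,
        # so y1 != y2 here and the division is safe.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def overlay(values: List[List[float]], lat0: float, lon0: float,
            dlat: float, dlon: float,
            polygon: List[Point]) -> List[Tuple[float, float, float]]:
    """Keep (lat, lon, value) for cells whose centers fall in the polygon."""
    kept = []
    for i, row in enumerate(values):
        lat = lat0 + (i + 0.5) * dlat
        for j, value in enumerate(row):
            lon = lon0 + (j + 0.5) * dlon
            if point_in_polygon(lon, lat, polygon):
                kept.append((lat, lon, value))
    return kept


chunk = [[1.0, 2.0],
         [3.0, 4.0]]                       # rows = latitude, cols = longitude
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 2.0)]
selected = overlay(chunk, lat0=0.0, lon0=0.0, dlat=1.0, dlon=1.0,
                   polygon=square)
print(selected)
```

In a distributed setting the same function could be applied per chunk, so only cells inside the polygon are passed on to later analytics.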
A spatiotemporal query example
Global bounding box
1980/01/01 00:30 23:30
1980/01/01 00:30 23:30
longwave flux at toa)
Taylor-diagram Service
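A Taylor diagram summarizes how well a model field matches a reference using three statistics: the standard deviations of the two fields, their correlation coefficient, and the centered root-mean-square difference. The sketch below computes them for two equal-length series; how the service itself computes or weights them is not stated on the slides, so the function name and the unweighted population statistics are assumptions.

```python
import math
from typing import Dict, Sequence


def taylor_statistics(reference: Sequence[float],
                      model: Sequence[float]) -> Dict[str, float]:
    """Unweighted statistics plotted on a Taylor diagram."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_m = sum(model) / n
    dev_r = [r - mean_r for r in reference]
    dev_m = [m - mean_m for m in model]
    std_r = math.sqrt(sum(d * d for d in dev_r) / n)
    std_m = math.sqrt(sum(d * d for d in dev_m) / n)
    corr = sum(a * b for a, b in zip(dev_r, dev_m)) / (n * std_r * std_m)
    # Centered RMS difference: means are removed before differencing.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_r, dev_m)) / n)
    return {"std_ref": std_r, "std_model": std_m,
            "correlation": corr, "crmsd": crmsd}


reference = [1.0, 2.0, 3.0, 4.0]
stats_scaled = taylor_statistics(reference, [2.0, 4.0, 6.0, 8.0])
stats_mixed = taylor_statistics(reference, [2.0, 1.0, 4.0, 3.0])
print(stats_scaled)
print(stats_mixed)
```

The three quantities are tied together by the law of cosines, crmsd^2 = std_ref^2 + std_model^2 - 2 * std_ref * std_model * correlation, which is what allows a single point on the diagram to encode all of them.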
Web Portal
Experiment Environment
Data: MERRA2 MAT1NXINT product (
n: 1/2 * 5/8
Chunk size: 91 * 144 (pixels)
Number of chunks: 23.94 billion
Spark-Yarn Cluster: 1 master node AM, 12.5 Gbps, Ubuntu 14.04
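As a rough check on how the chunk size relates to the grid, the arithmetic below works out how many 91 * 144 chunks tile one global 2D slice. It assumes that 1/2 and 5/8 are the latitude and longitude steps in degrees, that the grid is global with both poles included, and that 91 runs along latitude and 144 along longitude; none of these is spelled out on the slide.

```python
from fractions import Fraction
from math import ceil

lat_step = Fraction(1, 2)   # degrees (assumed latitude step)
lon_step = Fraction(5, 8)   # degrees (assumed longitude step)

# Latitude spans -90..90 inclusive; longitude wraps, so no extra point.
n_lat = int(Fraction(180) / lat_step) + 1
n_lon = int(Fraction(360) / lon_step)

chunk_lat, chunk_lon = 91, 144
chunks_per_slice = ceil(n_lat / chunk_lat) * ceil(n_lon / chunk_lon)

print(n_lat, n_lon, chunks_per_slice)
```

Under these assumptions each time step of a 2D variable splits into a 4 by 4 grid of chunks, with the last latitude row only partly filled.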
Experiment
Figure 1. monthly mean when varying the query time
Conclusion
Spark to support spatiotemporal query of the array-
, and visualize them in an animated style.
ClimateSpark designs climateRDD as the atomic data abstraction to represent the multi-dimensional array-
formations and actions on these array-based datasets.
l to compare different climate models (CMAP, GPCP, MERRA-1, MERRA-2, CFSR, ERA-Interim)
Acknowledgements
This work is supported by NASA HEC, AIST and NCCS, and the NSF Spatiotemporal Innovation Center (IIP-1338925).
Dr. Chaowei Yang commented on and advised this research.
Colleagues from the NSF Spatiotemporal Innovation Center helped with data preparation, testing, and data analyses.
Earlier research was done in collaboration with Dr. Zhenlong Li.
References
pril. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings d Implementation (pp. 2-2). USENIX Association.
ets. HotCloud, 10, pp. 10-10.
, 2006, July. HDF5- ts using fast bitmap indices. In th International Conference on (pp. 149-158). IEEE.
015, October. SciSpark: Applying in- on and tracking. In Big Data (Big Data), 2015 IEEE International Conference on (pp. 2020-2026). IEEE.
uting Framework for Processing Large-Scale Spatial Data.
for efficient processing of big array-based climate data with MapReduce. ce, pp. 1-19.
Thank you! Any questions or comments?
Spatiotemporal Index
Li, Z., Hu, F., Schnase, J.L., Duffy, D.Q., Lee, T., Bowen, M.K. and Yang, C., 2016. A spatiotemporal indexing approach for efficient processing of .