High Performance Computing For Data Intensive Science

Transcription

High Performance Computing for Data Intensive Science
John R. Johnson
Computational Sciences & Mathematics Division
Fundamental & Computational Sciences Directorate
john.johnson@pnl.gov

Data Intensive High Performance Computing
Traditional Computational Sciences
  Computations have spatial and temporal locality
  Problems fit into memory
  Methods require high precision arithmetic
  Data is static
Data Intensive Sciences: Problems where data is the dominating factor
  Speed, Volume, Complexity, Uncertainty
  Computations have no or little locality
  Problems do not fit into memory
  Variable precision or integer based arithmetic
  Data is dynamic
Data Intensive Research Areas
  Discovering algorithms for real-time processing and analysis of raw data from high throughput scientific instruments
  Developing techniques to interactively analyze massive (PB) archives
  Quantifying uncertainty in data, models, and methods
  Designing methods for signature exploitation and discovery
  Developing new techniques for scientific data storage and management that actively enable analytics
  Understanding the structure, mechanics, and dynamics of complex real-world networks
  Modeling, simulation, and analysis of large-scale networks
  Developing scalable mathematical techniques for manipulating and transforming large complex networks
Example domains: Climate Modeling, Supernovae, Engineered Systems, Materials, MD, Electrical Grid, Biological Networks, High Energy Physics, Astrophysics

21st Century Scientific Method
Diagram nodes: Theory, Computation, Experiment, Data
  Theory is developed and explored through Computation
  Theory suggests hypotheses that are verified through Experiment
  Hypotheses are discovered in Data and drive Theory
  Computations generate Data
  Experiments generate Data
  Computations inform the design of Experiments

Cyber Analytics: Canonical Problem for Data Intensive Science
  Analysis needs to identify malicious activity in high-throughput streaming data
  More than 10 billion transactions/day
  Tens of millions of unique IP addresses observed each month
  Adjacency matrix may contain over a quadrillion elements but is sparse, with billions of values
  Tens of TBs to PBs of raw data
  Patterns can span seconds to months
  Current data analysis tools operate on thousands to hundreds of thousands of [...]
Diagram labels: Analytics, Signature detection, Graph analytics
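The sparsity claim on this slide can be illustrated with a minimal sketch. It stores only the observed (source, destination) pairs instead of a full IP-by-IP matrix. The function names, the sample transactions, and the 32 million figure (one value inside the slide's "tens of millions" range) are assumptions chosen for illustration, not part of the talk.

```python
from collections import defaultdict

def build_adjacency(transactions):
    """Map each source IP to the set of destination IPs it contacted."""
    adjacency = defaultdict(set)
    for src, dst in transactions:
        adjacency[src].add(dst)
    return adjacency

def dense_cells(num_ips):
    """Number of cells a full num_ips x num_ips adjacency matrix would hold."""
    return num_ips * num_ips

def stored_edges(adjacency):
    """Number of nonzero entries actually kept by the sparse representation."""
    return sum(len(dsts) for dsts in adjacency.values())

# Hypothetical sample transactions (not from the talk).
transactions = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.2"),  # repeated pair, stored only once
]

adjacency = build_adjacency(transactions)
unique_ips = {ip for pair in transactions for ip in pair}

print(dense_cells(len(unique_ips)))   # 9 cells in the dense 3 x 3 matrix
print(stored_edges(adjacency))        # 3 nonzero entries actually stored

# Assumed scale: 32 million unique IPs gives a dense matrix
# of about 1.02e15 cells, i.e. over a quadrillion.
print(dense_cells(32_000_000) > 10**15)  # True
```

At that scale the dense matrix is far too large to hold in memory, while the billions of nonzero values the slide mentions are small enough to store and query.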

PNNL Capabilities Leveraging
Diagram labels: [...]tics, Signature detection, Graph analytics
Signature Detection: New initiative for FY11 start

Data-intensive HPC architectures: Solving the problem of irregular data access
Preprocessing Data Before it gets to the CPU
  Assumes problems will be disk-bound
  Allows for online compression/decompression
  Can filter and reduce data as it is being requested
  The Netezza TwinFin processes data between disk and CPU by compressing and filtering data using reconfigurable hardware (FPGAs)
  Diagram labels: Compress, Project, Restrict, FPGA, Memory, DMA, CPU, NIC
Multithreading in Hardware
  Assumes no locality
  Hides latency for applications that don’t have locality
  When memory references stall computation, switch to new computational thread
  The Cray XMT can switch between threads of execution at the hardware level in a single cycle; each processor supports 128 concurrent threads
  Diagram labels: Instruction stream (Available for execution, Stalled for Memory Ref), Thread Context Switch, Registers, ALUs
Combining approaches optimizes data extraction at the PB-EB level and latency tolerant computing at the TB-PB level
  Diagram labels: PBs EBs, TBs PBs, Extract TB network from raw PB data for PGA
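The "preprocess before the CPU" idea can be sketched in software, assuming a streaming pipeline that decompresses, restricts (filters rows), and projects (selects columns) before the analysis code sees the data. This is only an analogy for what the slide attributes to FPGA hardware. The function names, the record layout (source, destination, port, bytes), and the sample data are hypothetical.

```python
import zlib

def read_blocks(compressed_blocks):
    """Decompress each block on the fly and yield one parsed record per line."""
    for block in compressed_blocks:
        text = zlib.decompress(block).decode("utf-8")
        for line in text.splitlines():
            yield line.split(",")

def restrict(rows, predicate):
    """Keep only the rows that satisfy the predicate (row filtering)."""
    return (row for row in rows if predicate(row))

def project(rows, columns):
    """Keep only the requested column positions (column selection)."""
    return (tuple(row[i] for i in columns) for row in rows)

# Hypothetical records: source IP, destination IP, port, bytes.
raw = [
    "10.0.0.1,10.0.0.9,22,512",
    "10.0.0.2,10.0.0.9,80,2048",
    "10.0.0.3,10.0.0.7,22,128",
]
blocks = [
    zlib.compress("\n".join(raw[:2]).encode("utf-8")),
    zlib.compress(raw[2].encode("utf-8")),
]

# Only port-22 records, and only their (source, destination) columns,
# reach the consumer; everything else is dropped inside the stream.
ssh_pairs = list(
    project(restrict(read_blocks(blocks), lambda r: r[2] == "22"), (0, 1))
)
print(ssh_pairs)  # [('10.0.0.1', '10.0.0.9'), ('10.0.0.3', '10.0.0.7')]
```

Because each stage is a generator, records are filtered and reduced as they are requested, so the consumer never materializes the full decompressed data set.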

Data Intensive Supercomputing Architecture
Enabling researchers and analysts from desktop to supercomputer
  Supermicro Server
  Netezza Twin Fin 12
  LSI 7900
  Cray XMT: 128 Threads per node, 128 nodes, 1 TB memory

Enabling Commercial Tools and Technology
Tableau Business Intelligence Software

Data-intensive computing: Analysis of massive aggregates
4.2 Billion Records – 4 weeks
  1. Anomalous spike in traffic from Switzerland
  2. Analyzing traffic from Switzerland shows anomalous traffic is of one type – Reset Packets (over 7 million)
  3. Further analysis shows evidence of port-scans just prior to Switzerland traffic
By leveraging supercomputers, analysts have the ability to aggregate across billions of records (terabytes of data) using commercial desktop tools to perform sophisticated analysis in minutes rather than days
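The first two drill-down steps on this slide can be sketched on a toy data set: aggregate by country to find the spike, then break the spiking country down by packet type. The records, the median-based threshold, and the function names are assumptions made for illustration. The talk does not say how the anomaly was detected.

```python
from collections import Counter
from statistics import median

# Hypothetical (country, packet_type) records standing in for billions of rows.
records = (
    [("CH", "RST")] * 8 + [("CH", "SYN")]
    + [("US", "SYN")] * 2 + [("US", "ACK")]
    + [("DE", "SYN")] * 2
    + [("FR", "ACK")] * 2
)

def count_by(rows, key_index):
    """Aggregate record counts by the field at key_index."""
    return Counter(row[key_index] for row in rows)

def anomalous_keys(counts, factor=3.0):
    """Flag keys whose count exceeds factor times the median count (assumed rule)."""
    baseline = median(counts.values())
    return [key for key, count in counts.items() if count > factor * baseline]

# Step 1: aggregate by country and look for a spike.
by_country = count_by(records, 0)
spikes = anomalous_keys(by_country)
print(spikes)  # ['CH']

# Step 2: drill into the spiking country and aggregate by packet type.
ch_records = [row for row in records if row[0] == "CH"]
by_type = count_by(ch_records, 1)
print(by_type.most_common(1))  # [('RST', 8)]
```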

Graph analytics: Deep analytics of attack signatures
  Over a trillion nodes
  Over ½ PB simulated network traffic data
  Multi-hop path analysis
  48 minutes
  8 rack Netezza Twin Fin
  Linear scalability
Bad IPs, Protected Systems: Easy to detect direct connections between bad actors and protected systems by monitoring network header traffic
Bad IPs, Protected Systems: Extremely difficult to detect whether attack moves through intermediate nodes, especially low and slow attacks that span many months and are embedded in petabytes of data
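The difference between direct and multi-hop detection can be sketched as a bounded breadth-first search from known bad IPs, on a toy graph. The function name, the node names, and the hop limit are hypothetical, and this is not the method run on the Netezza system described above.

```python
from collections import deque

def multi_hop_paths(edges, bad_ips, protected, max_hops):
    """Return {protected_node: shortest path from a bad IP} within max_hops edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    paths = {}
    visited = set(bad_ips)
    queue = deque((ip, [ip]) for ip in bad_ips)
    while queue:
        node, path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted along this path
        for nxt in graph.get(node, []):
            if nxt in visited:
                continue
            visited.add(nxt)
            new_path = path + [nxt]
            if nxt in protected:
                paths[nxt] = new_path  # record the hit, do not search past it
            else:
                queue.append((nxt, new_path))
    return paths

# Hypothetical connection log as directed (source, destination) edges.
edges = [
    ("bad1", "relayA"),
    ("relayA", "relayB"),
    ("relayB", "server1"),
    ("bad1", "server2"),
    ("hostX", "server3"),
]
bad_ips = ["bad1"]
protected = {"server1", "server2", "server3"}

# One hop finds only the direct connection.
print(multi_hop_paths(edges, bad_ips, protected, max_hops=1))
# Three hops also exposes the route through two intermediate relays.
print(multi_hop_paths(edges, bad_ips, protected, max_hops=3))
```

In this sketch, server3 is never reported because no path from a bad IP reaches it, and the search stops at the first protected system on each path.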
