Defining Architecture Components Of The Big Data Ecosystem

Transcription

Defining Architecture Components of the Big Data Ecosystem
Yuri Demchenko
SNE Group, University of Amsterdam
2nd BDDAC2014 Symposium, CTS2014 Conference
19-23 May 2014, Minneapolis, USA

Outline
• Big Data and Data Intensive Science as a new technology wave
  – The Fourth Paradigm
• Big Data definition: From 6 Vs to 5 parts
• Big Data technology drivers
  – Where do the data come from? What are Big Data drivers?
• Big Data: Paradigm change and new challenges
  – From Big Data to All-Data – Moving to data centric service models
• Defining Big Data Architecture Framework (BDAF)
  – Big Data Infrastructure (BDI) and Big Data Analytics infrastructure/tools
• Summary and Discussion
BDDAC2014 @CTS2014 | Big Data Architecture Framework | Slide 2

Big Data and Security Research at System and Network Engineering, University of Amsterdam
• Long-standing research and development on Infrastructure services and facilities
  – High speed optical networking and data intensive applications
  – Semantic description of infrastructure and network services
  – Collaborative systems, Grid, Clouds and currently Big Data
• Focus on Infrastructure definition and services
  – Software Defined Infrastructure based on Cloud/Intercloud technologies
  – Dynamically provisioned security infrastructure and services
• NIST Big Data Working Group
  – Contribution to Reference Architecture, Big Data Definition and Taxonomy, Big Data Security
• Research Data Alliance (RDA)
  – Interest Group on Education and Skills Development on Data Intensive Science
  – Big Data Analytics Interest Group
• Big Data Interest Group at UvA
  – Informal but active, meets every two weeks or monthly
  – Provided input to NIST BD-WG and RDA activities

Technology Definitions and Timeline - Overview
• Service Oriented Architecture (SOA): First proposed in 1996 and revived with the advent of Web Services in 2001-2002
  – Currently an industry standard and widely used
  – Provided a conceptual basis for Web Services development
• Computer Grids: Initially proposed in 1998 and finally shaped in 2003 with the Open Grid Services Architecture (OGSA) by Open Grid Forum (OGF)
  – Currently remains as a collaborative environment
  – Migrating to cloud and inter-cloud platforms
• Cloud Computing: Initially proposed in 2008
  – Now in a productive phase
  – Defined new features, capabilities and operational/usage models, and provided guidance for new technology development
  – Originated from the Service Computing domain, with a focus on service management
• Big Data and Data Intensive Science: Yet to be defined
  – Involves more components and processes to be included in the definition
  – Can be better defined as an Ecosystem where data are the main driving component
  – Need to define the Big Data properties and expected technology capabilities, and provide guidance/vision for future technology development

Visionaries and Drivers: Seminal works, High level reports, Activities
• The Fourth Paradigm: Data-Intensive Scientific Discovery. By Jim Gray, Microsoft, 2009. Edited by Tony Hey, et on/fourthparadigm/
• Riding the wave: How Europe can gain from the rising tide of scientific data. Final report of the High Level Expert Group on Scientific Data. October org/
• NIST Big Data Working Group (NBD-WG)
• https://www.rd-alliance.org/
• AAA Study: Study on AAA Platforms For Scientific data/information Resources in Europe, TERENA, UvA, LIBER, UinvDeb. (2011-2012)
• ISO/IEC JTC1 Big Data Study Group

The Fourth Paradigm of Scientific Research
1. Theory and logical reasoning
2. Observation or Experiment
  – E.g. Newton observed apples falling to design his theory of mechanics
  – But Galileo Galilei made experiments with falling objects from the leaning tower of Pisa
3. Simulation of theory or model
  – Digital simulation can prove theory or model
4. Data-driven Scientific Discovery (aka Data Science)
  – More data beats hypothesized theory

Gartner Technology Hype Cycle (October 2013)
[Figure: Gartner Hype Cycle chart with the positions of Big Data and Cloud Computing marked]
Source logies/hype-cycle.jsp

Our/SNE Big Data Technology Research Cycle
[Figure: hype-cycle curve with Big Data and Cloud Computing marked and SNE research phases annotated; time markers: 2011, 2012, Mid-End 2013, Mid 2014, End 2014]
Annotations on the figure:
• New style of technology development
• Technology consolidation
• Active and productive research
• Teaching on Big Data Tech/Infra
• Remote BD technology following. EU Study AAA for Research Data
• Main research in Cloud/Intercloud
• Component technologies mastering
• Education courses development
• Active research into Big Data domain definition
• Building community links
Source logies/hype-cycle.jsp

Big Data Definitions Overview
• IDC definition of Big Data (conservative and strict approach): "A new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis"
• Gartner definition: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
  – Termed as 3 parts definition, not 3V definition
• Big Data: a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.
  – From “The Big Data Long Tail” blog post by Jason Bloomberg (Jan 17, ail.html
• “Data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
  – Edd Dumbill, program chair for the O’Reilly Strata Conference

Improved: 6 (5+1) V’s of Big Data
[Figure: diagram of the 6 Vs of Big Data; labels recovered from the slide are listed below, and the Value entries were not preserved]
• Volume: Terabytes, Records/Arch, Tables, Files, Distributed
• Variety: Linked, Dynamic
• Velocity: Batch, Real/near-time, Processes, Streams
• Variability: Changing data, Changing model, Linkage
• Veracity: Trustworthiness, Authenticity, Origin, Reputation, Availability, Accountability
Generic Big Data Properties
• Volume
• Variety
• Velocity
Acquired Properties (after entering system)
• Value
• Veracity
• Variability
Commonly accepted 3V’s of Big Data
Adopted in general by NIST BD-WG

Big Data Definition: From 6V to 5 Parts (1)
(1) Big Data Properties: 5V
  – Volume, Variety, Velocity, Value, Veracity
  – Additionally: Data Dynamicity (Variability)
(2) New Data Models
  – Data Lifecycle and Variability
  – Data linking, provenance and referral integrity
(3) New Analytics
  – Real-time/streaming analytics, interactive and machine learning analytics
(4) New Infrastructure and Tools
  – High performance Computing, Storage, Network
  – Heterogeneous multi-provider services integration
  – New Data Centric (multi-stakeholder) service models
  – New Data Centric security models for trusted infrastructure and data processing and storage
(5) Source and Target
  – High velocity/speed data capture from a variety of sensors and data sources
  – Data delivery to different visualisation and actionable systems and consumers
  – Fully digitised input and output, (ubiquitous) sensor networks, full digital control

Big Data Definition: From 6V to 5 Parts (1)
(1) Big Data Properties: 5V
  – Volume, Variety, Velocity, Value, Veracity
  – Additionally: Data Dynamicity (Variability)
(2) New Data Models
  – Data linking, provenance and referral integrity
  – Data Lifecycle and Variability/Evolution
(3) New Analytics
  – Real-time/streaming analytics, interactive and machine learning analytics
(4) New Infrastructure and Tools
  – High performance Computing, Storage, Network
  – Heterogeneous multi-provider services integration
  – New Data Centric (multi-stakeholder) service models
  – New Data Centric security models for trusted infrastructure and data processing and storage
(5) Source and Target
  – High velocity/speed data capture from a variety of sensors and data sources
  – Data delivery to different visualisation and actionable systems and consumers
  – Fully digitised input and output, (ubiquitous) sensor networks, full digital control

Big Data Definition: From 6V to 5 Parts (2)
Refining Gartner definition
“Big data is (1) high-volume, high-velocity and high-variety information assets that demand (3) cost-effective, innovative forms of information processing for (5) enhanced insight and decision making”
• Big Data (Data Intensive) Technologies are targeting to process (1) high-volume, high-velocity, high-variety data (sets/assets) to extract intended data value and ensure high-veracity of original data and obtained information that demand cost-effective, innovative forms of data and information processing (analytics) for enhanced insight, decision making, and process control; all of those demand (should be supported by) new data models (supporting all data states and stages during the whole data lifecycle) and new infrastructure services and tools that also allow obtaining (and processing) data from a variety of sources (including sensor networks) and delivering data in a variety of forms to different data and information consumers and devices.
(1) Big Data Properties: 5V
(2) New Data Models
(3) New Analytics
(4) New Infrastructure and Tools
(5) Source and Target

Big Data Nature: Origin and Target (consumers)
Big Data Origin
• Science
• Internet, Web
• Industry
• Business
• Living Environment, Cities
• Social media and networks
• Healthcare
• Telecom/Infrastructure
Data Transformation
Big Data Target Use
• Scientific discovery
• New technologies
• Manufacturing, processes, transport
• Personal services, campaigns
• Living environment support
• Healthcare support
• Social Networking
Volume, Velocity, Variety & Value, Veracity, Variability

Big Data technology drivers (1)
• Modern e-Science in search for new knowledge
  – Scientific experiments and tools are becoming bigger and heavily based on data processing and mining
  – The long tail of science
• Traditional data intensive industry
  – Genomic research, drugs development, Healthcare
  – High-tech industry, CAD/CAM, weather/climate, etc.
• Customer facing industry and companies
  – Advertisement, retail business, service delivery
• Intelligence and security
• Network/infrastructure management
  – Network monitoring, Intrusion detection, troubleshooting

The Long Tail of Science (aka “Dark Data”)
• Collectively “Long Tail” science is generating a lot of data
  – Estimated as over 1PB per year and it is growing fast with the new technology proliferation
• 80-20 rule: 20% users generate 80% data but not necessarily 80% knowledge
Source: Dennis Gannon (Microsoft), NIST Big Data Workshop, 2012

Big Data technology drivers – Technology Loop
• Technology loop (known as Jevons Paradox)
  – Increased efficiency to process current demand will create new uses and increase demand even more
[Figure: Elastic Demand for Work] A doubling of fuel efficiency more than doubles work demanded, increasing the amount of fuel used. Jevons paradox occurs.
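
The Jevons paradox statement on this slide can be checked with a small arithmetic sketch. It assumes a constant-elasticity demand curve, which is a modelling choice made here for illustration and is not stated in the slides; the function names, the `scale` constant and the elasticity values are likewise hypothetical.

```python
from typing import Tuple


def work_and_fuel(efficiency: float, elasticity: float,
                  fuel_price: float = 1.0, scale: float = 100.0) -> Tuple[float, float]:
    """Return (work demanded, fuel used) for a given efficiency.

    Assumption: demand for work follows work = scale * cost ** (-elasticity),
    where cost is the fuel cost of one unit of work.
    """
    cost_per_work = fuel_price / efficiency      # better efficiency makes work cheaper
    work = scale * cost_per_work ** (-elasticity)
    fuel = work / efficiency                     # fuel needed to deliver that work
    return work, fuel


def efficiency_gain_effect(elasticity: float, gain: float = 2.0) -> Tuple[float, float]:
    """Return (work ratio, fuel ratio) after efficiency improves by `gain`."""
    work_before, fuel_before = work_and_fuel(1.0, elasticity)
    work_after, fuel_after = work_and_fuel(gain, elasticity)
    return work_after / work_before, fuel_after / fuel_before


if __name__ == "__main__":
    for e in (0.5, 1.5):
        work_ratio, fuel_ratio = efficiency_gain_effect(e)
        print(f"elasticity={e}: work x{work_ratio:.2f}, fuel x{fuel_ratio:.2f}")
```

Under this model the fuel ratio equals gain ** (elasticity - 1), so total fuel use rises only when demand is elastic (elasticity above 1), which is the case the slide describes.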

Big Data technology drivers (2) – Managing public campaigns, e.g. election, public relations
• The rise of public opinion stored in platforms like Twitter, Google, Facebook, etc. provides enough intelligence to influence the campaign development, timing, geography and even the colour of the campaign signs
  – Twitter was a major source of data aggregation for the Republican Race in the US
  – Multimillion-dollar contract for data management and collection services awarded May 1, 2013 to Liberty Work to build advanced list of voters
• Article “In Data we trust” by T. Edsall in The New York Times
  – Book: In Data We Trust: How Customer Data is Revolutionising Our Economy (Aug 2012)
• A strategy for tomorrow's data world

NIST Big Data Working Group (NBD-WG) and ISO/IEC JTC1 Study Group on Big Data (SGBD)
• Started June 2013 - http://bigdatawg.nist.gov/home.php
  – Weekly calls, open participation, mailing list
• Targeted formal delivery Autumn 2014 of a set of NIST documents http://bigdatawg.nist.gov/V1 output docs.php
  Volume 1: NIST Big Data Definitions
  Volume 2: NIST Big Data Taxonomies
  Volume 3: NIST Big Data Use Case & Requirements (co-chair Geoffrey Fox)
  Volume 4: NIST Big Data Security and Privacy Requirements
  Volume 5: NIST Big Data Architectures White Paper Survey
  Volume 6: NIST Big Data Reference Architecture
  Volume 7: NIST Big Data Technology Roadmap
• ISO/IEC Study Group on Big Data (SGBD) http://jtc1bigdatasg.nist.gov/home.php
  – Term: December 2013 to September 2014
  – Extends NIST BDWG activity and scope
  – 2nd meeting hosted in Amsterdam 13-16 May 2014

2nd ISO/IEC SGBD meeting 13-16 May 2014: Discussions and results
• Two days workshop 13-14 May 2014
  – EU and NL focus, UvA activities
• Refining Big Data technology definition and Big Data Architecture definition
• New items proposed
  – Big Data market aspects
  – Data ownership (including during data lifecycle/staging and aggregation)
  – Opacity (obfuscation) data linkage during processing
  – Data linkage

From Big Data to All-Data – Paradigm Change
• Breaking paradigm changing factor
  – Data storage and processing
  – Security
  – Identification and provenance
• Traditional model
  – BIG Storage and BIG Computer with FAT pipe
  – Move compute to data vs Move data to compute
• New Paradigm
  – Continuous data production
  – Continuous data processing
  – DataBus as a Data container and Protocol
[Figure labels: Move or not to move?; Big Data; Network?; Distributed Big Data Storage; Data Abstraction; Data Bus; Infrastructure Abstraction; Distributed Compute and Analytics]
DataBus: (1) Data Container (2) Metadata, State (3) Data Transfer Protocol
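
The three DataBus parts named on this slide (data container; metadata and state; data transfer protocol) could be sketched roughly as below. This is only an illustration under assumptions: the slides do not define an API, so `DataContainer`, `TransferState`, `transfer` and the checksum-based integrity check are all hypothetical names and choices.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict


class TransferState(Enum):
    """(2) State carried with the data (hypothetical state names)."""
    PRODUCED = "produced"
    IN_TRANSIT = "in_transit"
    DELIVERED = "delivered"


@dataclass
class DataContainer:
    """(1) Data container plus (2) its metadata and state."""
    payload: bytes
    metadata: Dict[str, str] = field(default_factory=dict)
    state: TransferState = TransferState.PRODUCED

    def checksum(self) -> str:
        # Content hash used to check that the payload survived a transfer.
        return hashlib.sha256(self.payload).hexdigest()


def transfer(container: DataContainer,
             channel: Callable[[bytes], bytes]) -> DataContainer:
    """(3) A toy transfer protocol: send, verify integrity, update state.

    `channel` stands in for whatever transport moves the bytes.
    """
    expected = container.checksum()
    container.state = TransferState.IN_TRANSIT
    received = channel(container.payload)
    if hashlib.sha256(received).hexdigest() != expected:
        raise ValueError("payload changed during transfer")
    container.state = TransferState.DELIVERED
    return container


if __name__ == "__main__":
    c = DataContainer(b"sensor reading 42", {"origin": "sensor-1"})
    transfer(c, lambda payload: payload)  # loopback channel for illustration
    print(c.state, c.metadata)
```

The point of the sketch is that metadata and state travel with the data rather than with the host that processes it, in line with the data-centric direction of the talk.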

Moving to Data-Centric Models and Technologies
• Current IT and communication technologies are host based or host centric
  – Any communication or processing is bound to the host/computer that runs the software
  – Especially in security: all security models are host/client based
• Big Data requires new data-centric models
  – Data location, search, access
  – Data integrity and identification
  – Data lifecycle and variability
  – Data centric (declarative) programming models
  – Data aware infrastructure to support new data formats and data centric programming models
• Data centric security and access control

Defining Big Data Architecture Framework
• Existing attempts address architecture issues in a traditional way: ODCA, TMF, NIST
  – http://bigdatawg.nist.gov/ uploadfiles/BD Vol5 RefArchSurvey V1Draft Prerelease.pdf
• Architecture vs Ecosystem
  – Big Data undergo a number of transformations during their lifecycle
  – Big Data fuel the whole transformation chain
    • Data sources and data consumers, target data usage
  – Multi-dimensional relations between
    • Data models and data driven processes
    • Infrastructure components and data centric services
• Architecture vs Architecture Framework
  – Separates concerns and factors
    • Control and Management functions, orthogonal factors
  – Architecture Framework components are inter-related

Big Data Architecture Framework (BDAF) (1)
(1) Data Models, Structures, Types
  – Data formats, non/relational, file systems, etc.
(2) Big Data Management
  – Big Data Lifecycle (Management) Model
    • Big Data transformation/staging
  – Provenance, Curation, Archiving
(3) Big Data Analytics and Tools
  – Big Data Applications
    • Target use, presentation, visualisation
(4) Big Data Infrastructure (BDI)
  – Storage, Compute, (High Performance Computing,) Network
  – Sensor network, target/actionable devices
  – Big Data Operational support
(5) Big Data Security
  – Data security at rest and in motion, trusted processing environments

Big Data Architecture Framework (BDAF) – Aggregated – Relations between components (2)
[Table: relation matrix between BDAF components. Col: Used By. Row: Requires This. The cell values were not preserved in the transcript.]
Rows and columns:
• Data Models & Structures
• Data Management & Lifecycle
• Big Data Infrastructure & Operations
• Big Data Analytics & Applications
• Big Data Security

Big Data Ecosystem: General BD Infrastructure
Data Transformation, Data Management
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure
• Collaborative Environment (user/groups managements)
• Advanced high performance (programmable) network
• Security infrastructure
[Figure labels recovered: Big Data Source/Origin (sensor, experiment, log data, behavioral data); Data Delivery, Visualisation; Consumer; Big Data Target/Customer: Actionable/Usable Data; Target users, processes, objects, behavior; High Performance Computer Clusters; analytics DB, In memory, operational; Data Management, data categories: metadata, (un)structured, (non)identifiable; Intercloud multi-provider heterogeneous Infrastructure; Security Infrastructure; Network monitoring]

Big Data Infrastructure and Analytics Tools
Big Data Infrastructure
• Heterogeneous multi-provider inter-cloud infrastructure
• Data management infrastructure
• Collaborative Environment (user/groups managements)
• Advanced high performance (programmable) network
• Security infrastructure
Big Data Analytics
• High Performance Computer Clusters (HPCC)
• Analytics/processing: Real-time, Interactive, Batch, Streaming
• Big Data Analytics tools and applications

Data Lifecycle/Transformation Model
Common Data Model?
• Data Variety and Variability
• Semantic Interoperability
Data (inter)linking?
• PID/OID
• ORCID
• Identification
• Privacy, Opacity
• Does the Data Model change along the lifecycle or data evolution?
• Identifying and linking data
  – Persistent identifiers
  – Data ownership
  – Traceability vs Opacity
  – Referral integrity
[Figure labels recovered: Data Source; Data Model (1); Data Model (1); Data Model (3); Data Model (4); Data Analytics Application; Data Delivery, Visualisation; Consumer; Data repurposing, Analytics re-factoring, Secondary processing]
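
The questions on this slide about persistent identifiers, traceability and referral integrity can be illustrated with a minimal provenance chain. This is a sketch under assumptions: UUIDs stand in for real persistent identifiers (PID/DOI), the stage names are placeholders rather than the stages in the slide's figure, and `DataRecord` and `Registry` are hypothetical names.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataRecord:
    pid: str                   # stand-in for a persistent identifier
    stage: str                 # lifecycle stage that produced this data
    parent_pid: Optional[str]  # link to the data it was derived from


class Registry:
    """Keeps every record resolvable so that links never dangle."""

    def __init__(self) -> None:
        self._records: Dict[str, DataRecord] = {}

    def register(self, stage: str,
                 parent: Optional[DataRecord] = None) -> DataRecord:
        # Referral integrity: a derived record may only point at a known parent.
        if parent is not None and parent.pid not in self._records:
            raise KeyError("parent PID is not registered")
        record = DataRecord(
            pid=str(uuid.uuid4()),
            stage=stage,
            parent_pid=parent.pid if parent else None,
        )
        self._records[record.pid] = record
        return record

    def trace(self, pid: str) -> List[str]:
        """Traceability: walk the links back to the original source."""
        stages: List[str] = []
        current: Optional[str] = pid
        while current is not None:
            record = self._records[current]
            stages.append(record.stage)
            current = record.parent_pid
        return list(reversed(stages))


if __name__ == "__main__":
    registry = Registry()
    source = registry.register("source")
    analysed = registry.register("analytics", source)
    delivered = registry.register("delivery", analysed)
    print(registry.trace(delivered.pid))
```

Opacity, the counterpart named on the slide, would mean deliberately limiting what `trace` reveals; that is left out of this sketch.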

Evolutional/Hierarchical Data Model
[Figure labels recovered: Actionable Data; Papers/Reports; Archival Data; Usable Data; ORCID; Processed Data (for target use); PID/DOI; Classified/Structured Data; Raw Data]
Topics for discussion, research and standardisation
• Common Data Model?
• Data interlinking?
• Fits to Graph data type?
• Met
