Trends And Challenges In Database Development - UiO


Trends and Challenges in Database Development
Ellen Munthe-Kaas
INF5100 Autumn 2006

Database Trends
1. The object/relation battle
2. Web Services
3. Queues and workflows
4. Cubes and online analytic processing
5. Data mining and machine learning
6. Column stores
7. Approximate answers
8. Semi-structured data
9. The Semantic Web
10. Stream and sensor processing
11. Smart objects: Databases everywhere
12. Publish-subscribe
13. Massive memory, massive latency
14. Self managing and always up
A selection of these subjects (and some additional ones) are discussed more thoroughly in the rest of the course.

1. The Object/Relation Battle
• Objects? Relations? Object-Relational!
• The Object-Relational world: Marry programming languages and DBMSs
• ORDBMSs
  – Stored procedures evolve to "real" languages: Java, C#, ... with real object models
  – Encapsulated data: a class with methods
  – Tables are enumerable and indexable record sets with foreign keys
  – Records are vectors of objects
  – Opaque or transparent types
  – Set operators on transparent classes
  – Ends the Inside-DB/Outside-DB dichotomy

Example: Skyserver
• Astronomy data online
• Maps half of the northern sky
• For research on astronomical observations
• Started as an OODB (1995), migrated to ORDB (2002)
    SELECT TOP 1000 g.run, f.field, p.objID
    FROM TARGDR4.PhotoObj p,
         TARGDR4.Field f,
         TARGDR4.Segment g
    WHERE f.fieldid = p.fieldid
      AND f.segmentid = g.segmentid
      AND f.psfWidth_r > 1.2
      AND p.colc > 400.0

Skyserver Data Release 5 (DR5)
• Imaging catalog:
  – Footprint area: 8000 sq. deg.
  – 215 million unique objects
  – Data volume:
    • images: 9.0 TB
    • catalogs (data archive server): 1.8 TB
    • catalogs (SQL database): 3.6 TB
• Spectroscopic catalog: 1,048,960 spectra
  – 5740 sq. deg.
  – 674,749 galaxies
  – 79,394 quasars (redshift < 2.3), 11,217 quasars (redshift > 2.3)
  – 154,925 stars

Skyserver Home Page
http://skyserver.sdss.org/

2. Web Services
• Web service: Software system designed to support interoperable machine-to-machine interaction over a network
  – Web servers and runtime (Apache, IIS, J2EE, .NET) displaced TP monitors and ORBs
  – Web services (SOAP, WSDL, XML) are displacing current brokers
  – DBMS listening to port 80
    • Publishing WSDL, DISCO, WS-Security
    • Servicing SOAP calls
    • DBMS is a web service
  – Basis for distributed systems
  – A consequence of ORDBMS

Example: OpenSkyQuery
• Cross-matching astronomical catalogs
  – 29 archives
  – Spatial data search
  – Raw pixel data live in file servers
  – Catalog data (derived objects) in database
  – Online SQL
  – Based on Web Services
• Also used for education
  – 150 hours of online astronomy
  – Implicitly teaches data analysis

OpenSkyQuery Welcome Page
http://openskyquery.net/Sky/skysite/

Example Query: Brown Dwarf Search
    SELECT o.objId, o.ra, o.dec, o.type, t.objId, t.j_m, o.z
    FROM SDSSDR2:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o, t) < 2.5
      AND Region('CIRCLE J2000 16.031 -0.891 30')
      AND (o.z - t.j_m) > 2

3. Queues and Workflows
• Applications loosely connected via queued messages
• Queues:
  – Supported in all major database systems
    • defining queues
    • queueing and dequeueing messages
    • attaching triggers to queues
  – Basis for publish-subscribe and workflow
• Challenges: How to structure workflows and notifications; characterize design patterns
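
The three queue operations named on this slide (defining a queue, queueing and dequeueing messages, attaching a trigger) can be sketched with an ordinary table. This is only an illustration using SQLite from Python, not the queueing feature of any particular DBMS; the table, trigger, and function names are hypothetical.

```python
import sqlite3

# Hypothetical schema: all table, column and trigger names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE msg_queue (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        body TEXT NOT NULL
    );
    CREATE TABLE notifications (msg_id INTEGER, note TEXT);
    -- Trigger attached to the queue: fires once for every enqueued message.
    CREATE TRIGGER on_enqueue AFTER INSERT ON msg_queue
    BEGIN
        INSERT INTO notifications (msg_id, note)
        VALUES (NEW.id, 'queued: ' || NEW.body);
    END;
""")

def enqueue(body):
    """Append a message to the queue inside a transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO msg_queue (body) VALUES (?)", (body,))

def dequeue():
    """Remove and return the oldest message, or None if the queue is empty."""
    with conn:
        row = conn.execute(
            "SELECT id, body FROM msg_queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM msg_queue WHERE id = ?", (row[0],))
        return row[1]

enqueue("order 1")
enqueue("order 2")
first = dequeue()
notes = [r[0] for r in conn.execute(
    "SELECT note FROM notifications ORDER BY msg_id")]
```

Because the queue is a table, enqueue and dequeue take part in normal transactions, which is what lets loosely coupled applications hand over work reliably.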

4. Cubes and Online Analytical Processing
• OLAP: Approach to quickly provide the answer to analytical queries that are dimensional in nature
  – sales
  – marketing
  – management reporting
  – ...
• Databases for OLAP contain data cubes
  – Data cubes now standard
  – MDX (Multi-Dimensional eXpressions) is very powerful
      SELECT <axis spec> FROM <cube spec> WHERE <slicer spec>
  – Cube stores cohabit with row stores: ROLAP, MOLAP, xOLAP (relational, multidimensional, ... OnLine Analytical Processing)
  – Very sophisticated algorithms
• Challenge: Better ways to query and visualize cubes
[Figure: data cube with dimensions make (Ford, Chevy), year (1990 to 1993), and color (red, white, blue)]
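
A data cube can be read as the result of grouping the same facts by every subset of the dimensions. The sketch below computes such a cube by brute force over a few made-up sales rows; the data, the `ALL` marker, and the function name are assumptions for illustration, and real OLAP engines use far more sophisticated algorithms.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # marks a dimension that has been aggregated away

def build_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims` (2**len(dims) group-bys)."""
    cube = defaultdict(float)
    for r in range(len(dims) + 1):
        for kept in combinations(dims, r):
            for row in rows:
                key = tuple(row[d] if d in kept else ALL for d in dims)
                cube[key] += row[measure]
    return dict(cube)

# Hypothetical facts, not taken from the slide.
sales = [
    {"make": "Chevy", "year": 1991, "color": "red", "units": 5},
    {"make": "Chevy", "year": 1991, "color": "white", "units": 3},
    {"make": "Ford", "year": 1992, "color": "red", "units": 4},
]
cube = build_cube(sales, ["make", "year", "color"], "units")

chevy_total = cube[("Chevy", ALL, ALL)]  # slice: all Chevy sales
red_total = cube[(ALL, ALL, "red")]      # slice: all red cars
grand_total = cube[(ALL, ALL, ALL)]      # apex of the cube
```

Each lookup is a slice of the cube: fixing some dimensions and aggregating over the rest.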

5. Data Mining and Machine Learning
• Tasks: Classification, association, prediction
• Tools: Decision trees, Bayesian networks, Apriori, clustering, regression, neural nets, ...
• Now unified with DBs
  – Create table T(x,y,z,a,b,c)
    Learn "a,b,c" from "x,y,z" using an algorithm
  – Train T with data
  – Then can ask:
    • Probability (?x,?y,?z,?a,?b,?c)
    • Probability (x,y,z,?a,?b,?c)
  – Example: Learn height from age
• Challenge: Better learning algorithms

Data ...
• Data mining
  – Process of automatically searching large volumes of data for patterns
  – Applies computational techniques from statistics, information retrieval, machine learning and pattern recognition
• Data farming
  – Process of using a high performance computer or computing grid to run a simulation thousands or millions of times across a large parameter and value space
  – Result is a "landscape" of output that can be analyzed for trends, anomalies, and insights in multiple parameter dimensions
• Data warehouse
  – Collection of computerised data organised to optimally support reporting and analysis activity
• Data mart
  – Specialized version of a data warehouse for specific user groups or needs

Data Mining – Database Synergy
• Create the model: Learn height from gender and age
    CREATE MINING MODEL HeightFromAgeSex (
        ID long key,
        Gender text discrete,
        Age long continuous,
        Height long continuous PREDICT)
    USING Decision Trees
• Train a data mining model: Database verbs to drive Modeler
    INSERT INTO Height
    SELECT ID, Gender, Age, Height
    FROM People
• Predict height from model: Probabilistic reasoning
    SELECT height, PredictProbability(height)
    FROM Height PREDICTION JOIN New
      ON New.Gender = Height.Gender
     AND New.Age = Height.Age
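
The create, train, and predict steps on this slide can be mimicked outside a DBMS. The sketch below deliberately swaps the slide's decision trees for a much simpler technique, one least-squares line per gender, just to show the flow; the data and function names are made up.

```python
def train(rows):
    """Fit height = intercept + slope * age per gender (ordinary least squares).

    Assumes each gender has at least two rows with different ages.
    """
    by_gender = {}
    for gender, age, height in rows:
        by_gender.setdefault(gender, []).append((age, height))
    model = {}
    for gender, points in by_gender.items():
        n = len(points)
        mean_age = sum(a for a, _ in points) / n
        mean_height = sum(h for _, h in points) / n
        sxx = sum((a - mean_age) ** 2 for a, _ in points)
        sxy = sum((a - mean_age) * (h - mean_height) for a, h in points)
        slope = sxy / sxx
        model[gender] = (mean_height - slope * mean_age, slope)
    return model

def predict(model, gender, age):
    """Predict height for a new (gender, age) pair."""
    intercept, slope = model[gender]
    return intercept + slope * age

# Hypothetical training rows: (gender, age in years, height in cm).
people = [("F", 4, 100), ("F", 8, 126), ("M", 4, 102), ("M", 8, 130)]
model = train(people)
height_f6 = predict(model, "F", 6)
```

Unlike `PredictProbability` on the slide, this sketch returns only a point estimate, with no probability attached.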

6. Column Stores
• Universal relations: Users see fat base tables
• Ex. LDAP
  – 7 required, thousands of measured attributes
• Conceptually simple, but use only some columns
• To avoid reading useless data,
  – do vertical partitions
  – define 10% popular columns index
  – make many skinny indices
  – query engine uses covering index
  – much faster read, slower insert/update
• Column stores automate this
• Challenge: Automate design
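
The trade-off on this slide (much faster read, slower insert/update) shows up even in a toy layout. The sketch below pivots row records into one list per attribute; the attribute names and records are hypothetical, and a real column store adds compression, indexing, and much more.

```python
def to_columns(records):
    """Pivot row dicts into one list per attribute (same keys assumed)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def append_row(columns, record):
    """An insert has to touch every column, which makes writes slower."""
    for name, values in columns.items():
        values.append(record.get(name))

# Hypothetical directory entries.
records = [
    {"uid": 1, "name": "ann", "dept": "ifi", "phone": "111"},
    {"uid": 2, "name": "bob", "dept": "math", "phone": "222"},
]
columns = to_columns(records)

# A query on one attribute reads a single list and skips the other columns.
depts = columns["dept"]

append_row(columns, {"uid": 3, "name": "eve", "dept": "ifi", "phone": "333"})
```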

7. Approximate Answers
• "Messy" data types: Text, time, space
  – Integrating programming languages with DBMS allows adding data types and libraries for indexing and accessing such data
  – Approximate answers
  – Probabilistic reasoning
  – No clear algebra

8. Semi-Structured Data
"Cyberspace is a giant XML document: XQuery for manipulation"
• Not all data fits into the relational model
  – XML – eXtensible Markup Language
• File directories are becoming databases
  – Pivot on any attribute
  – Folders are standing queries
  – Freetext and schema search (better precision/recall)
"Structure: YES! Semi-structured: NO!"

9. The Semantic Web
• Today's World Wide Web content:
  – Designed for humans to read
  – Can be parsed for layout and routine processing
  – Data hidden in HTML files: Useful in some contexts, but useless in others
  – Consequence: Low precision
    • Ex.: Search for birds of family Panurus, only knowing its English name
• How to obtain more precision?
  – Adding semantics to the World Wide Web

Adding Semantics to WWW
• Documents "marked up" with semantic information
  – Extension of HTML meta tags
    • Machine-readable information (metadata) about human-readable content of the document
    • "Pure" metadata representing a set of facts
• Common metadata vocabularies (ontologies)
  – For marking up documents in an agreed way
• Automated agents
  – Perform tasks for users of the Semantic Web
  – Use provided metadata
• Web-based services
  – To supply information specifically to agents
    • E.g., a Trust service: Has an online store a history of poor service or spamming?
Database support needed!

Standards and Tools
• URI – Uniform Resource Identifier
  – For identifying resources uniquely
• XML – eXtensible Markup Language
  – Surface syntax for structured documents
  – No semantic constraints on the documents
• XMLS – XML Schema
  – Language for restricting the structure of XML documents
• RDF – Resource Description Framework
  – Simple data model for referring to resources and how they are related
  – An RDF-based model can be represented in XML syntax
• RDFS – RDF Schema
  – Vocabulary for describing properties and classes of RDF resources
  – Data model for class hierarchies
• OWL – Web Ontology Language
  – Vocabulary for describing further class and relationship properties
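
The RDF data model describes resources through (subject, predicate, object) statements. The sketch below stores a few such triples in plain Python and answers the earlier bird question by pattern matching; the `ex:` identifiers are hypothetical placeholders rather than real URIs, and no RDF library or standard query language is involved.

```python
# Hypothetical triples: (subject, predicate, object).
triples = {
    ("ex:bird1", "rdf:type", "ex:Bird"),
    ("ex:bird1", "ex:englishName", "Bearded Tit"),
    ("ex:bird1", "ex:genus", "Panurus"),
    ("ex:bird2", "rdf:type", "ex:Bird"),
    ("ex:bird2", "ex:englishName", "Great Tit"),
    ("ex:bird2", "ex:genus", "Parus"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    pattern = (s, p, o)
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )

# Knowing only the English name, find the resource, then follow its genus link.
subject = match(p="ex:englishName", o="Bearded Tit")[0][0]
genus = match(s=subject, p="ex:genus")[0][2]
```

Because the name and the genus are separate machine-readable facts about the same resource, the lookup does not depend on how a web page happens to phrase them.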

Example: Museo Suomi
• The Portal MuseumFinland: Finnish museums on the Semantic Web
  – Making cultural collections available and semantically interoperable through WWW
http://www.museosuomi.fi/

Topic Maps
• Standard for representation and interchange of information
  – Provides a model and grammar for representing the structure of information resources
• Emphasis on findability of the information
• XTM – XML Topic Maps
  – XML-based interchange syntax
• Not a language for providing formal ontologies like RDF and OWL
  – Deliberately supports inconsistencies!

Example: Apollon
• University of Oslo's popular science magazine
  – Paper version four times a year
  – Web resource
• Semantic portal using Topic Map technology
• Associative links for easy cross-article, topic-based browsing and search

Apollon Portal
http://www.apollon.uio.no/

10. Stream and Sensor Processing
• Data generated by instruments that monitor the environment
  – Need to process/analyze streams of data
  – Traditionally: Query large amounts of facts
  – Streams: Large amounts of queries on each new fact

Streams
• Implications:
  – New aggregation operators
  – New programming style
  – Streams in products:
    • Queries represented as records
    • New query optimizations
• Lots of challenges
  – Data structures, query operators, execution environments are qualitatively different from classical DBMS architectures
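
One example of an aggregation operator suited to streams is a sliding-window average, updated incrementally as each new fact arrives instead of being recomputed over stored data. The class below is a minimal sketch; its name and the window size are illustrative choices.

```python
from collections import deque

class SlidingAverage:
    """Windowed aggregate: mean of the most recent `size` readings."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def push(self, value):
        """Consume one new reading and return the current window mean."""
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # reading about to be evicted
        self.window.append(value)         # deque drops the oldest if full
        self.total += value
        return self.total / len(self.window)

avg = SlidingAverage(3)
means = [avg.push(v) for v in [1, 2, 3, 4]]
```

Each reading costs constant work, which matters when the stream never ends and the data cannot be stored and queried afterwards.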

Sensor Networks
[Figure: sensor network showing a base station (gateway) and motes (sensors)]

Sensor Network Characteristics
• Autonomous nodes
  – Small, low-cost, low-power, multifunctional
  – Sensing, data processing, and communicating components
• Sensor network is composed of a large number of sensor nodes (motes, smart dust)
  – Proximity to physical phenomena
    • Deployed inside the phenomenon or very close to it
    • Monitoring and collecting physical data
  – Streams of data
• No human interaction for weeks at a time
  – Long-term, low-power nature

Sensor Data Harvesting
• Optimize wrt. power and bandwidth
  – Push queries out to sensors
    • Moving intelligence to the periphery of the network
    • Every mote and smart dust a small database in itself
  – Aggregate results during data collection
• Much more dynamic query optimization strategies needed
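
Aggregating results during data collection means each mote forwards one partial result instead of relaying every raw reading. The sketch below computes a network-wide average from (sum, count) pairs merged up a routing tree; the topology, mote names, and readings are invented for illustration.

```python
# Hypothetical routing tree (parent -> children) and one reading per mote.
children = {"gateway": ["m1", "m2"], "m1": ["m3", "m4"]}
readings = {"m1": 20.0, "m2": 22.0, "m3": 21.0, "m4": 25.0}

def partial_aggregate(node):
    """Return (sum, count) for the subtree under `node`.

    Each node sends a single pair to its parent, so the traffic per link
    stays constant no matter how many motes sit below it.
    """
    total, count = (readings[node], 1) if node in readings else (0.0, 0)
    for child in children.get(node, []):
        child_total, child_count = partial_aggregate(child)
        total += child_total
        count += child_count
    return total, count

total, count = partial_aggregate("gateway")
average = total / count
```

Sum and count merge cleanly in this way; an aggregate such as the median does not, which is one reason the optimization strategies have to be more dynamic.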

11. Smart Objects: Databases Everywhere
• Phones, PDAs, cameras, ... have small DBs
  – even motes
  – and smart dust?
• Disk drives have enough CPU, memory to run a full-blown DBMS
• All these devices want/need to share data
• Need a simple-but-complete DBMS
  – They need an "Esperanto": a data exchange language and paradigm
• Billions of clients, millions of servers

12. Publish-Subscribe
• Data with many users
  – Data warehouses collect vast data archives and publish subsets to special interest group data-marts
  – Replicas for availability and/or performance
  – Mobile users do local updates, synchronize later
• Publish-subscribe model:
  – Custom subscriptions installed at the warehouse
  – Real-time notification

Publish-Subscribe and Stream Processing
• Compare publish-subscribe & stream processing systems:
  – Millions of standing queries (subscriptions) compiled into dataflow graph
  – At arrival of new data, incrementally evaluate dataflow graph
• Challenge:
  – Support more sophisticated standing queries
  – Better optimization techniques
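
The idea of standing queries evaluated when data arrives can be shown with a toy broker. The sketch below only indexes subscriptions by topic and tests each predicate on arrival, which is far simpler than compiling millions of subscriptions into a dataflow graph; the class and topic names are hypothetical.

```python
from collections import defaultdict

class Broker:
    """Toy publish-subscribe broker with predicate subscriptions."""

    def __init__(self):
        # topic -> list of (predicate, callback) standing queries
        self.by_topic = defaultdict(list)

    def subscribe(self, topic, predicate, callback):
        """Install a standing query for one topic."""
        self.by_topic[topic].append((predicate, callback))

    def publish(self, topic, item):
        """Evaluate only this topic's subscriptions; return deliveries made."""
        delivered = 0
        for predicate, callback in self.by_topic.get(topic, []):
            if predicate(item):
                callback(item)
                delivered += 1
        return delivered

broker = Broker()
alerts = []
broker.subscribe("temperature", lambda t: t > 30, alerts.append)
broker.subscribe("temperature", lambda t: t < 0, alerts.append)

hot = broker.publish("temperature", 35)   # matches the first subscription
mild = broker.publish("temperature", 20)  # matches neither
other = broker.publish("humidity", 99)    # no subscribers for this topic
```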

13. Massive Memory, Massive Latency
• 2005: RAM costs 100k - 300k per TByte
• Main-memory databases!
• Latency a problem
  – TByte RAM scan: minutes
  – TByte disk scan: hours
  – Database engines need to overhaul their algorithms
• NUMA latency a problem
• Challenge: Algorithms for massive main memory
[Figure: Storage Price vs Time. Megabytes per kilo-Euro (log scale, 1E-4 to 1E+4) plotted against year (1980 to 2010)]
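
The "minutes" and "hours" figures follow from dividing data size by scan bandwidth. The bandwidth values below are assumptions chosen only to show the arithmetic (the slide does not state them); with other hardware the numbers shift, but the gap of roughly two orders of magnitude remains.

```python
TBYTE = 10**12  # bytes

def scan_seconds(size_bytes, bytes_per_second):
    """Time to read `size_bytes` once at a sustained sequential rate."""
    return size_bytes / bytes_per_second

# Assumed sustained bandwidths, not taken from the slide.
RAM_BANDWIDTH = 5 * 10**9     # 5 GB/s
DISK_BANDWIDTH = 100 * 10**6  # 100 MB/s

ram_minutes = scan_seconds(TBYTE, RAM_BANDWIDTH) / 60
disk_hours = scan_seconds(TBYTE, DISK_BANDWIDTH) / 3600
```

Under these assumptions a full RAM scan takes a little over three minutes and a full disk scan close to three hours, so even a main-memory database cannot afford algorithms that scan everything.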

14. Self Managing and Always Up
• No DBAs for cell phones or cameras
  – nor for panel heaters, washing machines, ...
• Self* is the ...
  – Self-organizing
  – Self-healing
  – Self-...
• Requires a modular software architecture
  – Clear and simple knobs on modules
  – Software manages these knobs

But What Happened to the Classical Databases?
• Classical databases are alive and kicking!
• Classical requirements are still valid!
  – Persistent data management
  – Concurrency control
  – Availability / fault tolerance
  – Ad-hoc queries
  – Data integration
  – Logical and physical data independence
  – Data consistency
  – Data security
  – Distribution
  – Performance
  – Extensibility
  – Cost effectiveness
  – Simple manageability
• Classical database application domains and classical database functionality do NOT disappear
[Figure: layered diagram; recoverable labels: OPERATING SYSTEM, DATABASE]

Classical Application Domains
• Traditional database technology emerged from bookkeeping and ...
  – [...]s data processing
  – Production management
  – CASE
  – CAD/CAM
  – ERP
  – CRM
  – ...
• Relational databases stand firm

Emerging Application Domains
• Data warehousing
• OLAP
• Data mining
• GIS
• eCommerce
• Multimedia databases
• ...
• Mobile databases
• Scientific ...
• ...y
• Seismology
• Meteorology
• Music

Database Trend Summary According to Subject
1. Database system implementation
2. Database model
3. Interaction model
4. Data accessibility
5. Data yellowpaging
6. Accessing the data
7. Data processing
8. Database integration

1. Database System Implementation
More efficient DBMSs
• Column stores
  – Dealing with sparse data in an efficient way
• Stream and sensor processing
  – Dealing with severe power, bandwidth, and memory restrictions
  – Dealing with evolving data
• Massive memory, massive latency
  – Dealing with new memory and disk technologies

2. Database Model
Data model paradigm. Interaction model
• The object/relation battle
  – The object-relational model
  – "Real" programming languages
• Approximate and probabilistic reasoning
  – Models for data that cannot be expected to provide exact answers
• Stream and sensor processing
  – Model for evolving data
• Data warehouses and data marts
  – Multidimensional model
  – Data organized for optimally supporting reporting and analysis
• Smart objects: Databases everywhere
  – Common data model
• Cubes and OLAP
  – Multidimensional data model
  – Data organized for fast complex analytical and ad-hoc queries
• Semi-structured data
  – Liberation from the O/R model
• Queues and workflows
  – Workflows rather than RPC style interaction model
• Publish-subscribe
  – Data dispatched according to a push model

3. Data Accessibility
Preparing data for access: making sure relevant data is nearby
• Web services
  – Realizing federated heterogeneous systems
• Queues and workflows
  – Using queues to obtain more loosely coupled systems
• Data warehouses and data marts
  – Collections of derived data
• The Semantic Web
  – "Explaining" data through metadata

4. Data Yellowpaging
What kinds of data are available
How to obtain it
• Web services
  – Announcing available data
  – Describing how to obtain it
• The Semantic Web
  – Ontologies describing data semantically

5. Accessing the Data
Getting hold of the data
• Queues and workflows
  – Asynchronous communication
  – Realizing delay-tolerant networks
• The Semantic Web
  – Automated agents performing tasks for users
  – Web-based services supplying info to agents
• Publish-subscribe
  – Dispatching evolving data

6. Data Processing
Climbing the value chain: from data to information to knowledge to wisdom
• Cubes and OLAP
  – Fast data analysis
  – For better business decisions
• Data mining and machine learning
  – Knowledge discovery in databases
• Approximate answers
  – Text retrieval, spatiotemporal data analysis
  – Approximate and probabilistic reasoning
• The Semantic Web
  – Obtaining higher precision and quality on data retrieval
• Data farming
  – Simulations for better analysis

7. Database Integration
Utilizing database functionality in larger systems
• Queues and workflows
  – Supporting business processes
  – Part of ERP systems
• Data warehouses and data marts
• Self managing and always up
  – Embedded databases
  – Simplifying life for the uninformed user

Literature
• Jim Gray: The Next Database Revolution, Proc. 2004 ACM SIGMOD International Conference on Management of Data
  (Available through the ACM Digital Library; cf. http://x-port.uio.no)
