Evolution Of Database Technology - Centrum Wiskunde & Informatica


Database Technology
Erwin M. Bakker & Stefan Manegold
https://homepages.cwi.nl/~manegold/DBDM/
http://liacs.leidenuniv.nl/
bakker@liacs.leidenuniv.nl

Evolution of Database Technology
- 1960s: (Electronic) data collection, database creation, IMS (hierarchical database system by IBM) and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)

Evolution of Database Technology
- 1990s: Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology: data integration, XML; social networks (Facebook, etc.); cloud computing, global information systems
  - Emerging in-house solutions: in-memory databases, Big Data

1960's
- Companies began automating their back-office bookkeeping in the 1960s.
- COBOL and its record-oriented file model were the work-horses of this effort.
- Typical work-cycle:
  1. a batch of transactions was applied to the old-tape-master,
  2. a new-tape-master was produced,
  3. a printout was made for the next business day.
- COBOL: COmmon Business-Oriented Language (COBOL 2002 standard)
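To make the batch work-cycle concrete, here is a minimal sketch in Python (the file names and the account/amount record layout are illustrative assumptions, not from the slides):

    # Sketch of the 1960s batch work-cycle: sequentially merge a sorted
    # batch of transactions into the old tape master, producing a new
    # tape master. CSV files stand in for tapes; the layout is assumed.
    import csv

    def update_master(old_master, transactions, new_master):
        with open(old_master) as om, open(transactions) as tx, \
             open(new_master, "w", newline="") as nm:
            out = csv.writer(nm)
            txns = list(csv.reader(tx))    # (account, amount), sorted by account
            i = 0
            for account, balance in csv.reader(om):  # also sorted by account
                balance = float(balance)
                # Apply every transaction for this account, in tape order.
                while i < len(txns) and txns[i][0] == account:
                    balance += float(txns[i][1])
                    i += 1
                out.writerow([account, f"{balance:.2f}"])  # new master record

    # update_master("master.old", "batch.csv", "master.new")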

COBOL
A quote by Prof. dr. E.W. Dijkstra (Turing Award 1972), 18 June 1975:
"The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offence."
September 2015:

COBOL Code (just an example!)

    01  LOAN-WORK-AREA.
        03  LW-LOAN-ERROR-FLAG   PIC 9(01)        COMP.
        03  LW-LOAN-AMT          PIC 9(06)V9(02)  COMP.
        03  LW-INT-RATE          PIC 9(02)V9(02)  COMP.
        03  LW-NBR-PMTS          PIC 9(03)        COMP.
        03  LW-PMT-AMT           PIC 9(06)V9(02)  COMP.
        03  LW-INT-PMT           PIC 9(01)V9(12)  COMP.
        03  LW-TOTAL-PMTS        PIC 9(06)V9(02)  COMP.
        03  LW-TOTAL-INT         PIC 9(06)V9(02)  COMP.
    *
    004000-COMPUTE-PAYMENT.
    *
        MOVE 0 TO LW-LOAN-ERROR-FLAG.
        IF (LW-LOAN-AMT = ZERO)
           OR (LW-INT-RATE = ZERO)
           OR (LW-NBR-PMTS = ZERO)
            MOVE 1 TO LW-LOAN-ERROR-FLAG
            GO TO 004000-EXIT.
        COMPUTE LW-INT-PMT = LW-INT-RATE / 1200
            ON SIZE ERROR
            MOVE 1 TO LW-LOAN-ERROR-FLAG
            GO TO 004000-EXIT.

1970's
- Online databases: the transition from handling transactions in daily batches to systems that managed an on-line database that captured transactions as they happened.
- At first these systems were ad hoc.
- Late in the 60's, "network" and "hierarchical" database products emerged.
- A network data model standard was defined by the Database Task Group (DBTG), which formed the basis for most commercial systems during the 1970's.
- In 1980, DBTG-based Cullinet was the leading software company.

Network Model
- Hierarchical model: a tree of records, with each record having one parent record and many children.
- Network model: each record can have multiple parent and child records, i.e., a lattice of records.

Historical Perspective
DBTG problems:
- DBTG used a procedural language that was low-level, record-at-a-time.
- The programmer had to navigate through the database, following pointers from record to record.
- If the database was redesigned, then all the old programs had to be rewritten.
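A minimal sketch in Python (with hypothetical record types, not from the slides) of what such record-at-a-time navigation looks like; note that the Order record has two parents, which a strict hierarchy cannot express, and that the access code breaks whenever the pointer layout changes:

    # Toy network-model records: an Order has two parents (a Customer
    # and a Part), so the records form a lattice rather than a tree.
    class Record:
        def __init__(self, key, **fields):
            self.key, self.fields = key, fields
            self.children = []            # pointers the programmer must chase

    customer = Record("C1", name="ACME")
    part     = Record("P7", name="bolt")
    order    = Record("O42", qty=100)
    customer.children.append(order)       # parent 1
    part.children.append(order)           # parent 2

    # DBTG-style access: navigate record-at-a-time from a known entry point.
    def orders_of(cust):
        for child in cust.children:               # follow the pointer chain
            yield child.key, child.fields["qty"]  # breaks on any redesign

    print(list(orders_of(customer)))              # [('O42', 100)]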

The "relational" data modelThe "relational" data model, by Ted Codd (Turing Award 1981) in hislandmark 1970 article “A Relational Model of Data for Large SharedData Banks", was a major advance over DBTG. The relational model unified data and metadata only one formof data representation. A non-procedural data access language based on algebra or logic. The data model is easier to visualize and understand than thepointers-and-records-based DBTG model. Programs written in terms of the "abstract model" of the data,rather than the actual database design programs insensitive to changes in the database design.9/21/18Databases and Data Mining10

The "relational" data model success Both industry and university research communities embracedthe relational data model and extended it during the 1970s. It was shown that a high-level relational database querylanguage could give performance comparable to the bestrecord-oriented database systems. (!) This research produced a generation of systems and peoplethat formed the basis for IBM's DB2, Ingres, Sybase, Oracle,Informix and others.9/21/18Databases and Data Mining11

The "relational" data model successSQL The SQL relational database language wasstandardized between 1982 and 1986. By 1990, virtually all database systemsprovided an SQL interface (includingnetwork, hierarchical and object-orienteddatabase systems).9/21/18Databases and Data Mining12

Ingres at UC Berkeley in 1972
Stonebraker (Turing Award 2014), Rowe, Wong, and others built:
- a relational database system,
- a query language (QUEL),
- relational optimization techniques,
- storage strategies,
- work on distributed databases.
Further work on:
- database inference,
- active databases (automatic responses),
- extensible databases.
Descendants: Ingres from Computer Associates and PostgreSQL.

IBM: System R
Codd's relational model was very controversial:
- too simplistic,
- could never give good performance.
A 10-person IBM Research effort prototyped a relational system: System R (which evolved into the DB2 product).
System R defined the fundamentals of:
- query optimization,
- data independence (views),
- transactions (logging and locking),
- security (the grant-revoke model).
Note: SQL from System R became more or less the standard.
The System R group's further research:
- distributed databases (R*),
- object-oriented extensible databases (Starburst).

The Big New Database Applications of the 1990's
- EOSDIS (Earth Observing System Data and Information System)
- Electronic Commerce
- Health-Care Information Systems
- Digital Publishing
- Collaborative Design

EOSDIS (Earth Observing System Data and Information System)
Challenges:
- On-line access to petabyte-sized databases and managing tertiary storage effectively.
- Supporting thousands of consumers with a very heavy volume of information requests, including ad-hoc requests and standing orders for daily updates.
- Providing effective mechanisms for browsing and searching for the desired data.

Electronic Commerce
- Heterogeneous information sources must be integrated. For example, something called a "connector" in one catalog may not be a "connector" in a different catalog: "schema integration" is a well-known and extremely difficult problem.
- Electronic commerce needs:
  - reliable distributed authentication,
  - funds transfer.

Health-Care Information Systems
Transforming the health-care industry to take advantage of what is now possible will have a major impact on costs, and possibly on quality and ubiquity of care as well.
Problems to be solved:
- Integration of heterogeneous forms of legacy information.
- Access control to preserve the confidentiality of medical records.
- Interfaces to information that are appropriate for use by all health-care professionals.

Digital Publishing
- Management and delivery of extremely large bodies of data at very high rates. Typical data consists of very large objects in the megabyte to gigabyte range (1990's).
- Delivery with real-time constraints.
- Protection of intellectual property, including cost-effective collection of small payments and inhibitions against reselling of information.
- Organization of and access to overwhelming amounts of information.

The Information Superhighway
Databases and database technology will play a critical role in this information explosion. Already webmasters (administrators of World-Wide-Web sites) are realizing that they are database administrators.

Support for Multimedia Objects (1990's)
- Tertiary storage (for petabyte storage):
  - tape silos,
  - disk juke-boxes.
- New data types:
  - the operations available for each type of multimedia data, and the resulting implementation tradeoffs;
  - the integration of data involving several of these new types.
- Quality of service: timely and realistic presentation of the data? Gracefully degrading service? Can we interpolate or extrapolate some of the data? Can we reject new service requests or cancel old ones?
- Content-based retrieval.
- User interface support.

Conclusions of/for the DB Community
The database research community has a foundational role in creating the technological infrastructure from which database advancements evolve.
- A new research mandate because of the explosions in hardware capability, hardware capacity, and communication (including the internet or "web" and mobile communication).
- The explosion of digitized information requires the solution of significant new research problems:
  - support for multimedia objects and new data types,
  - distribution of information,
  - new database applications,
  - workflow and transaction management,
  - ease of database management and use.

New Research Directions (1990's)
- Problems associated with putting multimedia objects into DBMSs: new data types.
- Problems involving new paradigms for distribution and processing of information.
- New uses of databases:
  - data mining,
  - data warehouses,
  - repositories.
- New transaction models:
  - workflow management,
  - alternative transaction models (long transactions).
- Problems involving ease of use and management of databases.

"One Size Fits All": An Idea Whose Time Has Come and Gone.
M. Stonebraker, U. Cetintemel
Proceedings of the 2005 International Conference on Data Engineering, April 2005
http://www.cs.brown.edu/~ugur/fits_all.pdf

DBMS: "One size fits all."
A single code line with all DBMS services solves the:
- Cost problem: maintenance costs of a single code line.
- Compatibility problem: all applications will run against the single code line.
- Sales problem: it is easier to sell a single code line solution to a customer.
- Marketing problem: a single code line has an easier market positioning than multiple code line products.

Data Warehousing
- Early 1990's: gather together data from multiple operational databases into a data warehouse for business-intelligence purposes.
- Typically 50 or so operational systems, each with an on-line user community who expect fast response time.
- System administrators were (and still are) reluctant to allow business-intelligence users onto the same systems, fearing that the complex ad-hoc queries from these users will degrade response time for the on-line community.
- In addition, business-intelligence users often want to see historical trends, as well as correlate data from multiple operational databases. These features are very different from those required by on-line users.

Data Warehousing
Data warehouses are very different from Online Transaction Processing (OLTP) systems:
- OLTP systems: the main business activity is typically to sell a good or service; optimized for updates.
- Data warehouse: ad-hoc queries, which are often quite complex; periodic loads of new data interspersed with ad-hoc query activity.

Data Warehousing
The standard wisdom in data warehouse schemas is to create a fact table that records the "who, what, when, where" of each operational transaction.
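A minimal star-schema sketch (SQLite from Python; all table and column names are illustrative assumptions): the central fact table references one dimension table per "who, what, when, where" question:

    # Star schema: a central fact table of sales events, surrounded by
    # dimension tables answering who / what / when / where.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_customer(cust_id INTEGER PRIMARY KEY, name TEXT);  -- who
        CREATE TABLE dim_product (prod_id INTEGER PRIMARY KEY, name TEXT);  -- what
        CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);   -- when
        CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT); -- where
        CREATE TABLE fact_sales(
            cust_id  INTEGER REFERENCES dim_customer,
            prod_id  INTEGER REFERENCES dim_product,
            date_id  INTEGER REFERENCES dim_date,
            store_id INTEGER REFERENCES dim_store,
            amount   REAL);
    """)
    # A typical warehouse query: a "star join" of the fact table with the
    # dimensions it needs, followed by an aggregation.
    print(con.execute("""
        SELECT s.city, SUM(f.amount)
        FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
        GROUP BY s.city
    """).fetchall())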

Data Warehousing
- Data warehouse applications run much better using bit-map indexes.
- OLTP (Online Transaction Processing) applications prefer B-tree indexes.
- Materialized views are a useful optimization tactic in data warehousing, but not in OLTP worlds.
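A toy illustration (assumed data, not from the slides) of why bit-map indexes suit low-cardinality warehouse columns: one bit per row per distinct value, so predicates reduce to cheap bitwise operations:

    # Toy bit-map index on a low-cardinality column: one bit vector per
    # distinct value; combining predicates is pure bit-twiddling.
    rows = ["M", "F", "F", "M", "F"]        # assumed column values
    bitmaps = {}
    for i, v in enumerate(rows):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)   # set bit i for value v

    # "WHERE gender = 'F'" is simply the F bitmap; a conjunction with
    # another bitmap-indexed predicate would be a bitwise AND.
    f_bits = bitmaps["F"]
    print([i for i in range(len(rows)) if f_bits >> i & 1])   # [1, 2, 4]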

Data Warehousing
[Figure: example bit-map index over a gender column, with one bit vector for Male and one for Female]
As a first approximation, most vendors have:
- a warehouse DBMS (bit-map indexes, materialized views, star schemas and optimizer tactics for star schema queries), and
- an OLTP DBMS (B-tree indexes and a standard cost-based optimizer),
which are united by a common parser.

Emerging Applications
Some other examples that show why conventional DBMSs will not perform well on the current emerging applications.

Emerging Sensor-Based Applications
- Sensoring an army battalion of 30,000 humans and 12,000 vehicles: on the order of 10^6 sensors
- Monitoring traffic (InfraWatch, 2010)
- Amusement park tags
- Health care
- Library books
- Etc.

Emerging Sensor-Based Applications
- Conventional DBMSs will not perform well on this new class of monitoring applications.
- For example, on Linear Road, traditional solutions are nearly an order of magnitude slower than a special-purpose stream processing engine.

Example: financial-feed processing
Financial institutions subscribe to feeds that deliver real-time data on market activity, specifically:
- news,
- consummated trades,
- bids and asks,
- etc.
Feed providers include, for example:
- Reuters
- Bloomberg
- Infodyne

Example: an existing application: financial-feed processing
Financial institutions have a variety of applications that process such feeds. These include systems that:
- produce real-time business analytics,
- perform electronic trading (2014: high-frequency trading),
- ensure legal compliance of all trades with the various company and SEC rules,
- compute real-time risk and market exposure to fluctuations in foreign exchange rates.
The technology used to implement this class of applications is invariably "roll your own", because no good off-the-shelf system software products exist. (2005)

Example: an existing application: financial-feed processing
Detect problems in streaming stock ticks. Specifically, there are 4500 securities, 500 of which are "fast moving". Late ticks are defined by rules:
- A stock tick on one of the fast securities is late if it occurs more than 5 seconds after the previous tick from the same security.
- The other 4000 symbols are slow moving, and a tick is late if 60 seconds have elapsed since the previous tick.
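A minimal sketch of these two rules in Python (the symbol names and tick format are assumptions; a production feed handler would be event-driven rather than a pair of function calls):

    # Flag late ticks: fast-moving symbols tolerate a 5 s gap between
    # consecutive ticks, all other symbols a 60 s gap.
    FAST = {"IBM", "MSFT"}         # stand-in for the 500 fast movers
    last_seen = {}                 # symbol -> timestamp of previous tick

    def on_tick(symbol, ts):
        """Return True if this tick is late under the rules above."""
        limit = 5.0 if symbol in FAST else 60.0
        prev = last_seen.get(symbol)
        last_seen[symbol] = ts
        return prev is not None and ts - prev > limit

    print(on_tick("IBM", 100.0))   # False: a first tick is never late
    print(on_tick("IBM", 107.0))   # True: 7 s > 5 s for a fast mover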

Stream Processing
[Figure: stream-processing workflow for the late-tick detection example]

Performance
- Implemented in the StreamBase stream processing engine (SPE) [5], a commercial, industrial-strength version of Aurora [8, 13].
- On a 2.8 GHz Pentium processor with 512 MB of memory and a single SCSI disk, the workflow in the previous figure can be executed at 160,000 messages per second before CPU saturation is observed.
- In contrast, StreamBase engineers could only get 900 messages per second using a popular commercial relational DBMS.

Why? Outbound vs. Inbound Processing
[Figure: RDBMS (outbound processing) contrasted with StreamBase (inbound processing)]

Inbound Processing
[Figure: the inbound processing model]

Outbound vs. Inbound Processing
- DBMSs are optimized for outbound processing.
- Stream processing engines are optimized for inbound processing.
- Although it seems conceivable to construct an engine that is optimized for both inbound and outbound processing, such an engine design is clearly a research project.
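A minimal sketch of the two models (illustrative Python, not StreamBase code): outbound processing stores first and answers queries later by scanning storage; inbound processing pushes each arriving message through the standing query and only optionally stores it:

    # Outbound (DBMS-style): store first, query later.
    stored = []

    def outbound(message):
        stored.append(message)                        # write to storage first
        # ...later a client polls with a SELECT-style scan over storage:
        return [m for m in stored if m["price"] > 100]

    # Inbound (SPE-style): the standing query runs on the wire.
    def inbound(message, alert):
        if message["price"] > 100:                    # process before storing
            alert(message)                            # react immediately
        # storage, if any, happens after processing

    inbound({"sym": "IBM", "price": 101.5}, alert=print)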

Other Issues: Correct Primitives for Streams
- SQL systems contain a sophisticated aggregation system, for example a statistical computation over groupings of the records from a table. When processing the last record in the table, the aggregate calculation for each group of records is emitted.
- However, streams can continue forever; there is no notion of "end of table". Consequently, stream processing engines extend SQL with the notion of time windows.
- In StreamBase, windows can be defined based on clock time, number of messages, or breakpoints in some other attribute.
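A minimal sketch of a count-based sliding window in plain Python (not StreamBase syntax): the aggregate is emitted whenever the window is full, instead of waiting for an end-of-table that never comes:

    # Sliding count window: emit the average of the last 3 ticks each
    # time a new tick arrives and the window is full.
    from collections import deque

    WINDOW = 3
    buf = deque(maxlen=WINDOW)    # keeps only the most recent 3 values

    def on_message(price):
        buf.append(price)
        if len(buf) == WINDOW:    # window full: emit the aggregate
            print(f"avg of last {WINDOW}: {sum(buf) / WINDOW:.2f}")

    for p in [10, 12, 11, 13, 9, 14]:
        on_message(p)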

Other Issues: Integration of DBMS Processing and Application Logic (1/2)
- Relational DBMSs were all designed to have client-server architectures.
- In this model, there are many client applications, which can be written by arbitrary people and which are therefore typically untrusted.
- Hence, for security and reliability reasons, these client applications run in a separate address space from the DBMS.

Other Issues: Integration of DBMS Processing and Application Logic (2/2)
- In an embedded processing model, it is reasonable to freely mix:
  - application logic,
  - control logic, and
  - DBMS logic.
- This is what StreamBase does.

Other Issues: High Availability
- It is a requirement of many stream-based applications to have high availability (HA) and stay up 7x24.
- Standard DBMS logging and crash recovery mechanisms are ill-suited for the streaming world.
- The obvious alternative to achieve high availability is to use techniques that rely on Tandem-style process pairs.
- Unlike traditional data-processing applications that require precise recovery for correctness, many stream-processing applications can tolerate and benefit from weaker notions of recovery.

Other Issues: Synchronization
- Traditional DBMSs use ACID transactions between concurrent transactions submitted by multiple users, for example to ensure isolation. (heavy-weight)
- In streaming systems, which are not multi-user, a concept like isolation can be achieved simply by critical sections, which can be implemented through light-weight semaphores.
ACID: Atomicity, Consistency, Isolation (transactions are executed in isolation), Durability.
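A minimal sketch (plain Python, illustrative only) of such a critical section: a light-weight semaphore isolates updates to shared operator state without any transactional machinery:

    # Two threads update shared operator state; a binary semaphore
    # around the read-modify-write gives isolation without ACID.
    import threading

    state = {"count": 0}
    guard = threading.Semaphore(1)        # binary semaphore, i.e., a mutex

    def process(n):
        for _ in range(n):
            with guard:                   # enter the critical section
                state["count"] += 1       # now atomic w.r.t. other threads

    t1 = threading.Thread(target=process, args=(100_000,))
    t2 = threading.Thread(target=process, args=(100_000,))
    t1.start(); t2.start(); t1.join(); t2.join()
    print(state["count"])                 # always 200000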

One Size Fits All?

One Size Fits All? Conclusions
- Data warehouses: store data by column rather than by row; read-oriented.
- Sensor networks: flexible light-weight database abstractions, such as TinyDB; data movement vs. data storage.
- Text search: the standard RDBMS is too heavy-weight and inflexible.
- Scientific databases: multi-dimensional indexing, application-specific aggregation techniques.
- XML: how to store and manipulate XML data.
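To illustrate the first point, a minimal sketch (illustrative Python) of row versus column layout: a read-mostly scan of one attribute touches far less data when that attribute is stored as its own dense array:

    # Row layout vs. column layout for a read-oriented scan.
    rows = [("IBM", 101.5, 300), ("MSFT", 99.0, 200), ("ORCL", 45.2, 500)]

    # Column layout: one array per attribute.
    columns = {
        "sym":   [r[0] for r in rows],
        "price": [r[1] for r in rows],
        "qty":   [r[2] for r in rows],
    }

    row_total = sum(r[1] for r in rows)   # drags every field of every row along
    col_total = sum(columns["price"])     # scans a single dense array
    print(row_total == col_total)         # True, but the column scan reads less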
