0 Introduction - Indiana University Bloomington

Transcription

B561 Advanced DatabaseConcepts§0 IntroductionQin Zhang1-1

Self introduction: my research interests Algorithms for Big Data:streaming/sketching algorithms;algorithms on distributed data;I/O-efficient algorithms;data structures; Complexity:communication complexity.I am a theoretician, and occasionally work ondatabases and data mining2-1

Self introduction: my research interests Algorithms for Big Data:streaming/sketching algorithms;algorithms on distributed data;I/O-efficient algorithms;data structures; Complexity:communication complexity.I am a theoretician, and occasionally work ondatabases and data miningYou may ask: “why do you teach databases(and probably make our lives harder)”?2-2

Self introduction: my research interests Algorithms for Big Data:streaming/sketching algorithms;algorithms on distributed data;I/O-efficient algorithms;data structures; Complexity:communication complexity.I am a theoretician, and occasionally work ondatabases and data miningYou may ask: “why do you teach databases(and probably make our lives harder)”?Hope you will not ask me again after this course :)I am learning together with you.2-3

What does a typicalundergrad databasecourse cover?3-1

How to represent data?How to represent the data in the computer?TitleStar WarsMighty DucksWayne’s ightyDucks,1991,.Wayne’sWorld,1992,.

How to represent data?How to represent the data in the computer?TitleStar WarsMighty DucksWayne’s ,.MightyDucks,1991,.Wayne’sWorld,1992,.

How to operate on data?Given the data, say, a set of tables, how to answer queries?Difficulty: Queries may depend crucially on the data in all hiStockPrice256515CountryUSAJapanJapanQ: Find all products under price 200 manufactured in Japan?5-1

How to operate on data? oWorksCanonHitachiCompany ix.PName, x.PriceProduct x, Company yx.Manufacturer y .CNamey .Country ‘Japan’x.Price 200StockPrice256515CountryUSAJapanJapan Relational AlgebraπPName, Price(σPrice 200 Country ‘Japan’ (Product 1Manufacturer CName Company))6-1

How to speed up the operation?How to speed up the operation?Relational operations can sometimes be computed muchfaster if we have precomputed a suitable data structure onthe data. This is called Indexing.7-1

How to speed up the operation?How to speed up the operation?Relational operations can sometimes be computed muchfaster if we have precomputed a suitable data structure onthe data. This is called Indexing.Most notably, two kinds of index structures are essential todatabase performance:1. B-trees.2. External hash tables.For example, hash tables may speed up relationaloperations that involve finding all occurrences in a relationof a particular value.7-2

How to make a good operation plan?How to optimize the orders of the operations?R(A, B, C , D), S(E , F , G )Find all pairs (x, y ), x R, y S such that(1) x.D y .E , (2) x.A 5 and (3) y .G 9σA 5 G 9 (R 1D E S)Q: Use the LHS or RHS?8-1 σA 5 (R) 1D E σG 9 (S)

How to deal with transactions?Transactions with the ideal ACID properties resolve thesemantic problems that arise when many concurrent usersaccess and change the same database. 9-1Atomicity ( recovery)ConsistencyIsolation ( concurrency control)Durability

How to deal with transactions?Transactions with the ideal ACID properties resolve thesemantic problems that arise when many concurrent usersaccess and change the same database. Atomicity ( recovery)ConsistencyIsolation ( concurrency control)DurabilityWe will talk about how transactions are implemented usinglocking and timestamp mechanisms.This knowledge is useful in database programming, e.g., itmakes it possible in some cases to avoid (or reduce)rollbacks of transactions, and generally make transactionswait less for each other.9-2

SummarizeDatabase Logic(express the query)System(implementation)Algorithm(solve the query)10-1

SummarizeDatabase Logic(express the query)System(implementation)Algorithm(solve the query)ImplementationConcept (our focus)10-2(see B662 Database Systemand Internal Design)

SummarizeDatabase Logic(express the query)Data Representation, RelationalAlgebra, SQL (Datalog), etc.System(implementation)Algorithm(solve the query)Indexing, Query Optimization,Concurrency Control, etc.ImplementationConcept (our focus)10-3(see B662 Database Systemand Internal Design)

SummarizeDatabase Logic(express the query)Data Representation, RelationalAlgebra, SQL (Datalog), etc.And you need math!!System(implementation)Algorithm(solve the query)Indexing, Query Optimization,Concurrency Control, etc.ImplementationConcept (our focus)10-4(see B662 Database Systemand Internal Design)

What’s more in this course?11-1

Advanced topicsBeyond ”SQL, Relational Algebra, Data Models,Storage, Views and Indexing, Query Processing, QueryOptimization, Transaction Recovery, ConcurrencyControl”I will give you a taste of1. Data Privacy (Data Suppression, DifferentialPrivacy)2. External Memory a.k.a. I/O-EfficientAlgorithms (Sorting, List Ranking)3. Streaming Algorithms (Sampling, Heavy Hitters,Distinct Elements)4. Data Integration / Cleaning (Deduplication)5. MapReduce12-1

Other important topics in databasesMore but probably will not cover1.2.3.4.Tree-based data models e.g., XMLGraph-based data models e.g., RDFSpatial databasesParallel and Distributed databasespartly covered in MapReduce5. Social Networks6. Uncertainty in databasesetc.13-1

Tentative course planPart 0 : IntroductionsPart 1 & 2 : Basics– SQL, Relational Algebra– Data Models, Storage, IndexingPart 3 : OptimizationPart 4 : TrasactionsPart 5 : Data PrivacyPart 6 : I/O-Efficient AlgorithmsPart 7 : Streaming AlgorithmsPart 8 : Data IntegrationPart 9 : MapReduce14-1

Tentative course planPart 0 : IntroductionsPart 1 & 2 : Basics– SQL, Relational Algebra– Data Models, Storage, IndexingPart 3 : OptimizationPart 4 : TrasactionsPart 5 : Data PrivacyPart 6 : I/O-Efficient AlgorithmsPart 7 : Streaming AlgorithmsPart 8 : Data IntegrationPart 9 : MapReduceWe will also have some student presentations at the end of the course14-2

ResourcesMain reference book (we will go beyond this) Database Systems: The Complete Bookby Hector Garcia-Molina, Jeff Ullmanand Jennifer Widom, 2nd EditionOther reference books (undergrad textbooks . ) Database Management Systemsby Ramakrishnan and Guhrke, 3rd Edition Database System Conceptsby UllSilberschatz, Korth and Sudarshan,6th Edition15-1

Resources (cont.)Other reference books (cont.): Readings in Database Systems “Red book”Hellerstein and Stonebraker, eds., 4th Edition(Will be one of our readings) Foundations of Databases: The Logical Level“Alice book”by Abiteboul, Hull, Vianu Concurrency Control and Recovery in DatabaseSystems aby Bernstein, Hadzilacos, Goodmana 16-1ccontrol.aspx

Resources (cont.)Other reference books (cont.): Algorithms and Data Structuresfor External Memory aby Vittera http://www.ittc.ku.edu/ jsv/Papers/Vit.IO book.pdf Data Streams: Algorithms and Applicationsby S. Muthukrishnana http://www.cs.rutgers.edu/17-1amuthu/stream-1-1.ps

Resources (cont.)Other reference books (cont.): Algorithms and Data Structuresfor External Memory aby Vittera http://www.ittc.ku.edu/ jsv/Papers/Vit.IO book.pdf Data Streams: Algorithms and Applicationsby S. Muthukrishnana e are surely not enough, and sometimes dated. Doyou want to learn more? Reading original papers!In fact, some of my slides are directly from VLDB toturials17-2

InstructorsInstructor: Qin ZhangEmail: qzhangcs@indiana.eduOffice hours: Tuesday 2:45pm-3:45pm(Lindley 215E temporary, then Lindley 430A)Associate Instructors: Erfan Sadeqi AzerLe LiuYifan PanAli VarameshPrasanth VelamalaOffice hours: Posted on course website18-1

GradingAssignments 50% : Three written assignments (each 10%).Solutions should be typeset in LaTeX(highly recommended) or Word.And one reading assignment (20%)(next slide for details)Selected/volunteer students will givepresentationsExams 50% : Mid-term (20%) and Final (30%).19-1

GradingAssignments 50% : Three written assignments (each 10%).Solutions should be typeset in LaTeX(highly recommended) or Word.And one reading assignment (20%)(next slide for details)Selected/volunteer students will givepresentationsExams 50% : Mid-term (20%) and Final (30%).Use A, B, . . . for each item (assignments, exams). Final gradewill be a weighted average (according to XX%).19-2

Reading assignmentOne or a group of two read some (1 to ) papers/surveys/articlesand write a report (4 pages for one, and 8 pages for a group of two)on what you think of the articles you read (not just a repeat of whatthey have said).Topics can be found in d more topics on the course website “More reading topics” (googlethe papers / surveys yourself; contact AI if you cannot find it).Selected students/groups (volunteer first) will give 25mins talks(20mins presentation 5mins Q&A) in class. The best 1/3individuals/groups will get a bonus in their final grades. A penaltywill be given if you agree to give a talk but cannot do at the end,while the quality of the talk is irrelevant.20-1

LaTeXLaTeX: Highly recommended tools forassignments/reports1. Read wiki articles:http://en.wikipedia.org/wiki/LaTeX2. Find a good LaTeX editor.3. Learn how to use it, e.g., read “A Not So ShortIntroduction to LaTeX 2e” (Google it)21-1

PrerequisiteParticipants are expected to have a background inalgorithms and data structures. For example, have taken1. C241 Discrete Structures for Computer Science2. C343 Data Structures3. B403 Introduction to Algorithm Design and Analysisor equivalent courses, and know some basics of databases.22-1

Frequently asked questionsIs this a course good for my job hunting in industry?Yes, if you get to know some advanced concepts in databases, thatwill certainly help.But, this is a course on theoretical foundations of databases, butnot designed for teaching commercially available techniquesand not a programming language (SQL? PHP?) course,and not a “hands on” course (this is not a course for professionaltraining; this is a graduate course in a major research universitythus should be much more advanced)23-1

Frequently asked questionsIs this a course good for my job hunting in industry?Yes, if you get to know some advanced concepts in databases, thatwill certainly help.But, this is a course on theoretical foundations of databases, butnot designed for teaching commercially available techniquesand not a programming language (SQL? PHP?) course,and not a “hands on” course (this is not a course for professionaltraining; this is a graduate course in a major research universitythus should be much more advanced)I haven’t taken B403 “Introduction to Algorithm Designand Analysis” or equivalent courses. Can I take thecourse? Or, will this course fit me?23-2Generally speaking, this is an advanced course. It will be difficultif you do not have enough background. You can take intoconsideration the touch-base exam.

The goal of this courseOpen / change yourviews of the world(of databases)24-1

The goal of this courseOpen / change yourviews of the world(of databases)Seriously, it is not just SQL programming.Read “The relational model is dead, SQL is dead,and I don’t feel so good myself”24-2

Big Data25-1

Big DataBig data is everywhere .26-1: over 2.5 petabytes of sales transactions: an index of over 19 billion web pages: over 40 billion of pictures

Big DataBig data is everywhere .: over 2.5 petabytes of sales transactions: an index of over 19 billion web pages: over 40 billion of picturesMagazine coversNature ’0626-2Nature ’08CACM ’08Economist ’10

Source and challengeSource Retailer databases: Amazon, Walmart Logistics, financial & health data: Stock prices Social network: Facebook, twitter Pictures by mobile devices: iphone Internet traffic: IP addresses New forms of scientific data: Large Synoptic Survey Telescope27-1

Source and challengeSource Retailer databases: Amazon, Walmart Logistics, financial & health data: Stock prices Social network: Facebook, twitter Pictures by mobile devices: iphone Internet traffic: IP addresses New forms of scientific data: Large Synoptic Survey TelescopeChallenge Volume Velocity Variety (Documents, Stock records, Personal profiles,Photographs, Audio & Video, 3D models, Location data, . . . )27-2

Source and challengeSource Retailer databases: Amazon, Walmart Logistics, financial & health data: Stock prices Social network: Facebook, twitter Pictures by mobile devices: iphone Internet traffic: IP addresses New forms of scientific data: Large Synoptic Survey TelescopeChallenge Volume Velocity} The main technical challenges Variety (Documents, Stock records, Personal profiles,Photographs, Audio & Video, 3D models, Location data, . . . )27-3

What does Big Data really mean?We don’t define Big Data in terms of TB, PB, EB, . . .The data is too big to fit in memory. What can we do?28-1

What does Big Data really mean?We don’t define Big Data in terms of TB, PB, EB, . . .The data is too big to fit in memory. What can we do?Store them in the disk, and read/write a block of data each time28-2

What does Big Data really mean?We don’t define Big Data in terms of TB, PB, EB, . . .The data is too big to fit in memory. What can we do?Store them in the disk, and read/write a block of data each timeProcessing one by one as they come,and throw some of them away on the fly.28-3

What does Big Data really mean?We don’t define Big Data in terms of TB, PB, EB, . . .The data is too big to fit in memory. What can we do?Store them in the disk, and read/write a block of data each timeProcessing one by one as they come,and throw some of them away on the fly.Store in multiple machines, which collaborate via communication28-4

What does Big Data really mean?We don’t define Big Data in terms of TB, PB, EB, . . .The data is too big to fit in memory. What can we do?Store them in the disk, and read/write a block of data each timeProcessing one by one as they come,and throw some of them away on the fly.Store in multiple machines, which collaborate via communicationRAM model does not fitRAMA processor and an infinite size memoryProbing each cell of the memory has a unit cost28-5CPU

Big Data:A marketing buzzword?29-1

Big Data:A marketing buzzword?A good reading topic29-2

Popular models for big data(see another slides)30-1

Summary for the introductionWe have discussed topics that will be covered in thiscourseWe have introduced some models for big datacomputation.We have talked about the course plan and assessment.31-1

Thank you!Questions?A few introductory slides are based on RasmusPagh’s slideshttp://www.itu.dk/people/pagh/ADBT06/32-1

Algorithms for Big Data: streaming/sketching algorithms; algorithms on distributed data; I/O-e cient algorithms; data structures; Complexity: communication complexity. I am a theoretician, and occasionally work on databases and data mining You may ask: \why do you teach databases (and probably make our lives harder)"?