Lecture 5 SQL - Thodrek.github.io

Transcription

CS639:Data Management forData ScienceLecture 4: SQL for Data ScienceTheodoros Rekatsinas1

Announcements Assignment 1 is due tomorrow (end of day) Any questions?PA2 is out. It is due on the 19th Start early JAsk questions on PiazzaGo over activities and reading before attemptingOut of town for the next two lectures. We will resume on Feb 13th.2

Today’s Lecture1. Finish Relational Algebra (slides in previous lecture)2. Introduction to SQL3. Single-table queries4. Multi-table queries5. Advanced SQL3

1. Introduction to SQL4

SQL Motivation But why use SQL? The relational model of data is the most widely used model today Main Concept: the relation- essentially, a tableRemember: The reason for using therelational model is data independence!Logical data independence:protection from changes in thelogical structure of the dataSQL is a logical, declarative query language. We use SQL becausewe happen to use the relational model.

Basic SQL6

SQL Introduction SQL is a standard language for querying and manipulating data SQL is a very high-level programming language This works because it is optimized well!SQL stands forStructured Query Language Many standards out there: ANSI SQL, SQL92 (a.k.a. SQL2), SQL99 (a.k.a. SQL3), . Vendors support various subsetsProbably the world’s most successful parallelprogramming language (multicore?)

SQL is a Data Definition Language (DDL) Define relational schemata Create/alter/delete tables and their attributes Data Manipulation Language (DML) Insert/delete/modify tuples in tables Query one or more tables – discussed next!8

Tables in SQLProductPNamePriceManufacturerGizmo 19.99GizmoWorksPowergizmo 29.99GizmoWorksSingleTouch 149.99CanonMultiTouch 203.99HitachiA relation or table is amultiset of tupleshaving the attributesspecified by the schemaLet’s break thisdefinition down9

Tables in SQLProductPNamePriceManufacturerGizmo 19.99GizmoWorksPowergizmo 29.99GizmoWorksSingleTouch 149.99CanonMultiTouch 203.99HitachiA multiset is anunordered list (or: a setwith multiple duplicateinstances allowed)List:[1, 1, 2, 3]Set:{1, 2, 3}Multiset: {1, 1, 2, 3}i.e. no next(), etc. methods!10

Tables in SQLProductPNamePriceManufacturerGizmo 19.99GizmoWorksPowergizmo 29.99GizmoWorksSingleTouch 149.99CanonMultiTouch 203.99HitachiAn attribute (or column)is a typed data entrypresent in each tuple inthe relationAttributes must have an atomictype in standard SQL, i.e. not alist, set, etc.11

Tables in SQLProductPNamePriceManufacturerGizmo 19.99GizmoWorksPowergizmo 29.99GizmoWorksSingleTouch 149.99CanonMultiTouch 203.99HitachiAlso referred to sometimes as a recordA tuple or row is asingle entry in the tablehaving the attributesspecified by the schema12

Tables in SQLProductPNamePriceManufacturerGizmo 19.99GizmoWorksPowergizmo 29.99GizmoWorksSingleTouch 149.99CanonMultiTouch 203.99HitachiThe number of tuples isthe cardinality of therelationThe number ofattributes is the arity ofthe relation13

Data Types in SQL Atomic types: Characters: CHAR(20), VARCHAR(50) Numbers: INT, BIGINT, SMALLINT, FLOAT Others: MONEY, DATETIME, Every attribute must have an atomic type Hence tables are flat14

Table Schemas The schema of a table is the table name, its attributes, and theirtypes:Product(Pname: string, Price: float, Category:string, Manufacturer: string) A key is an attribute whose values are unique; we underline a keyProduct(Pname: string, Price: float, Category:string, Manufacturer: string)15

Key constraintsA key is a minimal subset of attributes that acts as aunique identifier for tuples in a relation A key is an implicit constraint on which tuples can be in the relation i.e. if two tuples agree on the values of the key, then they must bethe same tuple!Students(sid:string, name:string, gpa: float)1. Which would you select as a key?2. Is a key always guaranteed to exist?3. Can we have more than one key?

NULL and NOT NULL To say “don’t know the value” we use NULL NULL has (sometimes painful) semantics, more details laterStudents(sid:string, name:string, gpa: float)sid123nameBobgpa3.9143JimNULLSay, Jim just enrolled in his first class.In SQL, we may constrain a column to be NOT NULL, e.g., “name” in this table

General Constraints We can actually specify arbitrary assertions E.g. “There cannot be 25 people in the DB class” In practice, we don’t specify many such constraints. Why? Performance!Whenever we do something ugly (or avoid doing somethingconvenient) it’s for the sake of performance

Go over Activity 2-119

2. Single-table queries20

SQL Query Basic form (there are many many more bells and whistles)SELECT attributes FROM one or more relations WHERE conditions Call this a SFW query.21

Simple SQL Query: SelectionSelection is the operationof filtering a relation’stuples on some conditionPNamePriceCategoryManufacturerGizmo 19.99GadgetsGizmoWorksPowergizmo 29.99GadgetsGizmoWorksSingleTouch 149.99PhotographyCanonMultiTouch erGizmo 19.99GadgetsGizmoWorksPowergizmo 29.99GadgetsGizmoWorksSELECT *FROMProductWHERE Category ‘Gadgets’22

Simple SQL Query: ProjectionProjection is theoperation of producing anoutput table with tuplesthat have a subset of theirprior attributesPNamePriceCategoryManufacturerGizmo 19.99GadgetsGizmoWorksPowergizmo 29.99GadgetsGizmoWorksSingleTouch 149.99PhotographyCanonMultiTouch 203.99HouseholdHitachiSELECT Pname, Price, ManufacturerFROMProductWHERE Category ‘Gadgets’PNamePriceManufacturerGizmo 19.99GizmoWorksPowergizmo 29.99GizmoWorks23

NotationInput schemaProduct(PName, Price, Category, Manfacturer)SELECT Pname, Price, ManufacturerFROMProductWHERE Category ‘Gadgets’Output schemaAnswer(PName, Price, Manfacturer)24

A Few Details SQL commands are case insensitive: Same: SELECT, Select, select Same: Product, product Values are not: Different: ‘Seattle’, ‘seattle’ Use single quotes for constants: ‘abc’ - yes “abc” - no25

LIKE: Simple String Pattern MatchingSELECT *FROMProductsWHERE PName LIKE ‘%gizmo%’s LIKE p: pattern matching on stringsp may contain two special symbols: % any sequence of characters any single character26

DISTINCT: Eliminating DuplicatesSELECT DISTINCT yHouseholdCategorySELECT old27

ORDER BY: Sorting the ResultsSELECTFROMWHEREORDER BYPName, Price, ManufacturerProductCategory ‘gizmo’ AND Price 50Price, PNameTies are broken by thesecond attribute on theORDER BY list, etc.Ordering is ascending,unless you specify theDESC keyword.28

Go over Activity 2-229

3. Multi-table queries30

Foreign Key constraints Suppose we have the following schema:Students(sid: string, name: string, gpa: float)Enrolled(student id: string, cid: string, grade: string) And we want to impose the following constraint: ‘Only bona fide students may enroll in courses’ i.e. a studentmust appear in the Students table to enroll in a student idcidgrade123564A123537A student id alone is not akey- what is?We say that student id is a foreign key that refers to Students

Declaring Foreign KeysStudents(sid: string, name: string, gpa: float)Enrolled(student id: string, cid: string, grade: string)CREATE TABLE Enrolled(student id CHAR(20),cidCHAR(20),gradeCHAR(10),PRIMARY KEY (student id, cid),FOREIGN KEY (student id) REFERENCES Students(sid))

Foreign Keys and update operationsStudents(sid: string, name: string, gpa: float)Enrolled(student id: string, cid: string, grade: string) What if we insert a tuple into Enrolled, but no correspondingstudent? INSERT is rejected (foreign keys are constraints)! What if we delete a student?DBA chooses (syntax in the book)1. Disallow the delete2. Remove all of the courses for that student3. SQL allows a third via NULL (not yet covered)

Keys and Foreign non65JapanHitachi15JapanWhat is aforeign key vs.a key here?ProductPNamePriceCategoryManufacturerGizmo 19.99GadgetsGizmoWorksPowergizmo 29.99GadgetsGizmoWorksSingleTouch 149.99PhotographyCanonMultiTouch 203.99HouseholdHitachi34

JoinsProduct(PName, Price, Category, Manufacturer)Company(CName, StockPrice, Country)Ex: Find all products under 200 manufactured in Japan;return their names and prices.Note: we will often omitattribute types in schemadefinitions for brevity, butassume attributes arealways atomic typesSELECT PName, PriceFROMProduct, CompanyWHERE Manufacturer CNameAND Country ‘Japan’AND Price 20035

JoinsProduct(PName, Price, Category, Manufacturer)Company(CName, StockPrice, Country)Ex: Find all products under 200 manufactured in Japan;return their names and prices.SELECT PName, PriceFROMProduct, CompanyWHERE Manufacturer CNameAND Country ‘Japan’AND Price 200A join between tables returnsall unique combinations oftheir tuples which meetsome specified join condition36

JoinsProduct(PName, Price, Category, Manufacturer)Company(CName, StockPrice, Country)Several equivalent ways to write a basic join in SQL:SELECT PName, PriceFROMProduct, CompanyWHERE Manufacturer CNameAND Country ‘Japan’AND Price 200SELECT PName, PriceFROMProductJOINCompany ON Manufacturer CnameAND Country ‘Japan’WHERE Price 20037

JoinsProductPNamePriceCategoryManufGizmo 19GadgetsGWorksPowergizmo 29GadgetsGWorksSingleTouch 149 PhotographyCanonMultiTouch 203HitachiHouseholdSELECT PName, PriceFROMProduct, CompanyWHERE Manufacturer CNameAND Country ‘Japan’AND Price Hitachi15JapanPNamePriceSingleTouch 149.9938

Tuple Variable Ambiguity in Multi-TablePerson(name, address, worksfor)Company(name, address)SELECT DISTINCT name, addressFROMPerson, CompanyWHEREworksfor nameWhich “address” doesthis refer to?Which “name”s?39

Tuple Variable Ambiguity in Multi-TablePerson(name, address, worksfor)Company(name, address)Both equivalentways to resolvevariableambiguitySELECT DISTINCT Person.name, Person.addressFROMPerson, CompanyWHEREPerson.worksfor Company.nameSELECT DISTINCT p.name, p.addressFROMPerson p, Company cWHEREp.worksfor c.name40

Meaning (Semantics) of SQL QueriesSELECT x1.a1, x1.a2, , xn.akFROMR1 AS x1, R2 AS x2, , Rn AS xnWHERE Conditions(x1, , xn)Almost never the fastest wayto compute it!Answer {}for x1 in R1 dofor x2 in R2 do .for xn in Rn doif Conditions(x1, , xn)then Answer Answer È {(x1.a1, x1.a2, , xn.ak)}return AnswerNote: this is a multiset union41

An example of SQL semanticsSELECT R.AFROMR, SWHERE R.A putApplySelections /ConditionsA33ApplyProjectionA B C3 3 43 3 542

Note the semantics of a join1. Take cross product:𝑋 𝑅 𝑆SELECT R.AFROMR, SWHERE R.A S.BRecall: Cross product (A X B) is the set of allunique tuples in A,BEx: {a,b,c} X {1,2} {(a,1), (a,2), (b,1), (b,2), (c,1), (c,2)}2. Apply selections / conditions: Filtering!3. Apply projections to get final output: Returning only some attributes𝑌 𝑟, 𝑠 𝑋 𝑟. 𝐴 𝑟. 𝐵}𝑍 (𝑦. 𝐴, ) 𝑓𝑜𝑟 𝑦 𝑌Remembering this order is critical to understanding theoutput of certain queries (see later on )43

Note: we say “semantics” not “executionorder” The preceding slides show what a join means Not actually how the DBMS executes it under the covers

Go over Activity 2-345

4. Advanced SQL46

Set Operators and NestedQueries47

An Unintuitive QuerySELECT DISTINCT R.AFROMR, S, TWHERE R.A S.A OR R.A T.AWhat does it compute?STRComputes R Ç (S È T)But what if S f?Go back to the semantics!48

An Unintuitive QuerySELECT DISTINCT R.AFROMR, S, TWHERE R.A S.A OR R.A T.A Recall the semantics!1.2.3.Take cross-productApply selections / conditionsApply projection If S {}, then the cross product of R, S, T {}, and the query result {}!Must consider semantics here.Are there more explicit way to do set operations like this?49

What does this look like in Python?SSELECT DISTINCT R.AFROMR, S, TWHERE R.A S.A OR R.A T.A Semantics:R Ç (S È T)TR1. Take cross-productJoins / cross-products are just nested forloops (in simplest implementation)!2. Apply selections / conditionsIf-then statements!3. Apply projection50

What does this look like in Python?SSELECT DISTINCT R.AFROMR, S, TWHERE R.A S.A OR R.A T.AR Ç (S È T)TRoutput {}for r in R:for s in S:for t in T:if r[‘A’] s[‘A’] or r[‘A’] t[‘A’]:output.add(r[‘A’])return list(output)Can you see now what happens if S []?51

Multiset operations52

Recall Multisets𝝀 𝑿 “Count of tuple in X”(Items not listed haveimplicit count 0)Multiset XTupleMultiset X(1, a)Tuple𝝀(𝑿)(1, a)(1, a)2(1, b)1(2, c)3(1, d)2(1, b)(2, c)(2, c)(2, c)(1, d)(1, d)EquivalentRepresentationsof a MultisetNote: In a set allcounts are {0,1}.53

Generalizing Set Operations to MultisetOperationsMultiset XMultiset YMultiset ZTuple𝝀(𝑿)Tuple𝝀(𝒀)Tuple𝝀(𝒁)(1, a)2(1, a)5(1, a)2(1, b)0(1, b)1(1, b)0(2, c)3(2, c)2(2, c)2(1, d)0(1, d)2(1, d)0 𝝀 𝒁 𝒎𝒊𝒏(𝝀 𝑿 , 𝝀 𝒀 )For sets, this isintersection54

Generalizing Set Operations to MultisetOperationsMultiset XMultiset YTuple𝝀(𝑿)(1, a)2(1, b)0(2, c)(1, d)Multiset ZTuple𝝀(𝒀)Tuple𝝀(𝒁)(1, a)5(1, a)5(1, b)1(1, b)13(2, c)2(2, c)30(1, d)2(1, d)2 𝝀 𝒁 𝒎𝒂𝒙(𝝀 𝑿 , 𝝀 𝒀 )For sets,this is union55

sMultiset Operations in SQL56

Explicit Set Operators: INTERSECTSELECT R.AFROMR, SWHERE R.A S.AINTERSECTSELECT R.AFROMR, TWHERE R.A T.A𝑟. 𝐴 𝑟. 𝐴 𝑠. 𝐴 𝑟. 𝐴 𝑟. 𝐴 𝑡. 𝐴}Q1Q257

UNIONSELECTFROMWHEREUNIONSELECTFROMWHERER.AR, SR.A S.A𝑟. 𝐴 𝑟. 𝐴 𝑠. 𝐴 𝑟. 𝐴 𝑟. 𝐴 𝑡. 𝐴}R.AR, TR.A T.AQ1Q2Why aren’t thereduplicates?What if we wantduplicates?58

sUNION ALLSELECT R.AFROMR, SWHERE R.A S.AUNION ALLSELECT R.AFROMR, TWHERE R.A T.A𝑟. 𝐴 𝑟. 𝐴 𝑠. 𝐴 𝑟. 𝐴 𝑟. 𝐴 𝑡. 𝐴}Q1Q2ALL indicatesthe Multisetdisjoint unionoperation59

Generalizing Set Operations to Multis

We will resume on Feb 13 th. Today’s Lecture 1. Finish Relational Algebra (slides in previous lecture) 2. Introduction to SQL 3. Single-table queries 4. Multi-table queries 5. Advanced SQL 3. 1. Introduction to SQL 4. SQL Motivation But why use SQL? The relational model of datais the most widely used model today Main Concept: the relation-essentially, a table Logical data .