File Organization And Access Methods

Transcription

Unit 11File Organization andAccess Methods11-1

PART I: Overview DB2SQL (The Relational Model) (The Hierarchical Model) (The Network Model) PART II: (Database Design)E-R Model PART III: Wei-Pang Yang, Information Management, NDHU(Access Methods)(Database Recovery)(Concurrency Control)(Security and Integrity)(Query Optimization)(Distributed Database)Unit 11 File Organization and Access Methods11-2

PART III:(Access Methods): B-tree index. (Database Recovery):(Concurrency Control):(Security and Integrity): (Distributed Database):E-R Model:Generalization ;(Query Optimization): DBMSE-R ModelE-R Model(Extended E-R BCNFWei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-3

Contents of PART III: Unit 11 Access Methods Unit 12 Database Recovery Unit 13 Concurrency Control Unit 14 Security and Integrity Unit 15 Query Optimization Unit 16Unit 17Unit 18Unit 19Unit 20 Distributed DatabaseMore on E-R ModelMore on NormalizationMore on User InterfacesMore on X?References:1. C. J. Date, An Introduction to Database Systems, 8th edition, Addison-Wesley, 2004.2. A. Silberschatz, etc., Database System Concepts, 5th edition, McGraw Hill, 2006.3. J. D. Ullman and J. Widom, A First Course in Database Systems, 3rd edition, Prentice Hall, 2007.4. Cited papers ()Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-4

Contents 11.1 Introduction 11.2 Indexing 11.3 Hashing 11.4 Pointer Chains 11.5 Compression Techniques 11.6 Differential File OrganizationWei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-5

大量資料存取方法之研究Approaches to Access/Store Large Data國立交通大學資訊科學系所教授11-6

11.1 Introduction to Access Methods11-7

The Role of Access Method in DBMSQuery in SQL:SELECT CUSTOMER. NAMEFROM CUSTOMER, INVOICEWHERE REGION 'N.Y.' ANDAMOUNT 10000 ANDCUSTOMER.C# INVOICE.C#DBMSLanguage ProcessorInternal Form:P(s (SSP)OptimizerOperator:SCAN C using region index, create CSCAN I using amount index, create ISORT C?and I?on C#JOIN C?and I?on C#EXTRACT name fieldCalls to Access Method:OPEN SCAN on C with region indexGET next tuple.Calls to file system:GET10th to 25th bytes fromblock #6 of file #5Operator ProcessorAccess MethodFile SystemdatabaseWei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-8

The Internal Level Objectives:- concern the way the data is actually stored.- store data on direct access media. e.g. disk.- minimize the number of disk access (disk I/O).- disk access is much slower than main storage access time. MainStorage Structure/File Structure:- many different storage structures:CPUBuffer I/ODiskindexindex e.g indexing, hashing, pointer chains, - different structures have different performance no single structure is optimal for all applications. a good DBMS should support a variety of different structures (Access Methods)Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-9

The Internal Level (cont.) Physical database design: Process of choosing an appropriate storage representation for agiven database (by DBA). E.g. designing B-tree or hashing Nontrivial task require a good understanding of how the database will beused. Logical database design:(Unit 6,7)SPSP-SPSPdataWei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-10

11.2 Indexing (1)11-11

Indexing: Introduction Consider the Supplier table, S. Suppose "Find all suppliers in city xxx" is an important query.i.e. it is frequency executed. DBA might choose the stored representation as Fig. 11.2.City-Index (index)S (indexed AthensFig. 11.2: Indexing the supplier file on CITY.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-12

Indexing: Introduction (cont.) Now the DBMS has two possible strategies: 1 Search S, looking for all records with city 'xxx'. 2 Search City-Index for the desired entry. Advantage: speed up retrieval. index file is sorted. fewer I/O's because index file is smaller. Disadvantages: slow down updates. both index and indexed file should be updated.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-13

Indexing: Multiple Fields Primary index : index on primary key. e.g s# Secondary index: index on other field. e.g city A given table may have any number of risS4Clark20London30ParisS5Adams30Athens30Fig. 11.3: Indexing the supplier file on both CITY and STATUS.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-14

How Index are used?Consider:London 1 Sequential access :S (indexed file)accessed in the sequence defined byvalues of the indexed field.S1 . . London e.g Range query : "Find the suppliers whoseS2 . . Pariscity begins with a letter in the range L-R."LondonS3 . . ParisParisS4 . . LondonParisS5 . . AthensCity-IndexAthens 2 Direct Access : e.g "Find suppliers in London." e.g list query:"Find suppliers whose cityis in London, Paris, and N.Y." 3 Existence test : e.g "Is there any supplier in London ?"Note: It can be done from the index alone.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-15

Indexing on Field Combinations To construct an index on the basis of values of two or more fields. e.g. City/Status-IndexSAthens/30S1Smith20LondonLondon /20S2Jones10ParisLondon /20S3Blake30ParisParis /10S4Clark20LondonParis/30S5Adams30 AthensQuery: “Find suppliers in Paris with status 30.”- on city/status index: a single scan of a single index.- on two separate indexes: two index scan still difficult. (Fig. 11.3)Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-16

Query1: “Find suppliers in Paris with status 30”Query2: “Find suppliers with status 30”Query3: “Find suppliers in Paris”Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-17

Indexing on Field Combinations (cont.) Note 1. Combined city/status index can serve as an index onthe city field alone.2. In general, an index on the combination of fieldsF1 F2 .Fn can serve as indexes on- F1 alone- F1 F2 or F2 F1- F1 F2 F3 or any combinations.- Think : How many indexes can F1.Fn serve as ?Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-18

11.2 Indexing (2) Dense vs. Nondense IndexingB-tree and B -tree11-19

Dense V.S. Nondense Indexing Assume the Supplier file (S) is clustered on S# .Index (dense)Spage1S1.S1 S2 S3 S4 S5 S6page3page2S2.S4.S3.S5.S6.S S# SNAME STATUSCity indexS1S3S5A L L P LondonParisParisLondonAthensIndex (nondense)Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-20

Dense V.S. Nondense Indexing (cont.) Nondense index: not contain an entry for every record in theindexed file. retrieval steps: 1 scan the index (nondense) to get page # , say p. 2 retrieve page p and scan it in main storage. advantages: 1 occupy less storage than a corresponding dense index 2 quicker to scan. disadvantages: can not perform existence test via index alone. Note: At most only one nondense index can be constructed.(why?) Clustering: logical sequence physical sequenceWei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-21

B-tree Introduction: is a particular type of multi-level (or tree structured) index.proposed by Bayer and McCreight in 1972.the commonest storage structure of allin modern DBMS. Definition: (from Horowitz "Data Structure")A B-tree T of order m is an m-way search tree, such that 1 the root node has at least 2 children. 2 non-leaf nodes have at least [m/2] children. 3 all leave nodes are at the same level. Goal: maintain balance of index tree by dynamically restructuringthe tree as updates proceed.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-22

B-tree (cont.) Definition: (from Horowitz "Data Structure")A B-tree T of order m is an m-way search tree, such that 1 the root node has at least 2 children. 2 non-leaf nodes have at least [m/2] children. 3 all leave nodes are at the same level. Goal: maintain balance of index tree by dynamically restructuring the tree asupdates proceed.5082100123258708994100x100 10,00068 1215 18 3235 40 5051 52 5860 62 7071 78 8283 85 8991 93 9496 97 99100x100x100 1,000,000pointers to data recordsWei-Pang Yang, Information Management, NDHUpointers to data recordspointers to data recordsUnit 11 File Organization and Access Methods11-23

B -tree (Knuth's variation)5082index set1232587089 94(nondense)68 1215 18 3235 40 5051 52 5860 62 7071 78 8283 85 8991 93 9496 97 99Sequence set- index set: provides fast direct access to the sequential setand thus to the data too.- sequence set: provides fast sequential access to the indexed data.(with pointersto data records)(dense or nondense) Other variations: B*-tree, B'-tree,.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-24

11.3 Hashing 11.3.1 Dynamic Hashing 11.3.2 Extendible Hashing 11.3.3 Linear Hashing11-25

Hashing: Introduction Hashing (or Hash Addressing ) is a technique for providing fast direct access to a specific stored record on the basis of a given value for some fields.The field is usually but not necessarily the primary keyKey SpaceAddress Space (disk)h(k3) 38h(k) k mod 100k1 21823238h(k)k2 5199218282k3 3238h(k1) 82k4 6582Input:3238hash field kK5 6638Wei-Pang Yang, Information Management, NDHU38519999h(k2) 99h: map key set to address space.Unit 11 File Organization and Access Methods11-26

Hashing: Introduction (cont.) Basic Idea: Apply key-to-address transformation to determine in which bucket a recordshould be placed. partition storage space into buckets, each holds one or more records. handle bucket overflow to store: DBMS computes the hash address (RID or page #) for the new record. DBMS instructs the file manager to place the record at that position. to retrieve: given a key, the DBMS computes the hash address as before Using the computed address, DBMS instructs the file manager to fetch therecord. Advantages: fast. no space overhead as index method Disadvantages: physical sequence primary key sequence. Collisions: f(k1) f(k2), k1 k2Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-27

Address Transformation Algorithms Convert key value into value of appropriate magnitude. e.g 'Smith‘ Asc('s') Asc('m') Asc('i') Asc('t‘) Asc('h') Common algorithms: Division Method:Hash functionH(k) modulo (k /n)e.g. H(k) k mod 100 Mid-square method:H(k) central digits of K 2e.g. k 525 K 2 275625K1K2.K10n0000001F(x) x9999999Too waste!!m Others: . Observation:0F(k) k mod 120α n/m 0.8loading factor Division method is good enough.n 100120mWei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-28

Overflow Handling in Hashing Overflow chaining: allocate new bucket and chain to overflow ket.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-29

Overflow Handling in Hashing (cont.) Open addressing: make use of empty slots in the next bucket.901821h(471)242471}bucket 1 full}bucket 2 not full many variations Perfect hashing: one-to-one mapping.hk1k2k3k4Wei-Pang Yang, Information Management, NDHUk4k2k1k3Unit 11 File Organization and Access Methods11-30

Perfect Hash Function Rank Methods[Ghosh 77] h(b1 , b2 , .,bn ) m b1 t( 1 ; b1 ) b2 * ( 1, 2 ; b1 , b 2) . Reduction Methods[Sprugnoli 77] h(ki ) [Sprugndi 77] h(k) ( ki S ) / N(( d kg )mod m) / N Value Assignment Methods[Cichelli 80] Hash valuekey length f(1st char) f(last c)[Jaeschke 80] Counter-example[Cook 82] improve Cichelli's method Reprocal Methods[Jaeschke 81] h(k) C / ( Dk E ) mod n[Chang 84] h(ki ) C mod P(ki ) Hash Indicator Table [HIT] Methods[Du 80, Du 83] h(ki ) hj(ki ) xi if HIT[ht(ki )] t t j and HIT[hj(ki )] j[Yang 83, Yang 85]Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-31

Perfect Hash Function (cont.)BIT 25(1985), 148-161A BACKTRACKING METHOD FOR CONSTRUCTINGPERFECT HASH FUNCTIONS FROM A SET OFMAPPING FUNCTIONSW. P. YANG and M. W. DUInstitute of Computer Engineering, National Chiao Tung University, 45 Po Ai Street, HsinChu,Taiwan, Republic of ChinaWei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-32

11.3.1 Dynamic Hashing Definition Dynamic Hashing: in the hashing scheme the set of keys can be varied, and the addressspace is allocated dynamically.ASKSk1n mα 0.8 n/m.kn.kn12k1k212hASKSm.hmkn' How to achieve it ?m' using a auxiliary table (e.g. index tree, bit-map table, prefix tree, directory, .)KSk1k2 Problems? size (utilization) retrieval time (disk access times) algorithmsWei-Pang Yang, Information Management, NDHU.Unit 11 File Organization and Access Methodsauxiliary tableAS1mm'm''11-33

Dynamic Hashing: Schemes(1) Expandable Hashing Knott, G. D. Expandable Open Addressing Hash Table Storage and Retrieval. Proc. ACMSIGFIDET Workshop on Data Description, Access, and Control, 186-206, 1971.(2) Dynamic Hashing Larson, P. A. Dynamic Hashing. BIT 18(1978) ,184-201. Scholl, M. New File Organization Based on Dynamic Hashing. ACM Trans. on DatabaseSystems, 6, 1(March 1981), 194-211.(3) Virtual Hashing Litwin, W. Virtual Hashing: A Dynamically Changing Hashing. Proc. 4th Conf. on Very LargeData Bases, West Berlin, Sept. 1978, 517-523.(4) Linear Hashing Litwin, W. Linear Hashing: A New Tool for File and Table Addressing. Proc. 6th Conf. onVery Large Data Bases, 212-223, Montreal, Oct. 1980. Larson, P. A. Linear Hashing with Partial Expansions. Proc. 6th Conf. on Very Large DataBases, Montreal, Oct. 1980, 224-232 Larson, P. A. Performance Analysis of Linear Hashing with Partial Expansions. ACM Trans.on Database Systems, 7, 4(Dec. 1982), 566-587.Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-34

Dynamic Hashing: Schemes (cont.)(5) Trie Hashing Litwin, W. Trie Hashing. Res. Rep. MAP-I-014, I.R.I.A. Le Chesnay, France, 1981. (also in Proc.1981 ACM SIGMOD International Conference on Management of Data)(6) Extendible Hashing Fagin, R., Nievergelt, J., Pippenger, N., and Strong, H. R. Extendible Hashing - A Fast AccessMethod for Dynamic Files. ACM Trans. Database System 4, 3(Sept. 1979), 315-344. Tamminen, M. Extendible Hashing with Overflow. Information Processing Lett. 15, 5(Dec. 1982),227-232. Mendelson, H. Analysis of Extendible Hashing. IEEE Trans. on Software Engineering, SE-8, 6(Nov.1982), 611-619. Yao, A. C. A Note on the Analysis of Extendible Hashing. Information Processing Letter 11,2(1980), 84-86.(7) HIT (Hash Indicator Table) Method Du, M. W., Hsieh, T. M., Jea, K. F., and Shieh, D. W. The Study of a New Perfect Hash Scheme.IEEE Trans. On Software Engineering, SE-9, 3(May 1983), 305-313. Yang, W. P., and Du, M. W. Expandable Single-Pass Perfect Hashing. Proc. of National ComputerSymposium, Taiwan, Dec. 1983, 210-217. Yang, W. P., and Du, M. W. A Dynamic Perfect Hash Function Defined by an Extended HashIndicator Table. Proc. 10th Conf. on Very Large Data Bases, Singapore, Aug. 1984. Yang, W. P. Methods for Constructing Perfect Hash Functions and its Application to the Design ofDynamic Hash Files. Doctor Thesis, National Chiao Tung University, Hsinchu, Taiwan, ROC, June1984.(8) ‧‧‧Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-35

Virtual Hashing(Ref: "Virtual Hashing: A dynamically changing hashing", conf. VLDB 1978) Basic Idea: If a bucket overflows, split it into 2 buckets, and set a bit to remember it.Bit MapBExample: key set:1111111111 0 0 0 0 0 0 1 0 0 0{1366, 1256, 2519, 3546, ., 2916,.}0 1 2 3 4 5 6 7 8 9 10 1112 13 14 15 16 17 18 19levell 0levelh0(key) R(key, 10)BucketsBucket Size 3N 10When insert: 29160Retrieval: 1256, 1366, 2519Wei-Pang Yang, Information Management, NDHU.1366125635466Unit 11 File Organization and Access Methods.25199l 1h1(key) R(key, 20)12562916161911-36

Virtual Hashing (cont.) lGeneral hash function:h R(key,2 ·N )l Algorithm Addressing (key)1. jLevel of h j (Max j used)2. mR(key, 2 l· N )3. while B(m) 0ll –1lmR(key, 2 · N )4. Return (m)Wei-Pang Yang, Information Management, NDHU2 10 100h2(key) R(key, 40)Unit 11 File Organization and Access Methods11-37

11.3.2 Extendible Hashing( Ref: Fagin, R. et. al. "Extendible Hashing-A fast access method fordynamic files", ACM TODS, Vol.4, #3 Sept. 79 ) Basic idea : Allownumber of buckets in a certain key range to vary dynamicallybased on actual demand. Depth(d): the number of the most significant bits in f(k) that will be takento determine a directory entry.d- total number of entries in directory 2 Each entry in directory points to a bucket. Each bucket x has a local depth l x d When a bucket x overflows -- increasing l xif l x d -- double directory (i.e. increasing d).Wei-Pang Yang, Information Management, NDHUUnit 11 File Organization and Access Methods11-38

Extendible Hashing: ExampleDiskMain MemoryDepthData pagesDirectory32local depth:000 pointer00***001 pointer010 pointer011 pointer100 pointer101 pointer110 pointerFirst two bits ofpseudokey 003local depth:‧010***x3k1 3956;h(k1) 56 0111000k2 4187;h(k2) 87 1010111.Wei-Pang Yang, Information Management, NDHUlocal depth:k1111 pointerFirst three bits ofpseudokey 010First three bits ofpseudokey 011011***‧1local depth:k2h(k) 1***xUnit 11 File Organization and Access MethodsFirst bit ofpseudokey 1h(k) 1***.11-39

Extendible Hashing: Example (cont.)****0x1main00**x 0***01**y 1***disk1***diskmainData pagesDirectoryDepth g3000 pointer001 pointer010 pointer011 pointer100 pointer101 pointer110 pointer111 pointerData pagesDirectory2h(-) 00.Depth g000 pointer001 pointer3h(-) 010. 010 pointer011 pointer100 pointer3k h(-) 011. 101 pointer110 pointer111 pointer1xold y h(-) 1.3h(-) 00.‧H(k) 56 0111000H(x) 101101H(y) 111011Wei-Pang Yang, Information Management, NDHU23‧h(-) 010.3h(-) 011.**2*Unit 11 File Organization and Access Methodsmm'x h(-) 10.old2y h(-) 11.new11-40

****0x1diskmainDirDeect3pth pointerory000g pointer001010 pointer011 pointer100 pointer101 pointer110 pointer111 pointerxy00**01**1***disk0***1***mainDataDir2 pagesect3h(-) 00. Depth pointerory000g pointer3001h(-) 010.010 pointer ‧3k1xold yData2 pagesh(-) 00.3m‧ m'3011 pointerh(-) 011.100 pointer101 pointer*110 pointer *h(-) 1. 111 pointerh() 01h(-) 011.0.2x*h(-) 10.old2H(k) 56 0111000H(x) 101101H(y) 111011yh(-) 11.newdiskmainData pagesDirectoryDepth g0000 pointer0001 pointer0010 pointer0011 pointer0100 pointer0101 pointer0110 pointer0111 pointer1000 pointer1001 pointer1010 pointer1011 pointer1100 pointer1101 pointer1110 pointer1111 pointerWei-Pang Yang, Information Management, NDHU324h(-) 00.‧‧··4m h(-) 0100.3kh(-) 011.23x h(-) 10.2y****Unit 11newh(-) 11.4m' h(-) 0101.·· and Access MethodsFile Organization11-41

Our Research ResultsConcurrent Operations in Extendible Hashing[VLDB86]Meichun HsuWei-Pang YangHarvard UniversityCambridge MA 02138AbstractAn a

Wei-Pang Yang, Information Management, NDHU Unit 11 File Organization and Access Methods 11-8 The Role of Access Method in DBMS Query in SQL: SELECT CUSTOMER. NAME FROM CUSTOMER, INVOICE WHERE REGION 'N.Y.' AND AMOUNT 10000 AND CUSTOMER.C# INVOICE.C# Internal Form