Hashing & Hash TablesHashing & Hash Tables

Transcription

Hashing & Hash TablesCpt S 223. School of EECS, WSU1

Overview Hash Table Data Structure : Purpose To support insertion, deletion and search inaverage-case constantt t titime Hash function Assumption: Order of elements irrelevant data structure *not* useful for if you want tomaintaini t i andd retrievetisome kindki d off an orderd off thethelementsHash[ “string key”] integer valueHash table ADT I lImplementations,t tiAnalysis,A l i ApplicationsA li tiCpt S 223. School of EECS, WSU2

Hash table: Main componentskeyvalue“john”TableSizeeHash indexh(“john”)keyHashfunctionHow to determine ?Cpt S 223. School of EECS, WSUHash table(implemented as a vector)3

Hash Table Hash table is an array of fixedsize TableSizekeyElement valueArray elements indexed by akey, which is mapped to anarray index (0 TableSize-1)Mapping (hash function) hfrom key to index E.g., h(“john”) 3Cpt S 223. School of EECS, WSU4

Hash Table Operations Insert T [h(“john”)] “john”,25000 DatarecordDelete Hash keyHashffunctiontiT [h([h(“john”)]john )] NULLSearch T [h(“john”)] returns theelement hashed for “john”What happens if h(“john”)h( john ) h(h(“joe”)joe ) ?“collision”Cpt S 223. School of EECS, WSU5

Factors affecting Hash TableDesign Hash function Table size Usuallyy fixed at the startCollision handling schemeCpt S 223. School of EECS, WSU6

Hash Function A hash function is one which maps anelement’s key into a valid hash table index h(key) hash table indexNote that this is (slightly) different from saying:h(string) int Because the key can be of any type E.g., “h(int) int” is also a hash function!But also note that anyy typeyp can be converted intoan equivalent string formCpt S 223. School of EECS, WSU7

h(key) hash table indexHash Function Properties A hash function maps key to integer Constraint: Integer should be between[0, TableSize-1]A hash function can result in a many-to-one mapping(causing collision) Collision occurs when hash function maps two or more keysto same array indexCollisionsClli icannott beb avoidedid d butb t itsit chanceshcan bebreduced using a “good” hash functionCpt S 223. School of EECS, WSU8

h(key) hash table indexHash Function Properties A “good” hash function should have theproperties:1.Reduced chance of collisionDifferent keys should ideally map to differentindicesDistribute keys uniformly over table2.Should be fast to computeCpt S 223. School of EECS, WSU9

Hash Function - Effective useof table size Simple hash function (assume integer keys) h(Key) Key mod TableSizeFor random keys, h() distributes keys evenlyover table What if TableSize 100 and keys are ALLmultiples of 10?Better if TableSize is a prime numberCpt S 223. School of EECS, WSU10

Different Ways to Design aHash Function for String KeysA very simple function to map strings to integers: Add up character ASCII values (0-255) to produceinteger keys E.g., “abcd” 97 98 99 100 394 h(“abcd”) 394 % TableSizePotential problems: Anagrams will map to the same index Small strings may not use all of table h(“abcd”) h(“dbac”)Strlen(S) * 255 TableSizeTime proportional to length of the stringCpt S 223. School of EECS, WSU11

Different Ways to Design aHash Function for String Keys Approach 2 Treat first 3 characters of string as base-27 integer (26letters plus space) Key S[0] (27 * S[1]) (272 * S[2])Better than approach 1 because ?Potential problems: Assumes first 3 characters randomly distributed Not true of EnglishAppleApplyppAppointmentApricotcollisionCpt S 223. School of EECS, WSU12

Different Ways to Design aHash Function for String Keys Approach 3Use all N characters of string as anN-digitg base-K number Choose K to be prime numberlarger than number of differentdigits (characters) I.e., K 29, 31, 37If L length of string S, then L 1 h( S ) S [ L i 1] 37 i mod TableSize i 0 Problems: Use Horner’s rule to compute h(S)potential overflow Li it L forLimitf longlstringstilarger runtime Cpt S 223. School of EECS, WSU13

“Collision resolution techniques”qTechniquesTh itot DealD l withithCollisionsChainingOpen addressingDouble hashingEtc.EtcCpt S 223. School of EECS, WSU14

Resolving Collisions What happens when h(k1) h(k2)? collision !Collision resolution strategies Chaining Store colliding keys in a linked list at the samehash table indexOpen addressing Store collidingg keysy elsewhere in the tableCpt S 223. School of EECS, WSU15

Ch i iChainingCollision resolution technique #1Cpt S 223. School of EECS, WSU16

Chaining strategy: maintains a linked list atevery hash index for collided elementsInsertion sequence: { 0 1 4 9 16 25 36 49 64 81 } Hash table T is a vector oflinked lists Insert element at the head(as shown here) or at the tailKey k is stored in list atT[h(k)]E.g.,g TableSize 10 h(k) k mod 10Insert first 10 perfectsquaresCpt S 223. School of EECS, WSU17

Implementation of ChainingHash TableVector of linked lists(this is the mainhashtable)Current #elements inthe hashtableHash functions fori tintegersandd stringtikeysCpt S 223. School of EECS, WSU18

Implementation of ChainingHash TableThis is the hashtable’scurrent capacity(aka. “table size”)This is the hash tableindex for the elementxCpt S 223. School of EECS, WSU19

Duplicate checkLater, but essentiallyresizes the hashtable if itsgetting crowdedCpt S 223. School of EECS, WSU20

Each of theseoperations takes timelinear in the length ofthe list at the hashedindex locationCpt S 223. School of EECS, WSU21

All hash objects mustdefine and ! operators.Hash function tohandle Employeeobject typeCpt S 223. School of EECS, WSU22

Collision Resolution byChaining: Analysis Load factor λ of a hash table T is defined as follows: N number of elements in TM sizei off Tλ N/M i.e., λ is the average length of a chainUnsuccessful search time: O(λ) (“current size”)(“t bl size”)(“tablei ”)(“ load factor”)Same for insert timeSuccessful search time: O(λ/2)Ideally, want λ 1 (not a function of N)Cpt S 223. School of EECS, WSU23

Potential disadvantages ofChainingLinked lists could get long Especially when N approaches M LongerLlinkedli k d listsli t couldld negativelyti l impactitperformanceMore memory because of pointersAbsolute worst-case (even if N M): All N elements in one linked list! Typically the result of a bad hash functionCpt S 223. School of EECS, WSU24

OOpenAddressingAddiCollision resolution technique #2Cpt S 223. School of EECS, WSU25

Collision Resolution byOpen AddressingAn “inplace” approachWhen a collision occurs, look elsewhere in thetable for an empty slot Advantages over chaining No need for list structuresNoo needeed too allocate/deallocatea oca e/dea oca e memorye o y duduringginsertion/deletion (slow)Disadvantages Slower insertion – May need several attempts to find anempty slotTable needs to be bigger (than chaining-based table) toachieve averageaverage-casecase constantconstant-timetime performance Load factor λ 0.5Cpt S 223. School of EECS, WSU26

Collision Resolution byOpen Addressing A “Probe sequence” is a sequence of slots in hash table whilesearching for an element x h0(x),(x) h1(x),(x) h2(x),(x) Needs to visit each slot exactly once Needs to be repeatable (so we can find/delete what we’veinserted)Hash function hi(x) (h(x) f(i)) mod TableSizef(0) 0 position for the 0th probef(i)( ) is “the distance to be traveled relative to the 0th pprobeposition, during the ith probe”.Cpt S 223. School of EECS, WSU27

Linear Probingi probethindex Linear probing:0th probeboccupied1stoccupied2nd probeoccupiedprobe if(i) is a linear function of i,E.g., f(i) ihi(x) (h(x) i) mod TableSize3rd probe i0th probeindexProbe sequence: 0, 1, 2, 3, 4, unoccupiedPopulate x hereContinue until an empty slot is found#failed probes is a measure of performanceCpt S 223. School of EECS, WSU28

ith probeindex 0th probeindex iLinear Probing f(i) is a linear function of i, e.g., f(i) i hi(x) (h(x) i) mod TableSize Probe sequence: 0, 1, 2, 3, 4, Example: h(x) x mod TableSize h0(89)h0(18)h0(49)h1(49) (h(89) f(0)) mod 10 9 (h(18) f(0)) mod 10 8 (h(49) f(0)) mod 10 9 (X) (h(49) f(1)) mod 10 (h(49) 1 ) mod 10 0Cpt S 223. School of EECS, WSU29

Linear Probing ExampleIInsertt sequence: 89,89 1818, 4949, 5858, 69#unsuccessfulprobes:00time1Cpt S 223. School of EECS, WSU337total30

Linear Probing: IssuesProbe sequences can get longer with timePrimary clustering Keys tend to cluster in one part of tableKeys that hash into cluster will be added tothe end of the cluster (making it evenbigger)Side effect: Other keys could also getaffected if mapping to a crowdedneighborhoodCpt S 223. School of EECS, WSU31

Linear Probing: Analysis Expected number ofprobes for insertion orunsuccessful search1 1 1 2 2 (1 ) Expected number ofprobes for successfulsearch1 1 1 2 (1 ) Example (λ 0.5) Insert / unsuccessfulsearch Successful search 2.5 probes1 5 probes1.5bExample (λ 0.9) Insert / unsuccessfulsearch 50.5 probesSuccessful search Cpt S 223. School of EECS, WSU5.5 probes32

Random Probing: Analysis Random probing does not suffer fromclusteringExpected number of probes for insertion orunsuccessful search:11 Example lln1 λ 0.5: 1.4 probesλ 0.9: 2.6 probesCpt S 223. School of EECS, WSU33

# probeesLinear vs. Random ProbingU - unsuccessful searchS - successful searchI - insertLinear probingRandom probinggoodbadLoad factor λCpt S 223. School of EECS, WSU34

Quadratic ProbingQuadratic probing:occupiedoccupied0th probe1st probe2nd probe Avoids primary clusteringf(i) is quadratic in ie.g., f(i) i2hi(x) (h(x) i2) modTableSizeoccupied3rd probe Probe sequence:q 0, 1, 4, 9, 16, i occupiedContinue until an empty slot is found#failed probes is a measure of performanceCpt S 223. School of EECS, WSU35

Quadratic Probing Avoids primary clusteringf(i) is quadratic in I,I e.g.,e g f(i) i2 hi(x) (h(x) i2) mod TableSize Probe sequence: 0, 0 1, 1 4, 4 9, 9 16, 16 Example: h0(58) (h(58) f(0))(h(58) f(0)) modd 10 8 (X)h1(58) (h(58) f(1)) mod 10 9 (X)h2(58)( 8) (h(58) f(2))(h( 8) f(2)) modd 100 2Cpt S 223. School of EECS, WSU36

Q) Delete(49), Find(69) - is there a problem?Quadratic Probing ExampleIInsertt sequence: 89,89 1818, 4949, 5858, 69 12 12 22 22 02 02 02 02#unsuccessfulprobes:001Cpt S 223. School of EECS, WSU 122 0225total37

Quadratic Probing: Analysis Difficult to analyzeTheorem 5.1 New element can always be inserted into a tablethat is at least half empty and TableSize is primeOtherwise, may never find an empty slot,even is one existsEnsure table never gets half full If close, then expand itCpt S 223. School of EECS, WSU38

Quadratic Probing May cause “secondary clustering” Deletion Emptyingp y g slots can break probepsequenceqandcould cause find stop prematurelyLazy deletion DifferentiateDiffti t bbetweentemptyt anddddeletedl t d slotl tWhen finding skip and continue beyond deleted slots If you hit a non-deleted empty slot, then stop find procedurereturning “not found”May need compactionat some timeCpt S 223. School of EECS, WSU39

Quadratic Probing:ImplementationCpt S 223. School of EECS, WSU40

Quadratic Probing:ImplementationLazy deletionCpt S 223. School of EECS, WSU41

Quadratic Probing:ImplementationEnsure tablesize is primeCpt S 223. School of EECS, WSU42

Quadratic Probing:ImplementationFindSkip DELETED;No duplicatesQuadratic probesequence (really)Cpt S 223. School of EECS, WSU43

Quadratic Probing:ImplementationInsertNo duplicatesRemoveNo deallocationneededCpt S 223. School of EECS, WSU44

Double Hashing: keep twohash functions h1 and h2 Use a second hash function for all tries Iother than 0:f(i) i * h2(x)Good choices for h2(x) ? Should never evaluate to 0h2(x) R – (x mod R) R is prime number less than TableSizeP iPreviousexamplel withith R 7R 7 h0(49) (h(49) f(0)) mod 10 9 (X)h1(49) (h(49) 1*(7 – 49 mod 7)) mod 10 6Cpt S 223. School of EECS, WSUf(1)45

Double Hashing ExampleCpt S 223. School of EECS, WSU46

Double Hashing: Analysis Imperative that TableSize is prime E g insert 23 into previous tableE.g.,Empirical tests show double hashingclose to random hashingExtra hash function takes extra time tocomputetCpt S 223. School of EECS, WSU47

Probing Techniques - reviewQuadratic probing:0th tryt1st try2nd tryi0th try1st try0th tryi2nd tryttry3rd try 3rd iDouble hashing*:Cpt S 223. School of EECS, WSU2nd try1stt try3rd try Linear probing:*(determined by a secondhash function)48

Rehashing Increases the size of the hash table when load factorbecomes “too high” (defined by a cutoff) Anticipating that prob(collisions) would becomehigherTypically expand the table to twice its size (but stillprime)Need to reinsert all existing elements into new hashtableCpt S 223. School of EECS, WSU49

Rehashing Exampleh(x) x mod 7λ 0.570 57h(x) x mod 17λ 0.290 29Insert 23Rehashingλ 0.71Cpt S 223. School of EECS, WSU50

Rehashing Analysis Rehashing takes time to do N insertionsTherefore should do it infrequentlySpecifically Mustt hMhave beenbN/2 iinsertionstisinceilastl trehashAAmortizingti i theth O(N) costt over theth N/2 prioriinsertions yields only constant additionaltime per insertionCpt S 223. School of EECS, WSU51

Rehashing Implementation When to rehash When load factor reaches some threshold(e.g,. λ 0.5), ORWhen an insertion failsApplies across collision handlingschemesCpt S 223. School of EECS, WSU52

Rehashing for ChainingCpt S 223. School of EECS, WSU53

Rehashing forQuadratic ProbingCpt S 223. School of EECS, WSU54

Hash Tables in C STL Hash tables not part of the C Standard LibrarySome implementations of STL havehash tables (e.g.,(e g SGISGI’ss STL) hash sethash maphash mapCpt S 223. School of EECS, WSU55

Hash Set in STL#include hash hash set set struct eqstr{bool operator()(const char* s1, const char* s2) const{return strcmp(s1, s2) 0;}};void lookup(const hash set const char*, hash const char* , eqstr & Set,const char* word){hash set const char*, hash const char* , eqstr ::const iterator it Set.find(word);cout word ": " (it ! Set.end()Set end() ? "present" : "not present") endl;}KeyHash fnKey equality testint main(){hash set const char*, hash const char* , eqstr Set;Set.insert("kiwi");lookup(Set, “kiwi");}Cpt S 223. School of EECS, WSU56

Hash Map in STL#i l d h#include hash map h struct eqstr{bool operator() (const char* s1, const char* s2) const{return strcmp(s1, s2) 0;}};KeyDataHash fnKey equality testint main(){hash map const char*, int, hash const char* , eqstr months;Internallymonths["january"] 31;treatedmonths["february"] 28;like insert (or overwritemonths["december"] 31;if keycout “january - " months[“january"] endl;already present)}Cpt S 223. School of EECS, WSU57

Problem with Large Tables What if hash table is too large to storein main memory?Solution: Store hash table on disk Minimize disk accessesBut Collisionsllrequire diskd k accessesRehashing requires a lot of disk accessesSolution: Extendible HashingCpt S 223. School of EECS, WSU58

Hash Table Applications Symbol table in compilersAccessing tree or graph nodes by name E.g.,g , cityc ty namesa es in GoogGooglee mapsapsMaintaining a transposition table in games Remember previous game situations and the move taken(avoid rere-computation)computation)Dictionary lookups Spelling checkers NaturalN t l llanguage understandingd t di (word(d sense))Heavily used in text processing languages E.g., Perl, Python, etc.Cpt S 223. School of EECS, WSU59

Summary Hash tables support fast insert andsearch O(1) average case performanceDeletion possiblepossible, but degradesperformanceNot suited if ordering of elements isimportantMany applicationsCpt S 223. School of EECS, WSU60

Points to remember - Hashtables Table size primeTable size much larger than number of inputs(to maintain λ closer to 0 or 0.5)Tradeoffs between chaining vs. probingC lli i chancesCollisionhdecreasedini thishi order:dlinear probing quadratic probing {random probing, double hashing}Rehashing required to resize hash table at atime when λ exceeds 0.5Good for searching. Not good if there is someCpt S data.223. School of EECS, WSU61order implied by

h(key) hash table index Hash Function Properties A hash function maps key to integer Constraint: Integer should be between [0, TableSize-1] A hash function can result in a many-to-one mapping (causing collision)(causing collision) Collision occurs when hash function maps two or more keys to same array index C lli i t b id d b t it h bCollisions cannot be avoided but its chances can be