Transcription
IntroductionPostgreSQLpg similaritypg similarityfunctions and operators for executing similarity queriesEuler Taveira de Oliveira4LinuxPostgreSQL Brasil21 de maio de 2009Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityAgenda1Introduction2PostgreSQL3pg similarityEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityApproximate Queriesdatabase works with a exact query modelkey valueEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityApproximate Queriesdatabase works with a exact query modelkey valuewe want to be able to do some almost exact querieskey valueEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityApproximate Queriesdatabase works with a exact query modelkey valuewe want to be able to do some almost exact querieskey valueand to be able to define how flexible it issimilarity: φ thresholdEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityProblemshumanmistypingimprecise typing (abbreviation, omission, .)insufficient informationEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityProblemshumanmistypingimprecise typing (abbreviation, omission, .)insufficient informationapplicationapplication with problems (truncation, abbreviation – withoutuser asks for)data model failedinexistent or imprecise constraintsinexistent foreign keysEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityProblemshumanmistypingimprecise typing (abbreviation, omission, .)insufficient informationapplicationapplication with problems (truncation, abbreviation – withoutuser asks for)data model failedinexistent or imprecise constraintsinexistent foreign keysobsolencereal world is dynamic!Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityConsequencespoor data quality”2% of records in a customer file become obsolete in onemonth because customers die, divorce, marry and move[DWI02]Bill Clinton, William Jefferson Clinton, and William JeffersonBlythe III: same person?Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityConsequencespoor data quality”2% of records in a customer file become obsolete in onemonth because customers die, divorce, marry and move[DWI02]Bill Clinton, William Jefferson Clinton, and William JeffersonBlythe III: same person?high financial impact”poor data quality cost USA businesses a staggering USD 611bi/year in postage, printing and staff overhead” [DWI02]”Wrong price data in retail databases costs Americanconsumers USD 2.5 billion in annual overcharges” [E00]Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similaritySolution?Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityAgenda1Introduction2PostgreSQL3pg similarityEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityWhat is available?regular expressionfuzzystrmatchtext searchEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityProblems?inversion (euler taveira x taveira euler)mistyping (euler x euller)stopwords (euler de oliveira x euler oliveira)stemming (similarity x similarities)flexibility (change tokenizer, threshold, etc)Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityAgenda1Introduction2PostgreSQL3pg similarityEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityMotivationIdeaslow response timedo not use pre-processing step (online technique)do not use auxiliary structure (catalog or auxiliary table)maintain result qualityextensibleEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityMotivationIdeaslow response timedo not use pre-processing step (online technique)do not use auxiliary structure (catalog or auxiliary table)maintain result qualityextensibleDesignsimilarity functionsoperatorsauxiliary functionsEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityInstallation cd pg similarity PATH /pg/bin: PATH USE PGXS 1 gmake install /pg/bin/psql -f /pg/share/contrib/pg similarity.sql mydbCREATE FUNCTIONCREATE FUNCTIONCREATE FUNCTION.CREATE OPERATOR.Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similaritySimilarity Functionsis a function that calculates how similar are two datadomain-dependent functions (almost all of them)some functions need parsing step (tokenization: space,non-alphanumeric, and eanQ-GramMonge-ElkanLevenshteinEuler Taveira de PGCon 2009
IntroductionPostgreSQLpg similarityExample: 433323r5544432min(a[x][y 1] i, a[x 1][y ] d, a[x 1][y 1] s)insertion (i), deletion (d), and substitution (s) cost 1Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityNormalized x Unnormalizedit is easier to choose a threshold if you know the limitsalmost all of the functions return unnormalized results!euler # select lev(’euler’, ’heuser’); -- defaultlev---------0.666667(1 row)euler # select lev(’euler’, ’heuser’); -- hey, that’s it!lev----2(1 row)Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityOperatorsSQL syntax sugar for similarity functionseuler # CREATE TABLE foo (a text);CREATE TABLEeuler # INSERT INTO foo VALUES(’Euler T. de Oliveira’),euler-# (’Euler Taveira de Oliveira’);INSERT 0 2euler # SELECT * FROM foo WHERE a @@ ’Euler Taveira’;a--------------------------Euler Taveira de OliveiraEuler T. de Oliveira(2 rows)Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityOperators (cont’d)euler # SELECT set threshold(’jarowinkler’, 0.9);set threshold--------------0.9(1 registro)euler # SELECT * FROM foo WHERE a @@ ’Euler Taveira’;a--------------------------Euler Taveira de Oliveira(1 row)Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityAuxiliary Functionsget isnormalized(func): values are true and falseset isnormalized(func, value): switches between normalizedand unnormalized resultsget threshold(func): values are between 0 and 1set threshold(func, value): values greater than value arereturnedget tokenizer(func): values are spaces, nonalnum, andn-gramset tokenizer(func, value): changes tokenization functionfor some algorithmsEuler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityAuxiliary Functions: exampleeuler # select get threshold(’jarowinkler’);get threshold--------------0.7(1 row)euler # select set threshold(’jarowinkler’, 0.85);set threshold--------------0.85(1 row)Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityTODOsome more similarity functions (listed in TODO)add some custom guc variables (replace get * and set *)UTF-8 aware?add some selectivity estimator functions for operatorswebsitewrite some docs (the source-code has examples how eachfunction works)your idea?Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg pgsimilarity/E00 English, L. Information Quality Management: The Next Frontier. DMReview Magazine, April 2000.http://www.dmreview.com/article sub.cfm?articleId 2073DWI02 Data Warehousing Institute report 2002.Euler Taveira de OliveiraPGCon 2009
IntroductionPostgreSQLpg similarityQuestions?Euler Taveira de ler Taveira de OliveiraPGCon 2009
database works with a exact query model key value we want to be able to do some almost exact queries key ˇ value . (change tokenizer, threshold, etc) Euler Taveira de Oliveira PGCon 2009. Introduction PostgreSQL pg similarity . Dice Monge-Elkan Jaccard Hamming Levenshtein Needleman-Wunsch Euler Taveira de Oliveira PGCon 2009.