Functions And Operators For Executing Similarity Queries - PGCon

Transcription

IntroductionPostgreSQLpg similaritypg similarityfunctions and operators for executing similarity queriesEuler Taveira de Oliveira4LinuxPostgreSQL Brasil21 de maio de 2009Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityAgenda1Introduction2PostgreSQL3pg similarityEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityApproximate Queriesdatabase works with a exact query modelkey valueEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityApproximate Queriesdatabase works with a exact query modelkey valuewe want to be able to do some almost exact querieskey valueEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityApproximate Queriesdatabase works with a exact query modelkey valuewe want to be able to do some almost exact querieskey valueand to be able to define how flexible it issimilarity: φ thresholdEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityProblemshumanmistypingimprecise typing (abbreviation, omission, .)insufficient informationEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityProblemshumanmistypingimprecise typing (abbreviation, omission, .)insufficient informationapplicationapplication with problems (truncation, abbreviation – withoutuser asks for)data model failedinexistent or imprecise constraintsinexistent foreign keysEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityProblemshumanmistypingimprecise typing (abbreviation, omission, .)insufficient informationapplicationapplication with problems (truncation, abbreviation – withoutuser asks for)data model failedinexistent or imprecise constraintsinexistent foreign keysobsolencereal world is dynamic!Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityConsequencespoor data quality”2% of records in a customer file become obsolete in onemonth because customers die, divorce, marry and move[DWI02]Bill Clinton, William Jefferson Clinton, and William JeffersonBlythe III: same person?Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityConsequencespoor data quality”2% of records in a customer file become obsolete in onemonth because customers die, divorce, marry and move[DWI02]Bill Clinton, William Jefferson Clinton, and William JeffersonBlythe III: same person?high financial impact”poor data quality cost USA businesses a staggering USD 611bi/year in postage, printing and staff overhead” [DWI02]”Wrong price data in retail databases costs Americanconsumers USD 2.5 billion in annual overcharges” [E00]Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similaritySolution?Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityAgenda1Introduction2PostgreSQL3pg similarityEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityWhat is available?regular expressionfuzzystrmatchtext searchEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityProblems?inversion (euler taveira x taveira euler)mistyping (euler x euller)stopwords (euler de oliveira x euler oliveira)stemming (similarity x similarities)flexibility (change tokenizer, threshold, etc)Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityAgenda1Introduction2PostgreSQL3pg similarityEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityMotivationIdeaslow response timedo not use pre-processing step (online technique)do not use auxiliary structure (catalog or auxiliary table)maintain result qualityextensibleEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityMotivationIdeaslow response timedo not use pre-processing step (online technique)do not use auxiliary structure (catalog or auxiliary table)maintain result qualityextensibleDesignsimilarity functionsoperatorsauxiliary functionsEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityInstallation cd pg similarity PATH /pg/bin: PATH USE PGXS 1 gmake install /pg/bin/psql -f /pg/share/contrib/pg similarity.sql mydbCREATE FUNCTIONCREATE FUNCTIONCREATE FUNCTION.CREATE OPERATOR.Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similaritySimilarity Functionsis a function that calculates how similar are two datadomain-dependent functions (almost all of them)some functions need parsing step (tokenization: space,non-alphanumeric, and eanQ-GramMonge-ElkanLevenshteinEuler Taveira de PGCon 2009

IntroductionPostgreSQLpg similarityExample: 433323r5544432min(a[x][y 1] i, a[x 1][y ] d, a[x 1][y 1] s)insertion (i), deletion (d), and substitution (s) cost 1Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityNormalized x Unnormalizedit is easier to choose a threshold if you know the limitsalmost all of the functions return unnormalized results!euler # select lev(’euler’, ’heuser’); -- defaultlev---------0.666667(1 row)euler # select lev(’euler’, ’heuser’); -- hey, that’s it!lev----2(1 row)Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityOperatorsSQL syntax sugar for similarity functionseuler # CREATE TABLE foo (a text);CREATE TABLEeuler # INSERT INTO foo VALUES(’Euler T. de Oliveira’),euler-# (’Euler Taveira de Oliveira’);INSERT 0 2euler # SELECT * FROM foo WHERE a @@ ’Euler Taveira’;a--------------------------Euler Taveira de OliveiraEuler T. de Oliveira(2 rows)Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityOperators (cont’d)euler # SELECT set threshold(’jarowinkler’, 0.9);set threshold--------------0.9(1 registro)euler # SELECT * FROM foo WHERE a @@ ’Euler Taveira’;a--------------------------Euler Taveira de Oliveira(1 row)Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityAuxiliary Functionsget isnormalized(func): values are true and falseset isnormalized(func, value): switches between normalizedand unnormalized resultsget threshold(func): values are between 0 and 1set threshold(func, value): values greater than value arereturnedget tokenizer(func): values are spaces, nonalnum, andn-gramset tokenizer(func, value): changes tokenization functionfor some algorithmsEuler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityAuxiliary Functions: exampleeuler # select get threshold(’jarowinkler’);get threshold--------------0.7(1 row)euler # select set threshold(’jarowinkler’, 0.85);set threshold--------------0.85(1 row)Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityTODOsome more similarity functions (listed in TODO)add some custom guc variables (replace get * and set *)UTF-8 aware?add some selectivity estimator functions for operatorswebsitewrite some docs (the source-code has examples how eachfunction works)your idea?Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg pgsimilarity/E00 English, L. Information Quality Management: The Next Frontier. DMReview Magazine, April 2000.http://www.dmreview.com/article sub.cfm?articleId 2073DWI02 Data Warehousing Institute report 2002.Euler Taveira de OliveiraPGCon 2009

IntroductionPostgreSQLpg similarityQuestions?Euler Taveira de ler Taveira de OliveiraPGCon 2009

database works with a exact query model key value we want to be able to do some almost exact queries key ˇ value . (change tokenizer, threshold, etc) Euler Taveira de Oliveira PGCon 2009. Introduction PostgreSQL pg similarity . Dice Monge-Elkan Jaccard Hamming Levenshtein Needleman-Wunsch Euler Taveira de Oliveira PGCon 2009.