Document Similarity In Information Retrieval

Transcription

Document Similarity in Information Retrieval
Mausam
(Based on slides of W. Arms, Thomas Hofmann, Ata Kaban, Melanie Martin)

Standard Web Search Engine Architecture
- Crawl the web; store documents, check for duplicates, extract links (DocIds)
- Create an inverted index (held on search engine servers)
- User query → search engine → show results to user
[Slide adapted from Marti Hearst / UC Berkeley]

Indexing Subsystem
Documents → text → assign document IDs → break into tokens → term weighting* → terms with weights → Index database
(document numbers and *field numbers accompany the terms; * = optional operation)

Search Subsystem
query → parse query → query tokens → stop list* → stemmed terms → Boolean operations* → retrieved document set → ranking* → ranked/relevant document set
(* = optional operation; matching is done against the Index database)

Terms vs. Tokens
Terms are what results after tokenization and linguistic processing.
– Examples:
  - knowledge → knowledg
  - The → the
  - Removal of stop words

Matching/Ranking of Textual Documents
Major categories of methods:
1. Exact matching (Boolean)
2. Ranking by similarity to query (vector space model)
3. Ranking of matches by importance of documents (PageRank)
4. Combination methods
What happens in major search engines (Google rank)

Vector Representation of Documents and Queries
Why do this?
- Represents a large space for documents
- Compare documents, and documents with queries
- Retrieve and rank documents with regard to a specific query
- Enables methods of similarity
All search engines do this.

Boolean Queries
- A document is relevant to a query if the query itself is in the document.
  – Query "blue and red" brings back all documents with blue and red in them.
- A document is either relevant or not relevant to the query.
- What about relevance ranking – partial relevance? The vector model deals with this.

Similarity Measures and Relevance
- Retrieve the most similar documents to a query
- Equate similarity to relevance
  – Most similar are the most relevant
- This measure is one of "text similarity"
  – The matching of text or words

Similarity Ranking Methods
Query + Index database + Documents → mechanism for determining the similarity of the query to the document → set of documents ranked by how similar they are to the query

Term Similarity: Example
Problem: Given two text documents, how similar are they? [Methods that measure similarity do not assume exact matches.]
Example (assume tokens converted to terms). Here are three documents. How similar are they?
  d1: ant ant bee
  d2: dog bee dog hog dog ant dog
  d3: cat gnu dog eel fox
Documents can be any length from one word to thousands. A query is a special type of document.

Bag of Words View of a Document
Tokens are extracted from text and thrown into a "bag" without order and labeled by document.
Thus the doc
– John is quicker than Mary.
is indistinguishable from the doc
– Mary is quicker than John.

Term Similarity: Basic Concept
Two documents are similar if they contain some of the same terms.
Possible measures of similarity might take into consideration:
(a) The lengths of the documents
(b) The number of terms in common
(c) Whether the terms are common or unusual
(d) How many times each term appears

Term Vector Space
Term vector space: an n-dimensional space, where n is the number of different terms/tokens used to index a set of documents.
Vector: document i, d_i, is represented by a vector. Its magnitude in dimension j is w_ij, where:
  w_ij > 0 if term j occurs in document i
  w_ij = 0 otherwise
w_ij is the weight of term j in document i.

A Document Represented in a 3-Dimensional Term Vector Space
[Figure: document vector d1 plotted in the space spanned by terms t1, t2, t3, with components t11, t12, t13.]

Basic Method: Incidence Matrix (Binary Weighting)
document  text                          terms
d1        ant ant bee                   ant bee
d2        dog bee dog hog dog ant dog   ant bee dog hog
d3        cat gnu dog eel fox           cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
d1     1    1
d2     1    1         1                   1
d3               1    1    1    1    1

3 vectors in an 8-dimensional term vector space.
Weights: t_ij = 1 if document i contains term j, and zero otherwise.
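The incidence matrix above can be built in a few lines; a minimal Python sketch using the example documents (document names and layout are mine):

```python
# Binary incidence matrix for the three example documents.
# Each document becomes a 0/1 vector over the sorted vocabulary.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Vocabulary: all distinct terms across the collection, sorted.
vocab = sorted({term for text in docs.values() for term in text.split()})

# t_ij = 1 if document i contains term j, else 0.
incidence = {
    doc_id: [1 if term in text.split() else 0 for term in vocab]
    for doc_id, text in docs.items()
}

print(vocab)            # ['ant', 'bee', 'cat', 'dog', 'eel', 'fox', 'gnu', 'hog']
print(incidence["d1"])  # [1, 1, 0, 0, 0, 0, 0, 0]
```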

Basic Vector Space Methods: Similarity Between 2 Documents
The similarity between two documents is a function of the angle between their vectors in the term vector space.
[Figure: vectors d1 and d2 in the space spanned by t1, t2, t3, separated by an angle.]

Vector Space Revision
x = (x1, x2, x3, ..., xn) is a vector in an n-dimensional vector space.
Length of x is given by (extension of Pythagoras's theorem):
  |x|^2 = x1^2 + x2^2 + x3^2 + ... + xn^2
  |x| = (x1^2 + x2^2 + x3^2 + ... + xn^2)^(1/2)
If x1 and x2 are vectors:
Inner product (or dot product) is given by
  x1.x2 = x11*x21 + x12*x22 + x13*x23 + ... + x1n*x2n
Cosine of the angle between the vectors x1 and x2:
  cos(θ) = (x1.x2) / (|x1| |x2|)

Document Similarity
d = (x1, x2, x3, ..., xn) is a vector in an n-dimensional vector space.
Length of d is given by (extension of Pythagoras's theorem):
  |d|^2 = x1^2 + x2^2 + x3^2 + ... + xn^2
  |d| = (x1^2 + x2^2 + x3^2 + ... + xn^2)^(1/2)
If d1 and d2 are document vectors:
Inner product (or dot product) is given by
  d1.d2 = x11*x21 + x12*x22 + x13*x23 + ... + x1n*x2n
The cosine of the angle between d1 and d2 determines document similarity:
  cos(θ) = (d1.d2) / (|d1| |d2|)
cos(θ) = 1: documents exactly the same; cos(θ) = 0: totally different.
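The cosine formula translates directly into code; a small sketch (the vectors shown are the binary d1 and d2 from the running example, over the vocabulary ant bee cat dog eel fox gnu hog):

```python
import math

def cosine(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    len1 = math.sqrt(sum(a * a for a in d1))
    len2 = math.sqrt(sum(b * b for b in d2))
    return dot / (len1 * len2)

# Binary vectors for d1 and d2 from the running example.
d1 = [1, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 1, 0, 0, 0, 1]
print(round(cosine(d1, d1), 2))  # 1.0  -> identical documents
print(round(cosine(d1, d2), 2))  # 0.71
```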

Example 1: No Weighting
      ant  bee  cat  dog  eel  fox  gnu  hog   length
d1     1    1                                   √2
d2     1    1         1                   1     √4
d3               1    1    1    1    1          √5

Ex: length of d1 = (1^2 + 1^2)^(1/2) = √2

Example 1 (continued)
Similarity of documents in example:
       d1    d2    d3
d1    1     0.71  0
d2    0.71  1     0.22
d3    0     0.22  1

Digression: Terminology
- WARNING: In a lot of IR literature, "frequency" is used to mean "count".
  – Thus term frequency in IR literature is used to mean the number of occurrences in a doc
  – Not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer
  – In saying term frequency we mean the number of occurrences of a term in a document.

Example 2: Weighting by Term Frequency (tf)
document  text                          terms
d1        ant ant bee                   ant bee
d2        dog bee dog hog dog ant dog   ant bee dog hog
d3        cat gnu dog eel fox           cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog   length
d1     2    1                                   √5
d2     1    1         4                   1     √19
d3               1    1    1    1    1          √5

Weights: t_ij = frequency with which term j occurs in document i

Example 2 (continued)
Similarity of documents in example:
       d1    d2    d3
d1    1     0.31  0
d2    0.31  1     0.41
d3    0     0.41  1
Similarity depends upon the weights given to the terms. [Note differences in results from Example 1.]

Summary: Vector Similarity Computation with Weights
Documents in a collection are assigned terms from a set of n terms.
The term vector space W is defined as:
  if term k does not occur in document d_i, w_ik = 0
  if term k occurs in document d_i, w_ik is greater than zero
  (w_ik is called the weight of term k in document d_i)
Similarity between d_i and d_j is defined as:
  cos(d_i, d_j) = (Σ_{k=1}^{n} w_ik * w_jk) / (|d_i| |d_j|)
where d_i and d_j are the corresponding weighted term vectors and |d_i| is the length of the document vector d_i.

Summary: Vector Similarity Computation with Weights
Inner product (or dot product) between documents:
  d1.d2 = w11*w21 + w12*w22 + w13*w23 + ... + w1n*w2n
Inner product (or dot product) between a document and a query:
  d1.q1 = w11*wq11 + w12*wq12 + w13*wq13 + ... + w1n*wq1n
where wqij is the weight of the jth term of the ith query.

Simple Uses of Vector Similarity in Information Retrieval
Threshold: for query q, retrieve all documents with similarity above a threshold, e.g., similarity > 0.50.
Ranking: for query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]

Simple Example of Ranking (Weighting by Term Frequency)
query q: ant dog
document  text                          terms
d1        ant ant bee                   ant bee
d2        dog bee dog hog dog ant dog   ant bee dog hog
d3        cat gnu dog eel fox           cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog   length
q      1              1                         √2
d1     2    1                                   √5
d2     1    1         4                   1     √19
d3               1    1    1    1    1          √5

Calculate Ranking
Similarity of query to documents in example:
       d1       d2       d3
q     2/√10    5/√38    1/√10
      0.63     0.81     0.32
If the query q is searched against this document set, the ranked results are: d2, d1, d3
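The whole ranking pipeline can be sketched end to end; a hedged Python version using term-frequency weighting, cosine similarity, and the same toy corpus and query:

```python
import math
from collections import Counter

def tf_vector(text, vocab):
    """Term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}
query = "ant dog"

vocab = sorted({t for text in docs.values() for t in text.split()})
q = tf_vector(query, vocab)

# Rank documents by cosine similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(q, tf_vector(docs[d], vocab)), reverse=True)
print(ranking)  # ['d2', 'd1', 'd3'] -- matching the slide's result
```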

Bigger Corpora
Consider:
– n = 1M documents, each with about 1K terms.
– Avg 6 bytes/term incl. spaces/punctuation → 6GB of data.
– Say there are m = 500K distinct terms.

Can't Build the Matrix
- A 500K × 1M matrix has 500 billion 0's and 1's.
- But it has no more than 1 billion 1's.
  – The matrix is extremely sparse.
- What's a better representation?

Inverted Index
Documents are parsed to extract words and these are saved with the document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The Noble Brutus hath told you Caesar was ambitious.

Term      Doc #
I         1
did       1
enact     1
...
capitol   1
brutus    1
killed    1
me        1
so        2
let       2
...

Later, sort inverted file by terms.

Term      Doc #
ambitious 2
be        2
brutus    1
brutus    2
capitol   1
caesar    1
caesar    2
did       1
enact     1
I         1
I         1
...

Multiple term entries in a single document are merged, and frequency information is added.

Term      Doc #  Freq
...
caesar    1      1
caesar    2      2
...
killed    1      2
...
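The three inverted-index steps above (parse, sort, merge with frequencies) can be sketched in Python; the tokenization here is a simplification (just lowercasing and stripping ';' and '.'):

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The Noble Brutus hath told you Caesar was ambitious",
}

# postings: term -> {doc_id: frequency}; duplicate (term, doc) entries
# are merged and frequency information is added, as on the slide.
postings = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for token in text.lower().replace(";", "").replace(".", "").split():
        postings[token][doc_id] += 1

print(dict(postings["caesar"]))  # {1: 1, 2: 2}
print(dict(postings["killed"]))  # {1: 2}
print(dict(postings["brutus"]))  # {1: 1, 2: 1}
```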

Best Choice of Weights?
query q: ant dog
document  text                          terms
d1        ant ant bee                   ant bee
d2        dog bee dog hog dog ant dog   ant bee dog hog
d3        cat gnu dog eel fox           cat dog eel fox gnu
What weights for q, d1, d2, d3 lead to the best information retrieval?

Weighting: Term Frequency (tf)
Suppose term j appears f_ij times in document i. What weighting should be given to term j?
Term frequency concept: a term that appears many times within a document is likely to be more important than a term that appears only once.

Term Frequency: Free-text Document
A simple method is to use f_ij as the term frequency. But in free-text documents, terms are likely to appear more often in long documents. Therefore w_ij should be scaled by some variable related to the length of document i.

Term Frequency: Free-text Document
A standard method for free-text documents: scale f_ij relative to the frequency of other terms in document i. This partially corrects for variations in the length of the documents.
Let m_i = max(f_ij), i.e., m_i is the maximum frequency of any term in document i.
Term frequency (tf):
  tf_ij = f_ij / m_i   when f_ij > 0
Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.
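A minimal sketch of this tf normalisation, applied to d2 from the running example (where 'dog' is the most frequent term, so m_i = 4):

```python
from collections import Counter

def tf_weights(text):
    """tf_ij = f_ij / m_i, where m_i is the max term count in the document."""
    counts = Counter(text.split())
    m_i = max(counts.values())
    return {term: f / m_i for term, f in counts.items()}

tf = tf_weights("dog bee dog hog dog ant dog")
print(tf)  # {'dog': 1.0, 'bee': 0.25, 'hog': 0.25, 'ant': 0.25}
```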

Weighting: Inverse Document Frequency (idf)
Suppose term j appears f_ij times in document i. What weighting should be given to term j?
Inverse document frequency concept: a term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Inverse Document Frequency
Suppose there are n documents and that the number of documents in which term j occurs is n_j.
A possible method might be to use n/n_j as the inverse document frequency.
A standard method: the simple method over-emphasizes small differences. Therefore use a logarithm.
Inverse document frequency (idf):
  idf_j = log2(n / n_j) + 1   when n_j > 0
Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.

Example of Inverse Document Frequency
n = 1,000 documents; n_j = number of docs term j appears in.
term j   n_j     idf_j
A        100     4.32
B        500     2.00
C        900     1.15
D        1,000   1.00
idf_j modifies only the columns, not the rows!
From: Salton and McGill
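The idf values in the table can be reproduced directly; a small sketch:

```python
import math

def idf(n, n_j):
    """idf_j = log2(n / n_j) + 1, defined for n_j > 0."""
    return math.log2(n / n_j) + 1

n = 1000  # documents in the collection
print(round(idf(n, 100), 2))   # 4.32 -> rare term, strong discriminator
print(round(idf(n, 500), 2))   # 2.0
print(round(idf(n, 1000), 2))  # 1.0  -> term occurring in every document
```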

Full Weighting: A Standard Form of tf.idf
Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
  (weight of term j in document i) = (term frequency) * (inverse document frequency)
A standard tf.idf weighting scheme, for free-text documents, is:
  t_ij = tf_ij * idf_j = (f_ij / m_i) * (log2(n / n_j) + 1)   when n_j > 0
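Putting tf and idf together for the toy corpus; a sketch (the helper name `tfidf` is mine, not from the slides):

```python
import math
from collections import Counter

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

counts = {d: Counter(text.split()) for d, text in docs.items()}
n = len(docs)
# n_j: number of documents in which term j occurs
n_j = Counter(term for c in counts.values() for term in c)

def tfidf(doc_id, term):
    """t_ij = (f_ij / m_i) * (log2(n / n_j) + 1)"""
    f = counts[doc_id]
    if term not in f:
        return 0.0
    tf = f[term] / max(f.values())
    idf = math.log2(n / n_j[term]) + 1
    return tf * idf

# 'ant' is the most frequent term in d1 (tf = 1) and occurs in 2 of 3 docs.
print(round(tfidf("d1", "ant"), 3))  # 1.585
# 'hog' occurs once in d2 (m_i = 4) but in only 1 of 3 docs.
print(round(tfidf("d2", "hog"), 3))  # 0.646
```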

Discussion of Similarity
The cosine similarity measure is widely used and works well on a wide range of documents, but has no theoretical basis.
1. There are several possible measures other than the angle between vectors.
2. There is a choice of possible definitions of tf and idf.
3. With fielded searching, there are various ways to adjust the weight given to each field.

Similarity Measures Compared
  |Q ∩ D|                              Simple matching (coordination level match)
  2|Q ∩ D| / (|Q| + |D|)               Dice's Coefficient
  |Q ∩ D| / |Q ∪ D|                    Jaccard's Coefficient
  |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))    Cosine Coefficient (what we studied)
  |Q ∩ D| / min(|Q|, |D|)              Overlap Coefficient
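With Q and D viewed as term sets, the set-based measures above are one-liners; a sketch using the query {ant, dog} and the term set of d2:

```python
def dice(q, d):
    """Dice's Coefficient: 2|Q ∩ D| / (|Q| + |D|)."""
    return 2 * len(q & d) / (len(q) + len(d))

def jaccard(q, d):
    """Jaccard's Coefficient: |Q ∩ D| / |Q ∪ D|."""
    return len(q & d) / len(q | d)

def overlap(q, d):
    """Overlap Coefficient: |Q ∩ D| / min(|Q|, |D|)."""
    return len(q & d) / min(len(q), len(d))

Q = {"ant", "dog"}
D = {"ant", "bee", "dog", "hog"}  # terms of d2

print(round(dice(Q, D), 2))     # 0.67
print(jaccard(Q, D))            # 0.5
print(overlap(Q, D))            # 1.0
```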

Similarity Measures
- A similarity measure is a function which computes the degree of similarity between a pair of vectors or documents.
  – Since queries and documents are both vectors, a similarity measure can represent the similarity between two documents, two queries, or one document and one query.
- There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!).
- With a similarity measure between query and documents:
  – it is possible to rank the retrieved documents in the order of presumed importance
  – it is possible to enforce a certain threshold so that the size of the retrieved set can be controlled
  – the results can be used to reformulate the original query in relevance feedback (e.g., combining a document vector with the query vector)

Problems
- Synonyms: separate words that have the same meaning.
  – E.g. 'car' & 'automobile'
  – They tend to reduce recall.
- Polysemes: words with multiple meanings.
  – E.g. 'Java'
  – They tend to reduce precision.
- The problem is more general: there is a disconnect between topics and words.

"…a more appropriate model should consider some conceptual dimensions instead of words." (Gardenfors)

Latent Semantic Analysis (LSA)
- LSA aims to discover something about the meaning behind the words; about the topics in the documents.
- What is the difference between topics and words?
  – Words are observable
  – Topics are not. They are latent.
- How to find out topics from the words in an automatic way?
  – We can imagine them as a compression of words
  – A combination of words
  – Try to formalise this

Latent Semantic Analysis
- Singular Value Decomposition (SVD):
  A(m×n) = U(m×r) E(r×r) V(r×n)
- Keep only the k largest eigenvalues (singular values) from E:
  A(m×n) ≈ U(m×k) E(k×k) V(k×n)
- This converts terms and documents to points in a k-dimensional space.

Latent Semantic Analysis
- Singular Value Decomposition: {A} = {U}{S}{V}^T
- Dimension reduction: {A'} ≈ {U'}{S'}{V'}^T, keeping only the top k singular values/vectors.
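A sketch of the SVD-based dimension reduction with NumPy (assuming NumPy is available; the term-document matrix is the tf matrix of the running example, with rows ant bee cat dog eel fox gnu hog and columns d1 d2 d3):

```python
import numpy as np

# Term-document matrix A (terms x docs), tf weights from the running example.
A = np.array([
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 4, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)

# {A} = {U}{S}{V}^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Dimension reduction: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents become points in k-dimensional space.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]
print(doc_coords.shape)      # (2, 3): 3 documents in 2 dimensions
print(A_k.shape == A.shape)  # True: same shape, lower rank
```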

Latent Semantic Analysis
- LSA puts documents together even if they don't have common words, if
  – the docs share frequently co-occurring terms
- Disadvantages:
  – Statistical foundation is missing
PLSA addresses this concern!

PLSA
- Latent variable model for general co-occurrence data
- Associate each observation (w,d) with a class variable z ∈ Z = {z_1, ..., z_K}
- Generative model:
  – Select a doc d with probability P(d)
  – Pick a latent class z with probability P(z|d)
  – Generate a word w with probability P(w|z)
  P(d) → d → P(z|d) → z → P(w|z) → w

PLSA
To get the joint probability model:
  P(d, w) = P(d) P(w|d),  where  P(w|d) = Σ_z P(w|z) P(z|d)

Model Fitting with EM
- We have the equation for the log-likelihood function from the aspect model, and we need to maximize it.
- Expectation Maximization (EM) is used for this purpose.

EM Steps
- E-Step
  – Expectation step, where the expectation of the likelihood function is calculated with the current parameter values
- M-Step
  – Update the parameters with the calculated posterior probabilities
  – Find the parameters that maximize the likelihood function

E Step
It is the probability that a word w occurring in a document d is explained by aspect z. By Bayes' rule:
  P(z|d, w) = P(z|d) P(w|z) / Σ_{z'} P(z'|d) P(w|z')

M Step
- All these equations use P(z|d, w) calculated in the E-Step.
- The procedure converges to a local maximum of the likelihood function.
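The E- and M-steps can be sketched in pure Python on a toy count matrix. This uses the symmetric parameterisation P(d,w) = Σ_z P(z)P(d|z)P(w|z), which is equivalent to the P(d)P(z|d)P(w|z) form on the earlier slide; the data and variable names here are illustrative, not from Hofmann's paper:

```python
import math
import random

# Toy term-document counts n[d][w]: two obvious "aspects"
# (docs 0-1 use words 0-1, docs 2-3 use words 2-3).
n = [
    [4, 3, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 4, 3],
    [0, 0, 3, 4],
]
D, W, K = len(n), len(n[0]), 2

def norm(v):
    s = sum(v)
    return [x / s for x in v]

random.seed(0)
pz = norm([random.random() for _ in range(K)])                        # P(z)
pd_z = [norm([random.random() for _ in range(D)]) for _ in range(K)]  # P(d|z)
pw_z = [norm([random.random() for _ in range(W)]) for _ in range(K)]  # P(w|z)

def loglik():
    return sum(
        n[d][w] * math.log(sum(pz[z] * pd_z[z][d] * pw_z[z][w] for z in range(K)))
        for d in range(D) for w in range(W) if n[d][w] > 0
    )

prev = loglik()
for _ in range(30):
    # E-step: P(z|d,w) proportional to P(z) P(d|z) P(w|z), for observed pairs.
    post = {
        (d, w): norm([pz[z] * pd_z[z][d] * pw_z[z][w] for z in range(K)])
        for d in range(D) for w in range(W) if n[d][w] > 0
    }
    # M-step: re-estimate parameters from the expected counts n(d,w) P(z|d,w).
    for z in range(K):
        pd_new = [0.0] * D
        pw_new = [0.0] * W
        for (d, w), p in post.items():
            c = n[d][w] * p[z]
            pd_new[d] += c
            pw_new[w] += c
        pz[z] = sum(pd_new)
        pd_z[z] = norm(pd_new)
        pw_z[z] = norm(pw_new)
    pz = norm(pz)

cur = loglik()
print(cur >= prev)  # EM never decreases the log-likelihood
```

Each pass through the loop performs one full E-step and M-step; the final check illustrates the convergence guarantee stated on the slide (monotone increase toward a local maximum).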

The performance of a retrieval system based on this model (PLSI) was found superior to that of both the vector-space-based similarity (cos) and a non-probabilistic latent semantic indexing (LSI) method. (We skip details here.)
From Th. Hofmann, 2000

Comparing PLSA and LSA
- LSA and PLSA perform dimensionality reduction
  – In LSA, by keeping only K singular values
  – In PLSA, by having K aspects
- Comparison to SVD
  – U matrix related to P(d|z) (doc to aspect)
  – V matrix related to P(z|w) (aspect to term)
  – E matrix related to P(z) (aspect strength)
- The main difference is the way the approximation is done
  – PLSA generates a model (aspect model) and maximizes its predictive power
  – Selecting the proper value of K is heuristic in LSA
  – Model selection in statistics can determine the optimal K in PLSA
