A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval

Transcription

A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng
Microsoft Research, Redmond, WA, USA

Grégoire Mesnil
University of Montréal, Montréal, Canada

ABSTRACT

In this paper, we propose a new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents. In order to capture the rich contextual structures in a query or a document, we start with each word within a temporal context window in a word sequence to directly capture contextual features at the word n-gram level. Next, the salient word n-gram features in the word sequence are discovered by the model and are then aggregated to form a sentence-level feature vector. Finally, a non-linear transformation is applied to extract high-level semantic information to generate a continuous vector representation for the full text string. The proposed convolutional latent semantic model (CLSM) is trained on clickthrough data and is evaluated on a Web document ranking task using a large-scale, real-world data set. Results show that the proposed model effectively captures salient semantic information in queries and documents for the task while significantly outperforming previous state-of-the-art semantic models.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Experimentation

Keywords

Convolutional Neural Network; Semantic Representation; Web Search

1. INTRODUCTION

Most modern search engines resort to semantic-based methods beyond lexical matching for Web document retrieval. This is partially due to the fact that the same single concept is often expressed using different vocabularies and language styles in documents and queries. For example, latent semantic models such as latent semantic analysis (LSA) are able to map a query to its relevant documents at the semantic level where lexical matching often fails (e.g., [9][10][31]). These models address the problem of language discrepancy between Web documents and search queries by grouping different terms that occur in a similar context into the same semantic cluster. Thus, a query and a document, represented as two vectors in the low-dimensional semantic space, can still have a high similarity even if they do not share any term.
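To make the contrast with lexical matching concrete, the following minimal Python sketch scores a query against a document using only their low-dimensional semantic vectors, assuming some latent semantic model has already produced those vectors; the numbers and example texts are made up, yet they show how a pair with zero term overlap can still receive a high similarity.

import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional semantic vectors produced by some latent semantic
# model for a query and a document title that share no terms lexically.
query_vec = np.array([0.9, 0.1, 0.0, 0.2])   # e.g., "auto body repair cost"
doc_vec   = np.array([0.8, 0.2, 0.1, 0.3])   # e.g., "car bodywork estimate prices"

print(cosine(query_vec, doc_vec))  # high similarity despite zero term overlap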
Extending from LSA, probabilistic topic models such as probabilistic LSA (PLSA), Latent Dirichlet Allocation (LDA), and the Bi-Lingual Topic Model (BLTM) have been proposed and successfully applied to semantic matching [19][4][16][15][39]. More recently, semantic modeling methods based on neural networks have also been proposed for information retrieval (IR) [16][32][20]. Salakhutdinov and Hinton proposed the Semantic Hashing method based on a deep auto-encoder in [32][16]. A Deep Structured Semantic Model (DSSM) for Web search was proposed in [20], which is reported to give very strong IR performance on a large-scale Web search task when clickthrough data are exploited as weakly-supervised information in training the model. In both methods, plain feed-forward neural networks are used to extract the semantic structures embedded in a query or a document.

Despite the progress made recently, all the aforementioned latent semantic models view a query (or a document) as a bag of words. As a result, they are not effective in modeling contextual structures of a query (or a document). Table 1 gives several examples of document titles to illustrate the problem. For example, the word "office" in the first document refers to the popular Microsoft product, but in the second document it refers to a working space. We see that the precise search intent of the word "office" cannot be identified without context.

    microsoft office excel could allow remote code execution
    welcome to the apartment office
    online body fat percentage calculator
    online auto body repair estimates

Table 1: Sample document titles. The text is lower-cased and punctuation removed. The same word, e.g., "office", has different meanings depending on its contexts.

Modeling contextual information in search queries and documents is a long-standing research topic in IR [11][25][12][26][2][22][24]. Classical retrieval models, such as TF-IDF and BM25, use a bag-of-words representation and cannot effectively capture contextual information of a word. Topic models learn the topic distribution of a word by considering word occurrence information within a document or a sentence. However, the contextual information captured by such models is often too coarse-grained to be effective for the retrieval task. For example, the word "office" in "office excel" and "apartment office", which represent two very different search intents when used in search queries, is likely to be projected to the same topic. As an alternative, retrieval methods that directly model phrases (or word n-grams) and term dependencies were proposed in [12][25][26]. For example, in [25], the Markov Random Field (MRF) is used to model dependencies among terms (e.g., term n-grams and skip-grams) of the query and the document for ranking, while in [26] a latent concept expansion (LCE) model is proposed which leverages term-dependent information by adding n-grams and unordered n-grams as features into the log-linear ranking model. In [12] a phrase-based translation model was proposed to learn the translation probability of a multi-term phrase in a query given a phrase in a document. Since phrases capture richer contextual information than words, more precise translations can be determined. However, the phrase translation model can only score phrase-to-phrase pairs observed in the clickthrough training data and thus generalizes poorly to new phrases.
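As an illustration of the kind of term-dependency features such methods rely on, the sketch below extracts contiguous word n-grams and unordered skip-gram pairs from a query. The feature forms and window size are illustrative assumptions, not the exact feature set of the MRF or LCE models cited above.

def ngram_features(tokens, n=2):
    """Contiguous word n-grams (ordered term dependencies)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgram_features(tokens, window=3):
    """Unordered term pairs co-occurring within a small window (proximity dependencies)."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append(tuple(sorted((tokens[i], tokens[j])))) 
    return pairs

query = "microsoft office excel remote code execution".split()
print(ngram_features(query))     # ('microsoft', 'office'), ('office', 'excel'), ...
print(skipgram_features(query))  # unordered pairs within a 3-word window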

In this study, we develop a new latent semantic model based on a convolutional neural network with a convolution-pooling structure, called the convolutional latent semantic model (CLSM), to capture the important contextual information for latent semantic modeling. Instead of using an input representation based on bag-of-words, the new model views a query or a document¹ as a sequence of words with rich contextual structure, and it retains maximal contextual information in its projected latent semantic representation. The CLSM first projects each word within its context to a low-dimensional continuous feature vector, which directly captures the contextual features at the word n-gram level (detailed in section 3.3). Second, instead of summing over all word n-gram features uniformly, the CLSM discovers and aggregates only the salient semantic concepts to form a sentence-level feature vector (detailed in section 3.4). Then, the sentence-level feature vector is further fed to a regular feed-forward neural network, which performs a non-linear transformation, to extract high-level semantic information of the word sequence. In training, the parameters of the CLSM are learned on clickthrough data.
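The sketch below mirrors these three steps with NumPy: a convolution that projects each three-word window to a local feature vector, max-pooling over window positions, and a final tanh layer that yields the sentence-level semantic vector. The layer sizes, the random weights, and the generic per-word input vectors (standing in for the word-n-gram representation of section 3.3) are simplifying assumptions for illustration, not the paper's configuration.

import numpy as np

rng = np.random.default_rng(0)

def clsm_forward(word_vecs, W_conv, W_sem):
    """Map a word sequence (T x d matrix of per-word vectors) to a semantic vector."""
    T, d = word_vecs.shape
    padded = np.vstack([np.zeros((1, d)), word_vecs, np.zeros((1, d))])   # pad boundaries
    # Convolution: concatenate each 3-word window and project it (word-n-gram features).
    windows = np.hstack([padded[0:T], padded[1:T + 1], padded[2:T + 2]])  # (T, 3d)
    local = np.tanh(windows @ W_conv)                                     # (T, k)
    # Max-pooling: per dimension, keep the most salient local feature (sentence level).
    pooled = local.max(axis=0)                                            # (k,)
    # Non-linear semantic layer: high-level representation of the full text string.
    return np.tanh(pooled @ W_sem)                                        # (L,)

d, k, L = 50, 300, 128                          # illustrative layer sizes
W_conv = 0.1 * rng.standard_normal((3 * d, k))
W_sem = 0.1 * rng.standard_normal((k, L))

query_words = rng.standard_normal((4, d))       # stand-in vectors for 4 query words
title_words = rng.standard_normal((7, d))       # stand-in vectors for 7 title words
q = clsm_forward(query_words, W_conv, W_sem)
t = clsm_forward(title_words, W_conv, W_sem)
score = float(q @ t / (np.linalg.norm(q) * np.linalg.norm(t)))  # cosine relevance score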
Our research contributions can be summarized as follows:

• We propose a novel CLSM that captures both the word n-gram level and sentence-level contextual structures for IR using carefully designed convolution and pooling operations;

• We carry out an extensive experimental study on the proposed model whereby several state-of-the-art semantic models are compared, and we achieve a significant performance improvement on a large-scale real-world Web search data set;

• We perform an in-depth case analysis on the capacity of the proposed model, through which the strength of the CLSM is clearly demonstrated.

¹ In modern search engines, a Web document is described by multiple fields [12][38], including title, body, anchor text, etc. In our experiments, we only used the title field of a Web document for ranking. In addition to providing simplicity for fast experimentation, our decision is motivated by the observation that the title field gives a better single-field retrieval result than body, although it is much shorter (as shown in Table 4). Thus it can serve as a reasonable baseline in our experiments. Nevertheless, our methods are not limited to the title field, and can be easily applied to the multi-field description.

2. RELATED WORK

2.1 Modeling Term Dependencies for IR

Although most traditional retrieval models assume the occurrences of terms to be completely independent, contextual information is crucial for detecting the particular search intent of a query term. Thus, research in this area has been focusing on capturing term dependencies. Early work tries to relax the independence assumption by including phrases, in addition to single terms, as indexing units [6][36]. Phrases are defined by collocations (adjacency or proximity) and selected on statistical grounds, possibly with some syntactic knowledge. Unfortunately, the experiments did not provide a clear indication whether the retrieval effectiveness can be improved in this way. Recently, within the framework of language models for IR, various approaches that go beyond unigrams have been proposed to capture certain term dependencies, notably the bigram and trigram models [35], the dependence model [11], and the MRF-based models [25][26]. These models have shown the benefit of capturing dependencies. However, they focus on the utilization of phrases as indexing units, rather than on phrase-to-phrase semantic relationships.

The translation model-based approach proposed in [12] tries to extract phrase-to-phrase relationships according to clickthrough data. Such relationships are expected to be more effective in bridging the gap between queries and documents. In particular, the phrase translation model learns a probability distribution over "translations" of multi-word phrases from documents to queries. Assuming that queries and documents are composed using two different "languages", the phrases can be viewed as bilingual phrases (or bi-phrases in short), which are consecutive multi-term sequences that can be translated from one language to another as units. In [12], it was shown that the phrase model is more powerful than word translation models [3] because words in the relationships are considered with some context words within a phrase. Therefore, more precise translations can be determined for phrases than for words. Recent studies show that this approach is highly effective when large amounts of clickthrough data are available for training [12][15]. However, as discussed before, the phrase-based translation model can only score phrase pairs observed in the training data, and cannot generalize to new phrases. In contrast, the CLSM can generalize to model any context. In our experiments reported in Section 5, we will compare the CLSM with the word-based and phrase-based translation models.
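To make the generalization problem concrete, the toy sketch below scores bi-phrases by looking them up in a translation table of the kind learned from clickthrough data; any pair not observed in training receives only an arbitrary backoff value. The table entries and the backoff constant are invented for illustration.

# Hypothetical phrase translation table P(query phrase | document phrase) estimated
# from clickthrough data; a real table would hold millions of bi-phrases.
phrase_table = {
    ("car bodywork", "auto body repair"): 0.12,
    ("body fat ratio", "body fat percentage"): 0.09,
}

UNSEEN_BACKOFF = 1e-6  # arbitrary smoothing mass for bi-phrases never seen in training

def translation_prob(query_phrase, doc_phrase):
    """Score only bi-phrases seen in training; everything else falls back to a constant."""
    return phrase_table.get((doc_phrase, query_phrase), UNSEEN_BACKOFF)

print(translation_prob("auto body repair", "car bodywork"))        # observed pair: 0.12
print(translation_prob("auto body repair", "vehicle panel work"))  # unseen pair: backoff only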
2.2 Latent Semantic Models

The most well-known linear projection model for IR is LSA [9]. It models the whole document collection using an n × d document-term matrix C, where n is the number of documents and d is the number of word types. C is first factored into the product of three matrices using singular value decomposition (SVD) as C = UΣVᵀ, where the orthogonal matrices V and U hold the term and document vectors, respectively, and the diagonal elements of Σ are the singular values in descending order. Then, a low-rank matrix approximation of C is generated by retaining only the k biggest singular values in Σ. Now, a document (or a query) represented by a term vector D can be mapped to a low-dimensional concept vector D̂ = AᵀD, where the d × k matrix A is called the projection matrix. In document search, the relevance score between a query and a document, represented respectively by term vectors Q and D, is assumed to be proportional to the cosine similarity score of the corresponding concept vectors Q̂ and D̂, obtained via the projection matrix A.
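The following sketch shows one common way to realize this projection with a truncated SVD; the toy document-term matrix is random, and the exact scaling used to form the projection matrix A varies across formulations, so treat this as an illustration rather than the paper's precise construction.

import numpy as np

rng = np.random.default_rng(1)
n_docs, n_terms, k = 20, 100, 5              # toy sizes

C = rng.random((n_docs, n_terms))            # document-term matrix (e.g., term weights)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Term-side projection matrix built from the k largest singular values.
A = Vt[:k].T @ np.diag(1.0 / s[:k])          # (n_terms, k)

def to_concept(term_vec):
    """Map a term vector (query or document) into the k-dimensional concept space."""
    return term_vec @ A

def relevance(q_term_vec, d_term_vec):
    """Cosine similarity of the concept vectors, as in the description above."""
    q, d = to_concept(q_term_vec), to_concept(d_term_vec)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

print(relevance(rng.random(n_terms), rng.random(n_terms)))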
Generative topic models are also widely used for IR. They include Probabilistic Latent Semantic Analysis (PLSA) [19] and its extensions such as Latent Dirichlet Allocation (LDA) [4][39]. PLSA assumes that each document has a multinomial distribution over topics (called the document-topic distribution), where each of the topics is in turn a multinomial distribution over words (called the topic-word distribution). The relevance of a query to a document is assumed to be proportional to the likelihood of generating the query by that document. Recently, topic models have been extended so that they can be trained on clickthrough data. For example, a generative model called the Bi-Lingual Topic Model (BLTM) is proposed for Web search in [15], which assumes that a query and its clicked document share the same document-topic distribution. It is shown that, by learning the model on clicked query-title pairs, the BLTM gives superior performance over PLSA [15].
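A toy rendering of this generative view, with invented distributions: each document mixes topics, each topic emits words, and the relevance of a query is taken as the likelihood of generating its words from the document.

import numpy as np

# Invented document-topic and topic-word distributions (each row sums to 1).
doc_topic = np.array([0.7, 0.3])                  # P(topic | document)
topic_word = np.array([
    [0.5, 0.3, 0.1, 0.1],                         # P(word | topic 0), e.g., a software topic
    [0.1, 0.1, 0.4, 0.4],                         # P(word | topic 1), e.g., a housing topic
])
vocab = {"office": 0, "excel": 1, "apartment": 2, "rent": 3}

def query_likelihood(query_words):
    """P(query | document) = product over words of sum over topics P(w | z) P(z | d)."""
    word_probs = doc_topic @ topic_word           # P(word | document)
    return float(np.prod([word_probs[vocab[w]] for w in query_words]))

print(query_likelihood(["office", "excel"]))      # relatively likely under this document
print(query_likelihood(["apartment", "rent"]))    # less likely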

2.3 Neural-Network-based Semantic Models

Deep architectures have been shown to be highly effective in discovering from training data the hidden structures and features at different levels of abstraction useful for a variety of tasks [32][18][20][37][7][34]. Among them, the DSSM proposed in [20] is most relevant to our work. The DSSM uses a feed-forward neural network to map the raw term vector (i.e., with the bag-of-words representation) of a query or a document to its latent semantic vector.
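To contrast with the convolutional structure of the CLSM, the sketch below shows a plain feed-forward mapping of the kind described here: a bag-of-words term vector pushed through tanh layers to a latent semantic vector, with relevance measured by cosine similarity. The layer sizes and random weights are illustrative assumptions, not the DSSM's actual configuration.

import numpy as np

rng = np.random.default_rng(2)
vocab_size, hidden, latent = 500, 300, 128        # illustrative sizes

W1 = 0.05 * rng.standard_normal((vocab_size, hidden))
W2 = 0.05 * rng.standard_normal((hidden, latent))

def feed_forward_semantic_vector(term_vec):
    """Map a bag-of-words term vector to a latent semantic vector with tanh layers."""
    return np.tanh(np.tanh(term_vec @ W1) @ W2)

query_bow = rng.integers(0, 2, vocab_size).astype(float)  # stand-in query term vector
doc_bow = rng.integers(0, 2, vocab_size).astype(float)     # stand-in document term vector
q = feed_forward_semantic_vector(query_bow)
d = feed_forward_semantic_vector(doc_bow)
relevance = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))  # cosine similarity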