CHATBOT BASED QUESTION ANSWERING SYSTEM

Transcription

CHATBOT BASED QUESTION ANSWERING SYSTEMByYASH SINHAA THESIS PRESENTED TO THE GRADUATE SCHOOLOF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEUNIVERSITY OF FLORIDA2018

2018 Yash Sinha

To my family, friends and teachers

ACKNOWLEDGMENTSI thank Professor Andy Li, Professor Sanjay Ranka and Professor Daisy Wangfor their continuous support and guidance. I would like to thank Purnendu Mukherjeeand other members of Li Lab for all the help during my thesis. I am also thankful to myfamily and friends who gave me their love and support. Finally, I would like to thank allthe people associated with the University of Florida, especially my advisor AdrienneCook for all the support during my academic life here.4

TABLE OF CONTENTSpageACKNOWLEDGMENTS .4LIST OF FIGURES .7LIST OF ABBREVIATIONS .9ABSTRACT .10CHAPTER1INTRODUCTION .122DEEP LEARNING BASICS .152.1 Deep Learning .152.2 Convolutional Neural Network .162.3 Recurrent Neural Network .172.4 Word Embedding .192.5 Attention Mechanism .202.6 Pointer Networks .212.7 Memory Networks .222.8 Reinforcement Learning .233LITERATURE REVIEW .263.1 Dynamic Memory Networks .263.2 Match-LSTM And Answer Pointer .293.3 Character Aware .313.4 R-NET .333.5 Bi-Directional Attention Flow .343.6 Reinforced Mnemonic Reader.393.7 FusionNet .413.8 Summary .444MULTI-ATTENTION.474.1 Multi-Attention For Machine Comprehension .474.1.1 Intuition .474.1.2 Model .484.1.3 Training .504.1.4 Summary .514.2 Chatbot.514.3 Attention Visualization .535

5CONCLUSION AND FUTURE DIRECTION .56LIST OF REFERENCES .59BIOGRAPHICAL SKETCH .646

LIST OF FIGURESFigurepage1-1An example from the SQuAD dataset.122-1An artificial neural network .162-2A convolutional neural network .172-3A recurrent neural network architecture .182-4An LSTM architecture .192-5t-SNE visualizations of word embeddings .202-6Attention mechanism .212-7Q-learning equation .242-8Policy learning equation .253-1Overview of DMN modules. .273-2Example from Facebook bAbI dataset.283-3Match-LSTM and Answer Pointers boundary model.303-4Equations for calculating attention vector .303-5Char Aware language model applied to an example sentence .323-6Architecture of R-NET. .333-7Architecture of BiDAF model .353-8The Bi-Directional Attention Model with self-attention.363-9Context-to-Query Attention vector .373-10 Query-to-Context Attention vector .373-11 Attention Matrix for BiDAF .383-12 Architecture of Reinforced Mnemonic Reader. .393-13 History of Word concept .423-14 Architecture of FusionNet model.437

3-15 Summarized view of fusion process used .453-16 Performance comparison on SQuAD dataset .463-17 Performance comparison on TriviaQA dataset .464-1Multi-Attention for Machine Comprehension .484-2Chatbot module flow.524-3OneTask Chatbot .534-4Attention Visualization on wrong samples .555-1Syntax Tree .588

LIST OF ABBREVIATIONSCNNConvolution Neural NetworksDMNDynamic Memory NetworkDMNDynamic Memory NetworksGRUGated Recurrent UnitLSTMLong Short-Term MemoryMCMachine ComprehensionNERNamed-Entity RecognitionNLMNeural Language ModelNLPNatural Language ProcessingNMTNeural Machine TranslationOOVOut of VocabularyPOSPart-of-SpeechPtr-NetPointer NetworksQAQuestion AnsweringReLURectified Linear UnitRLReinforcement LearningRNNRecurrent Neural NetworksSFUSemantic Fusion Unit9

Abstract of Thesis Presented to the Graduate Schoolof the University of Florida in Partial Fulfillment of theRequirements for the Degree of Master of ScienceCHATBOT BASED QUESTION ANSWERING SYSTEMByYash SinhaMay 2018Chair: Xiaolin LiMajor: Computer ScienceChatbots – also known as “conversational agents” – are software applicationsthat mimic written or spoken human speech for the purposes of simulating aconversation or interaction with a real person. Traditional chatbots have made heavyuse of NLP concepts like slot filling. In this work, we have used deep learning models tobuild a chatbot for flight booking. We first start with the task of Machine Comprehension,which is one of the most important problems in Natural Language Processing. Therehave been mainly three reasons for the recent advancement in this task. The attentionmechanism, the concept of Pointer Networks to constrain the output tokens to be fromthe input sequences and character embedding to solve the problem of OOV words.Apart from these, Memory Networks have helped in solving multi-sentence reasoningproblem and Reinforcement Learning has been used in the training phase as theoptimization strategy when answer boundary is fuzzy or too long.There also has been research to improve the encoding layer by using parts-ofspeech tags and named-entity tags. We perform a literature review of some of theimportant existing approaches, we also explore their performance and present theirresults. We also introduce the concept of mixed attention to capture the query10

information along with the context information. We see that the deep learning modelscan be effectively used in machine comprehension tasks as well as fully functionalchatbots.11

CHAPTER 1INTRODUCTIONQuestion answering task has been one of the most important problems in NLP.Teaching machines to read, process and comprehend and then answer questions hasproved to be a challenging task. The Machine Comprehension task [1] is as follows.Given a passage P and question Q, our task is to predict an answer A to question Qbased on information found in P. Answer A often includes non-entities and can be muchlonger phrases. This setup challenges us to understand and reason about both thequestion and passage in order to infer the answer. An example of the task can be seenin Figure 1-1.Figure 1-1. An example from the SQuAD [2] datasetThis proves to be challenging because of many reasons. Firstly, the traditionalmodels couldn’t capture OOV words. This is a problem when the word in the test setisn’t present in the training set and thus, the network is not able to generate wordembedding vectors for a new word. The next challenge is the multi-sentence reasoning.Humans increase their understanding by re-reading the context and the query. To helpthe machine achieve that kind of understanding, Memory Networks [3, 4] have proved tobe quite useful to incorporate multi-hop reasoning. Another challenge is that we fail tofully capture the long-distance contextual interaction between parts of the context, by12

only using LSTM [5] or GRU [6]. The problem here is that the answer candidate is oftenoblivious to important cues in the passage outside its surrounding window. Fusing thecontext into itself has proved to be useful to handle this problem. One of the challengesthat has not yet been solved to a good extent is that we fail to capture the semantics ofthe passage. There have been various approaches in Neural Language Models tocapture some of the semantics by using POS tagging, NER tagging, and using binarymatching between context and passage. However, these too have been unable tocompletely capture the semantics of the passage.The task of machine comprehension can be effectively translated into a chatbotsystem. We can think of chatbot as a question answering system where the paragraphP is the text entered by the user. To cater a specific domain, like that of flight bookingtask, we can generate a set of internal questions to related to the entities required forthe search. These will serve as question Q. The answer A will be the values of thoseentities. We will delve into the chatbot systems in more detail in the later chapters.Traditionally, the chatbots have used the concepts of NLP, like that of slot filling.Imagine we want to create an agent for a t-shirt shop [7]. In order to place an order, weneed to know the following information: ColorSizeNumberKnowing this, we can expect a user's conversation to go something like this:User: I'd like to buy a t-shirt.Agent: Great, first tell me a couple of things.Agent: How many do you want?13

User: 3Agent: What color would you like?User: BlackAgent: What size?User: MediumAgent: Got it, that was three, black t-shirts in medium. Did I get that right?User: Yes!Agent: Alright thanks, would you like to keep shopping or checkout?To use slot-filling, we set these three parameter values to be required. If users omit oneor more of the parameters in their response, we prompt them to provide values for eachmissing parameter. The problem with these approaches is that we need to generate alot of statements, wherein the answer from the user can be used to fill the slots. In manycases, it becomes virtually impossible to cover all the cases. In the cases where wewant to generalize these statements, deep learning models prove to be very effective. Inthe next section, we will cover the basics of deep learning and other concepts which willprovide a foundation to some of the state-of-the-art question answering models.14

CHAPTER 2DEEP LEARNING BASICS2.1 Deep LearningDeep learning [8] (also known as deep structured learning or hierarchicallearning) is part of a broader family of machine learning methods based on learningdata representations, as opposed to task-specific algorithms. Learning can besupervised, semi-supervised or unsupervised. Neural networks [9] are a set ofalgorithms, modeled loosely after the human brain, that are designed to recognizepatterns. They interpret sensory data through a kind of machine perception, labeling orclustering raw input. The patterns they recognize are numerical, contained in vectors,into which all real-world data, be it images, sound, text or time series, must betranslated.Neural networks help us cluster and classify. We can think of them as aclustering and classification layer on top of data we store and manage. They help togroup unlabeled data according to similarities among the example inputs, and theyclassify data when they have a labeled dataset to train on. To be more precise, neuralnetworks extract features that are fed to other algorithms for clustering andclassification, so we can think of deep neural networks as components of largermachine-learning applications involving algorithms for reinforcement learning,classification and regression.Figure 2-1 gives an example of a basic neural network. A deep learning networkis composed of several layers. The layers are made of nodes. A node is just a placewhere computation happens, loosely patterned on a neuron in the human brain, whichfires when it encounters sufficient stimuli. A node combines input from the data with a15

set of coefficients, or weights, that either amplify or dampen that input, therebyassigning significance to inputs for the task the algorithm is trying to learn These inputweight products are summed and the sum is passed through a node’s activationfunction, to determine whether and to what extent that signal progresses further throughthe network to affect the ultimate outcome, say, an act of classification. The weights canbe tuned based on the data.Figure 2-1. An artificial neural networkThe connections have weights that can be tuned based on the data, makingneural nets capable of learning. The tuning is done through an algorithm calledbackpropagation based on gradient descent. In the next few sections, we will coversome of the basics of deep learning.2.2 Convolutional Neural NetworkConvolutional networks [10, 11] are deep artificial neural networks that are veryeffective for image recognition and classification. They have been proved successful inidentifying faces, individuals, objects, traffic signs and powering vision in robots andself-driving cars. Figure 2-2 [12] illustrates the architecture of a basic CNN.16

Figure 2-2. A convolutional neural networkThere are primarily four operations in a standard CNN model: Convolution: The primary purpose of Convolution in the ConvNet is to extractfeatures from the input image. The spatial relationship between pixels, i.e. theimage features are preserved and learned by the convolution using smallsquares of input data.Non-Linearity (ReLU): Rectified Linear Unit (ReLU) [13] is a non-linear operationthat carries out an element wise operation on each pixel. This operation replacesthe negative pixel values in the feature map by zero.Pooling or Sub Sampling - Spatial Pooling reduces the dimensionality of eachfeature map but retains the most important information. For max pooling, thelargest value in the square window is taken and rest are dropped. Other types ofpooling are Average, Sum etc.Classification (Fully Connected Layer): The Fully Connected layer is a traditionalMulti-Layer Perceptron as described before, that uses a softmax activation [14]function in the output layer. The high-level features of the image are encoded bythe convolutional and pooling layers which is then fed to the fully connected layerwhich then uses these features for classifying the input image into variousclasses based on the training dataset.When a new image is fed into the CNN model, all the above-mentioned steps arecarried out (forward propagation) and a probability distribution is achieved on the set ofoutput classes. With a large enough training dataset, the network will learn andgeneralize well enough to classify new images into their correct classes.2.3 Recurrent Neural NetworkRNNs [15] are one of the popular deep learning architectures. They have showngreat promise in many NLP tasks. The idea behind RNNs is to make use of sequential17

information. In a traditional neural network, we assume that all inputs (and outputs) areindependent of each other. But for many tasks that’s a very bad idea. If we want topredict the next word in a sentence, we want to know which words came before it.RNNs are called recurrent because they perform the same task for every element of asequence, with the output being depended on the previous computations. Another wayto think about RNNs is that they have a “memory” which captures information aboutwhat has been calculated so far. In theory, RNNs can make use of information inarbitrarily long sequences, but in practice, they are limited to looking back only a fewsteps. This is because of the vanishing gradients. Figure 2-3 [16] depicts a basic RNNarchitecture.Figure 2-3. A recurrent neural network architectureTo counter the vanishing gradient problem, other special kinds of RNNs aregenerally used. Long Short-Term Memory networks are capable of learning long-termdependencies. They work tremendously well on a large variety of problems. The LSTMhave the ability to remove or add information to the cell state, carefully regulated bystructures called gates. Gates are a way to optionally let information through. They arecomposed out of a sigmoid neural net layer and a pointwise multiplication operation. AnLSTM has three kinds of gates: Forget Gate LayerInput Gate LayerOutput Gate Layer18

A forget gate layer decides what information to throw away from the cell state.The input gate layer decides which values to update. The output gate layer decides howmuch of the cell should be sent ahead. Figure 2-3 [17] depicts the architecture of atypical LSTM. Another variation of RNN is a Gated Recurrent Unit which has two gates,reset and update gates.Figure 2-4. An LSTM architecture2.4 Word EmbeddingWord embeddings [19] are one of the most interesting research areas of deeplearning. Word representations like word embedding, have been used across varioustasks like part-of-speech tagging, named entity recognition [20], parsing and semanticrole labeling. Any word of a dictionary (the set of words recognized for the specific task)is transformed into a numeric vector of a certain number of dimensions. A wordembedding W : words ℝn is a parameterized function mapping words in somelanguage to high-dimensional vectors. Word embeddings are the texts converted intonumbers and there may be different numerical representations of the same text.One of the tasks of word embedding is predicting whether an n-gram is valid.This is particularly helpful in detecting grammatical errors in the text. Another commontask is predicting the next word in the sentence. Word embeddings make a lot ofintuitive sense in NLP tasks. Similar words are close together and are closest in the19

embedding. These words tend to be quite similar. Figure 2-5 [19] shows t-SNEvisualizations of word embedding.Figure 2-5. t-SNE visualizations of word embeddingsSome of the most popular word embedding models are Word2Vec [21], GloVe[22] and CoVe [23]. Word2vec is a particularly computationally-efficient predictive modelfor learning word embeddings from raw text. Its input is a text corpus and its output is aset of vectors: feature vectors for words in that corpus. It detects similaritiesmathematically and creates vectors that are distributed numerical representations ofword features, features such as the context of individual words. GloVe seeks to makeexplicit what word2vec does implicitly. It encodes the meaning as vector offsets in anembedding space, seemingly only a serendipitous by-product of word2vec, is thespecified goal of GloVe.2.5 Attention MechanismAttention Mechanisms [26] in Neural Networks are very loosely based on thevisual attention mechanism found in humans. With attention mechanism, we allow thedecoder to “attend” to different parts of the source sentence at each step of the outputgeneration. The model learns what to attend, based on the input sentence and what it20

has produced so far. Figure 2-6 [27] depicts the attention mechanism. Each decoderoutput word yt depends on a weighted combination of all the input states. The a’s arethe weights that define in how much of each input state should be considered for eachoutput. So, if a3,2 is a large number, the decoder pays a lot of attention to the secondstate in the source sentence while producing the third word of the target sentence. Thea's are typically normalized to sum to 1.Figure 2-6. Attention mechanismAttention mechanism have proved very useful in machine comprehension tasksto predict the answers from the input passage. We calculate the attention of thequestion over the passage to get a query-aware context. This helps us in finding outwhich query words are most relevant to each context word. The query-to-contextattention signifies which of the context words have the closest similarity to one of thequery words. We will explore this in more detail in later chapters.2.6 Pointer NetworksTraditionally, the sequence-to-sequence models have been used for tasks likeNeural Machine Translation. In NMT, we map the meaning of a sentence into a fixedlength vector representation and then generate a translation based on that vector.21

These models use RNN to map an input sequence to an embedding and another RNNto map the embedding to an output sequence. However, these methods still require thesize of the output dictionary to be fixed a priori. Thus, we cannot directly apply this toproblems where the size of the output depends on the length of the input. Pointernetworks [29] are a variation of the sequence-to-sequence model with attention. Theoutput of pointer network are pointers to the elements of the input sequence. This helpsus the problem of the output being fixed length vectors. The pointer networks also helpin solving the out-of-vocabulary problem to some extent.2.7 Memory NetworksMemory Networks [3] is an attention-based neural network architecture thatoperates an external symbolic memory component to perform reasoning. This canachieve interesting performances on various tasks of question answering and dialoguemanagement and appears to be a promising avenue towards a better machinecomprehension of language. A memory network combines learning strategies from themachine learning literature with a memory component that can be read and written to.The model is trained to learn how to operate effectively with the memory component.The high-level view of a memory network is as follows: Memory m, which is an array of objects like vectors or array of stringsInput feature map I, which converts the incoming input to the internal featurerepresentationGeneralization component G, which updates old memories given the new input.Output feature map O, which produces a new output (in the featurerepresentation space), given the new input and the current memory state.Response R, which converts the output into the response format desired.Here, the components I, G, O and R can be potentially learned. In the case of aquestion answering system, we can use the components in the following way. I can22

make use of standard pre-processing such as parsing, co-reference, and entityresolution. It could also encode the input into an internal feature representation byconverting from text to a sparse or dense feature vector. G can be used by introducing afunction H which maps the internal feature representation produced by I to an individualmemory slot, and just updates the memory at H (I (x)). In this case, G updates the indexH (x) of m, but all other parts of the memory remain untouched. O reads from memoryand performs inference to deduce the set of relevant memories needed to perform agood response. R can be used to produce the actual wording of the question answerbased on the memories found by O, for example, a textual response.2.8 Reinforcement LearningReinforcement Learning [31] refers to a kind of Machine Learning method inwhich the agent receives a delayed reward in the next time step to evaluate its previousaction. There’s no answer key and the reinforcement learning agent has to decide howto act to perform its task. In the absence of existing training data, the agent learns fromexperience. It collects the training through trial-and-error as it attempts its task, with thegoal of maximizing long-term reward. An RL setup is generally composed of twocomponents, an agent and an environment. The environment refers to the object thatthe agent is acting, and the agent is the RL algorithm. The learning starts with theenvironment sending a state to the agent. The agent takes an action in response to thatstate. Now, the environment sends a pair of next state and reward to the agent. Theagent then updates its knowledge based on the reward returned by the environment.The loop keeps going on until the environment sends a terminal state, which ends theepisode. There are two popular type of RL algorithms, Q-learning and Policy Learning.23

Q-learning [33] is a type of RL algorithm in which we evaluate the action to takebased on an action-value function. This then determines the value of being in a certainstate. The agent decides what action to take based on the state. The input to the Qfunction is one state and one action and it returns the expected reward of that action atthat state. The following equation [31] shows how the value of Q is updated, based onthe reward we get from the environment.Figure 2-7. Q-learning equationThe learning rate alpha shows how aggressive we want to be when updating ourvalue. The reward is obtained by taking action at at state st. This reward is added to theold estimate. The estimate of optimal future value is the maximum achievable reward Qfor all available actions at xt 1. The old value of Q is subtracted to make sure that onlythe difference in the estimate is incremented or decremented. The action is taken basedon the action-selection strategy.Another popular RL algorithm is the Policy Learning [33]. In policy learning, thestate is mapped to the action. Unlike the Q-learning algorithm where the value functionis learned to estimate the value of each state-action pair, a policy is a map from state toaction. It implies that Policy Learning learns the Q-value based on the action performedby the current policy instead of the greedy policy. The Q-learning has no constraint overthe next action, as long as it maximizes the Q-value for the next state. The equation [31]for policy learning is similar to Q-learning.24

Figure 2-8. Policy learning equationThe next chapter explores the existing state-of-art models for QuestionAnswering task.25

CHAPTER 3LITERATURE REVIEW3.1 Dynamic Memory NetworksThe paper [35] is based on the concept that most tasks in natural languageprocessing can be cast into question answering (QA) problems over language input.The authors introduce a neural network architecture called dynamic memory network,which processes input sequences and questions, forms episodic memories, andgenerates relevant answers. There is an iterative attention process, which i

I thank Professor Andy Li, Professor Sanjay Ranka and Professor Daisy Wang for their continuous support and guidance. I would like to thank Purnendu Mukherjee and other members of Li Lab for all the help during my thesis. I am also thankful to my family and friends who gave me their love and sup