An Introduction To Hidden Markov Models

Transcription

An Introduction to Hidden Markov Models

L. R. Rabiner
B. H. Juang

The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modeling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.

appropriate excitation. The easiest way then to address the time-varying nature of the process is to view it as a direct concatenation of these smaller "short time" segments, each such segment being individually represented by a linear system model. In other words, the overall model is a synchronous sequence of symbols where each of the symbols is a linear system model representing a short segment of the process. In a sense this type of approach models the observed signal using representative tokens of the signal itself (or some suitably averaged set of such signals if we have multiple observations).

Time-varying processes

Modeling time-varying processes with the above approach assumes that every such short-time segment of observation is a unit with a prechosen duration. In general, however, there does not exist a precise procedure to decide what the unit duration should be so that both the time-invariant assumption holds and the short-time linear system models (as well as concatenation of the models) are meaningful. In most physical systems, the duration of a short-time segment is determined empirically. In many processes, of course, one would neither expect the properties of the process to change synchronously with every unit analysis duration, nor observe drastic changes from each unit to the next except at certain instances. Making no further assumptions about the relationship between adjacent short-time models, and treating temporal variations, small or large, as "typical" phenomena in the observed signal, are key features in the above direct concatenation technique. This template approach to signal modeling has proven to be quite useful and has been the basis of a wide variety of speech recognition systems.

There are good reasons to suspect, at this point, that the above approach, while useful, may not be the most efficient (in terms of computation, storage, parameters, etc.) technique as far as representation is concerned. Many real-world processes seem to manifest a rather sequentially changing behavior; the properties of the process are usually held pretty steadily, except for minor fluctuations, for a certain period of time (or a number of the above-mentioned duration units), and then, at certain instances, change (gradually or rapidly) to another set of properties. The opportunity for more efficient modeling can be exploited if we can first identify these periods of rather steady behavior, and then are willing to assume that the temporal variations within each of these steady periods are, in a sense, statistical. A more efficient representation may then be obtained by using a common short-time model for each of the steady, or well-behaved, parts of the signal, along with some characterization of how one such period evolves to the next. This is how hidden Markov models (HMM) come about. Clearly, three problems have to be addressed: 1) how these steadily or distinctively behaving periods can be identified, 2) how the "sequentially" evolving nature of these periods can be characterized, and 3) what typical or common short-time model should be chosen for each of these periods. Hidden Markov models successfully treat these problems under a probabilistic or statistical framework.

It is thus the purpose of this paper to explain what a hidden Markov model is, why it is appropriate for certain types of problems, and how it can be used in practice. In the next section, we illustrate hidden Markov models via some simple coin toss examples and outline the three fundamental problems associated with the modeling technique.
We then discuss how these problems can be solved in Section III. We will not direct our general discussion to any one particular problem, but at the end of this paper we illustrate how HMM's are used via a couple of examples in speech recognition.

DEFINITION OF A HIDDEN MARKOV MODEL

An HMM is a doubly stochastic process with an underlying stochastic process that is not observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observed symbols. We illustrate HMM's with the following coin toss example.

Coin toss example

To understand the concept of the HMM, consider the following simplified example. You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening. On the other side of the barrier is another person who is performing a coin (or multiple coin) tossing experiment. The other person will not tell you anything about what he is doing exactly; he will only tell you the result of each coin flip. Thus a sequence of hidden coin tossing experiments is performed, and you only observe the results of the coin tosses, i.e.

O = O_1 O_2 O_3 ... O_T

where each observation O_t is either H or T; H stands for heads and T stands for tails.

Given the above experiment, the problem is how do we build an HMM to explain the observed sequence of heads and tails. One possible model is shown in Fig. 1a. We call this the "1-fair coin" model. There are two states in the model, but each state is uniquely associated with either heads (state 1) or tails (state 2). Hence this model is not hidden because the observation sequence uniquely defines the state. The model represents a "fair coin" because the probability of generating a head (or a tail) following a head (or a tail) is 0.5; hence there is no bias on the current observation. This is a degenerate example and shows how independent trials, like tossing of a fair coin, can be interpreted as a set of sequential events. Of course, if the person behind the barrier is, in fact, tossing a single fair coin, this model should explain the outcomes very well.

A second possible HMM for explaining the observed sequence of coin toss outcomes is given in Fig. 1b. We call this model the "2-fair coin" model. There are again 2 states in the model, but neither state is uniquely associated with
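To make the two coin-toss models concrete, the following is a minimal sketch in Python (ours, not the paper's) of how they might be written down as standard HMM parameter sets, with a state-transition matrix A, an observation matrix B, and an initial distribution pi, assuming observation symbol 0 denotes heads and symbol 1 denotes tails; the transition values for the 2-fair-coin model are illustrative, since the actual numbers appear in Fig. 1b.

import numpy as np

# "1-fair coin" model (Fig. 1a): state 1 always emits heads, state 2 always
# emits tails, so the observation sequence uniquely determines the state and
# nothing is hidden.
one_fair_coin = {
    "A":  np.array([[0.5, 0.5],      # P(next state | current state)
                    [0.5, 0.5]]),
    "B":  np.array([[1.0, 0.0],      # state 1 emits heads with probability 1
                    [0.0, 1.0]]),    # state 2 emits tails with probability 1
    "pi": np.array([0.5, 0.5]),
}

# "2-fair coin" model (Fig. 1b): each state behaves like a fair coin, so the
# observed heads/tails sequence no longer identifies which state produced it;
# the model is genuinely hidden.
two_fair_coin = {
    "A":  np.array([[0.5, 0.5],      # illustrative transition probabilities
                    [0.5, 0.5]]),
    "B":  np.array([[0.5, 0.5],      # both states emit heads or tails
                    [0.5, 0.5]]),    # with probability 0.5
    "pi": np.array([0.5, 0.5]),
}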

observation probability distributions which, of course, represent random variables or stochastic processes. Using the model, an observation sequence O = O_1, O_2, ..., O_T is generated as follows:
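The generation procedure can be sketched as follows (our own illustrative code, not the paper's, assuming the usual parameterization with an N x N transition matrix A, an N x M observation matrix B, and an initial state distribution pi):

import numpy as np

def generate_sequence(A, B, pi, T, rng=None):
    """Sample a length-T observation sequence from a discrete-symbol HMM."""
    rng = np.random.default_rng() if rng is None else rng
    states, observations = [], []
    state = rng.choice(len(pi), p=pi)                  # choose an initial state from pi
    for _ in range(T):
        symbol = rng.choice(B.shape[1], p=B[state])    # emit a symbol according to B
        states.append(state)
        observations.append(symbol)
        state = rng.choice(A.shape[0], p=A[state])     # move to the next state according to A
    return states, observations

# Example: ten tosses from the 2-fair-coin model (0 = heads, 1 = tails).
A  = np.array([[0.5, 0.5], [0.5, 0.5]])
B  = np.array([[0.5, 0.5], [0.5, 0.5]])
pi = np.array([0.5, 0.5])
print(generate_sequence(A, B, pi, T=10)[1])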



sequence given the model. This is the most difficult of the three problems we have discussed. There is no known way to solve for a maximum likelihood model analytically. Therefore an iterative procedure, such as the Baum-Welch method, or gradient techniques for optimization, must be used. Here we will only discuss the iterative procedure. It appears that with this procedure, the physical meaning of the various parameter estimates can be easily visualized.

To describe how we (re)estimate HMM parameters, we first define ξ_t(i, j) as

ξ_t(i, j) = Pr(i_t = q_i, i_{t+1} = q_j | O, λ),

i.e. the probability of a path being in state q_i at time t and making a transition to state q_j at time t + 1, given the observation sequence and the model. From Fig. 5 it should be clear that we can write ξ_t(i, j) as

ξ_t(i, j) = α_t(i) a_{ij} b_j(O_{t+1}) β_{t+1}(j) / Pr(O | λ).

In the above, α_t(i) accounts for the first t observations, ending in state q_i at time t, the term a_{ij} b_j(O_{t+1}) accounts for the transition to state q_j at time t + 1 with the occurrence of symbol O_{t+1}, and the term β_{t+1}(j) accounts for the remainder of the observation sequence.
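As an illustrative sketch (not from the paper), ξ_t(i, j) can be computed directly from the forward and backward variables; the code below assumes unscaled α and β arrays have already been obtained by the forward-backward procedure, and the function name is ours:

import numpy as np

def compute_xi(alpha, beta, A, B, obs):
    """xi[t, i, j] = Pr(state q_i at time t, state q_j at time t+1 | O, model).

    alpha : T x N array of forward variables  alpha_t(i)
    beta  : T x N array of backward variables beta_t(j)
    A     : N x N state-transition matrix     a_ij
    B     : N x M observation matrix          b_j(k)
    obs   : length-T sequence of observation-symbol indices
    """
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # numerator: alpha_t(i) * a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
        num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] = num / num.sum()      # the denominator equals Pr(O | model)
    return xi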



durational information is often represented in a normalized form for word models (since the word boundary is essentially known), in the form

p_j(ℓ/T) = probability of being in state j for exactly the fraction ℓ/T of the word,

where T is the number of frames in the word.

The dynamic range of Pr(O, I | λ) is usually very large, and max_I Pr(O, I | λ) is usually the only significant term in the summation for Pr(O | λ). Therefore, in such cases, either the forward-backward procedure or the Viterbi algorithm works equally well in the word recognition task.
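To see why the two scores usually agree in practice, here is a small sketch (ours, not the paper's) for a discrete-symbol HMM with parameters (A, B, pi): the forward procedure sums Pr(O, I | λ) over all state sequences I, while the Viterbi score keeps only the largest term, so when a single path dominates the two numbers are nearly equal.

import numpy as np

def forward_score(A, B, pi, obs):
    """Pr(O | model): sum of Pr(O, I | model) over all state sequences I."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def viterbi_score(A, B, pi, obs):
    """max over state sequences I of Pr(O, I | model) (Viterbi path score)."""
    delta = pi * B[:, obs[0]]
    for o in obs[1:]:
        delta = (delta[:, None] * A).max(axis=0) * B[:, o]
    return delta.max()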

