Hidden Markov Models (HMMs): Prabhleen Juneja, Thapar Institute of Engineering & Technology


HIDDEN MARKOV MODELS (HMMs)

Prabhleen Juneja
Thapar Institute of Engineering & Technology
INTRODUCTION
 A Hidden Markov Model (HMM) is a sequence classifier. A sequence
classifier or sequence labeler is a model whose job is to assign some label or
class to each unit in a sequence.
 Given a sequence of units (words, letters, morphemes, sentences, whatever),
the job of the HMM is to compute a probability distribution over possible
labels and choose the best label sequence.
 The finite-state transducer for morphological analysis is a kind of non-
probabilistic sequence classifier, for example transducing from sequences of
words to sequences of morphemes.
 HMMs extend this notion by being probabilistic sequence classifiers.
 Sequence labeling tasks come up throughout speech and language
processing. For instance, in part-of-speech tagging, each word in a sequence
has to be assigned a part-of-speech tag.
 Besides part-of-speech tagging, applications of these sequence models
include tasks like speech recognition, grapheme-to-phoneme conversion, named
entity recognition, and information extraction.
MARKOV CHAINS
 The Hidden Markov Model is one of the most important machine
learning models in speech and language processing.
 The Hidden Markov Model is a descendant of the Markov chain,
sometimes called the observed Markov model.
 Markov chains and Hidden Markov Models are both extensions of
finite automata.
 A Markov chain is a special case of a weighted automaton. A
weighted finite-state automaton is a simple augmentation of the finite
automaton in which each arc is associated with a probability, indicating
how likely that path is to be taken. The probability on all the arcs
leaving a node must sum to 1.
 Because it can’t represent inherently ambiguous problems (finite
state), a Markov chain is only useful for assigning probabilities to
unambiguous sequences.
MARKOV CHAINS CONTD……
 Fig. a shows a Markov chain for assigning a probability to a sequence of
weather events, where the vocabulary consists of HOT, COLD, and RAINY.
 Fig. b shows another simple example of a Markov chain for assigning a
probability to a sequence of words w1...wn. This Markov chain in fact
represents a bigram language model.
MARKOV CHAINS CONTD……
 A Markov chain is a kind of probabilistic graphical model: a way of
representing probabilistic assumptions in a graph. A Markov chain is specified
by the following components:

Q = q1 q2 . . . qN          a set of N states

A = a01 a02 . . . an1 . . . ann   a transition probability matrix A, each aij
                            representing the probability of moving from
                            state i to state j, such that
                            Σ_{j=1}^{n} aij = 1 for all i

q0, qF                      a special start state and end (final) state
                            which are not associated with observations
MARKOV CHAINS CONTD……
 An alternate representation that is sometimes used for Markov chains doesn’t
rely on a start or end state, instead representing the distribution over initial
states and accepting states explicitly:
 = 1, 2, ...,  N an initial probability distribution over
states. i is the probability that the Markov chain will start in
state i. Some states j may have j = 0, meaning that they cannot
be initial. Also n


i 1
i 1

QA = {qx, qy...} a set QA ⊂ Q of legal accepting states


MARKOV CHAINS CONTD……

 Another representation of the same Markov chain for weather shown in the
previous figure (slide 4). Instead of using a special start state with a01
transition probabilities, we use the π vector, which represents the
distribution over starting state probabilities. The figure in (b) shows sample
probabilities.
MARKOV CHAINS CONTD…..
 A Markov chain is based on an important assumption: in a first-
order Markov chain, the probability of a particular state
depends only on the previous state.
 Markov Assumption:
P(qi|q1 . . . qi−1) = P(qi|qi−1)

 Given any Markov model, we can assign a probability to any
sequence from the vocabulary using the Markov assumption.
 For example, for the model given in Slide 7, the probability of the
sequence “cold hot cold hot” can be computed as:
P(cold hot cold hot)
= P(cold|start) × P(hot|cold) × P(cold|hot) × P(hot|cold)
= 0.3 × 0.2 × 0.2 × 0.2 = 0.0024
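This chain-of-products computation can be sketched in a few lines of Python. The transition values below are hypothetical, chosen only to reproduce the arithmetic above (the referenced figure is not shown here):

```python
# Hypothetical transition probabilities matching the slide's arithmetic;
# the actual model figure (Slide 7) is not reproduced in this text.
trans = {
    ("start", "cold"): 0.3,
    ("cold", "hot"): 0.2,
    ("hot", "cold"): 0.2,
}

def sequence_probability(states):
    """P(q1..qn) = P(q1|start) * prod_i P(qi|qi-1)  (Markov assumption)."""
    p = 1.0
    prev = "start"
    for s in states:
        p *= trans[(prev, s)]
        prev = s
    return p

print(sequence_probability(["cold", "hot", "cold", "hot"]))  # 0.0024
```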
THE HIDDEN MARKOV MODEL
 A Markov chain is useful when we need to compute a probability for
a sequence of events that we can observe in the world.
 In many cases, however, the events we are interested in may not be
directly observable in the world.
 For example, in part-of-speech tagging we don’t observe part-of-
speech tags in the world; we see words, and must infer the correct
tags from the word sequence. We call the part-of-speech tags
hidden because they are not observed.
 The same architecture applies in speech recognition; in that case we see
acoustic events (sounds) in the world, and have to infer the presence
of ‘hidden’ words that are the underlying causal source of the
acoustics.
 A Hidden Markov Model (HMM) allows us to talk about both
observed events (like words that we see in the input) and hidden
events (like part-of-speech tags).
THE HIDDEN MARKOV MODEL CONTD….
 Imagine a climatologist in the year 2799 studying the
history of global warming. He cannot find any records of the
weather in Baltimore, Maryland, for the summer of 2007, but
he finds Jason Eisner’s diary, which lists how many ice creams
Jason ate every day that summer. Our goal is to use these
observations to estimate the temperature every day.
 In order to simplify this weather task, we assume there are
only two kinds of days: cold (C) and hot (H). So the task is as
follows:
Given a sequence of observations O, each observation is an
integer corresponding to the number of ice creams eaten on
a given day, figure out the correct ‘hidden’ sequence Q of
weather states (H or C) which caused Jason to eat the ice
cream.
THE HIDDEN MARKOV MODEL CONTD….
 An HMM is specified by the following components:

Q = q1 q2 . . . qN          a set of N states

A = a11 a12 . . . an1 . . . ann   a transition probability matrix A, each aij
                            representing the probability of moving from
                            state i to state j, such that
                            Σ_{j=1}^{n} aij = 1 for all i

O = o1 o2 . . . oT          a sequence of T observations

B = bi(ot)                  a sequence of observation likelihoods, also
                            called emission probabilities, each expressing
                            the probability of an observation ot being
                            generated from a state i

q0, qF                      a special start state and end (final) state which
                            are not associated with observations, together
                            with transition probabilities a01 a02 . . . a0n
                            out of the start state and a1F a2F . . . anF
                            into the end state
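These components can be written down directly as data. The sketch below encodes the ice-cream HMM used in the worked examples later in these slides; the end-transition value of 0.5 per state is an assumption carried over from those examples.

```python
# The ice-cream HMM, with parameters read off the worked examples in
# these slides (end factor of 0.5 per state assumed from those slides).
states = ["H", "C"]                       # Q: hidden weather states
start = {"H": 0.8, "C": 0.2}              # a_0j: out of the start state
end = {"H": 0.5, "C": 0.5}                # a_iF: into the end state
trans = {                                 # A: a_ij = P(state j | state i)
    "H": {"H": 0.7, "C": 0.3},
    "C": {"H": 0.4, "C": 0.6},
}
emit = {                                  # B: b_i(o) = P(o ice creams | i)
    "H": {1: 0.2, 2: 0.4, 3: 0.4},
    "C": {1: 0.5, 2: 0.4, 3: 0.1},
}

# Each emission row is a proper distribution over the observation vocabulary.
print(sum(emit["H"].values()), sum(emit["C"].values()))  # 1.0 1.0
```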
THE HIDDEN MARKOV MODEL CONTD….
THE HIDDEN MARKOV MODEL CONTD….
 For the HMM shown in previous slide (slide 12), there is a
(non-zero) probability of transitioning between any two
states. Such an HMM is called a fully-connected or ergodic
HMM.
 Sometimes, however, we have HMMs in which many of the
transitions between states have zero probability, for example
in left-to-right (also called Bakis) HMMs.
 In a Bakis HMM, state transitions proceed from left to right, as
shown in the figure on the next slide (slide 13): there are no
transitions going from a higher-numbered state to a
lower-numbered state.
THE HIDDEN MARKOV MODEL CONTD….
THE HIDDEN MARKOV MODEL CONTD….
 A first-order Hidden Markov Model is based on two simplifying
assumptions.
 First, as with a first-order Markov chain, the probability of a
particular state depends only on the previous state:
 Markov Assumption:
P(qi|q1 . . . qi−1) = P(qi|qi−1)
 Second, the probability of an output observation oi depends
only on the state that produced the observation, qi, and not on
any other states or any other observations:
 Output Independence Assumption:
P(oi|q1 . . . qi, . . . , qT , o1, . . . , oi, . . . , oT ) = P(oi|qi)
THE HIDDEN MARKOV MODEL CONTD….
 Hidden Markov Models are characterized by three
fundamental problems:
 Problem 1 (Computing Likelihood): Given an HMM λ = (A,
B) and an observation sequence O, determine the likelihood
P(O|λ).
 Problem 2 (Decoding): Given an observation sequence O and
an HMM λ = (A, B), discover the best hidden state sequence Q.
 Problem 3 (Learning): Given an observation sequence O and
the set of states in the HMM, learn the HMM parameters A and
B.
PROBLEM I: COMPUTING LIKELIHOOD
 The first problem is to compute the likelihood of a particular
observation sequence.
 For example, given the HMM in Slide 12, what is the
probability of the sequence 3 1 3?
 For a Markov chain, where the surface observations are the
same as the hidden events, we could compute the probability of
3 1 3 just by following the states labeled 3 1 3 and multiplying
the probabilities along the arcs.
 For a Hidden Markov Model, things are not so simple. We want
to determine the probability of an ice-cream observation
sequence like 3 1 3, but we don’t know what the hidden state
sequence is!
PROBLEM I: COMPUTING LIKELIHOOD CONTD….

 Since we don’t know what the hidden state (weather) sequence was, we
need to compute the probability of the ice-cream events 3 1 3 by summing
over all possible weather sequences, weighted by their probability.
 The joint probability of being in a particular weather sequence Q and
generating a particular sequence O of ice-cream events is given by:

P(O, Q) = P(O|Q) × P(Q) = ∏_{i=1}^{n} P(oi|qi) × ∏_{i=1}^{n} P(qi|qi−1)

 The computation of the joint probability of the ice-cream observation
3 1 3 and one possible hidden state sequence hot hot cold is as follows:

P(3 1 3, hot hot cold) = P(hot|start) × P(hot|hot) × P(cold|hot)
                         × P(3|hot) × P(1|hot) × P(3|cold)
PROBLEM I: COMPUTING LIKELIHOOD CONTD….

 So, we can compute the total probability of the observations just by
summing over all possible hidden state sequences:

P(O) = Σ_Q P(O, Q) = Σ_Q P(O|Q) P(Q)

 For our particular case, we would sum over the 8 three-event
sequences cold cold cold, cold cold hot, etc.:

P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot)
         + P(3 1 3, hot hot cold) + . . .

o For an HMM with N hidden states and an observation sequence of
T observations, there are N^T possible hidden sequences.
o For real tasks, where N and T are both large, N^T is a very large
number, so we cannot compute the total observation likelihood
by computing a separate observation likelihood for each hidden
state sequence and then summing them up.
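For a case this small the brute-force sum over all N^T = 2^3 = 8 hidden sequences can be written out directly. The sketch below uses the ice-cream HMM parameters from the worked example later in the deck (start probabilities 0.8/0.2; the end factor of 0.5 matches the termination step used there):

```python
# Brute-force P(O): enumerate all N^T hidden state sequences and sum
# their joint probabilities. Feasible only because N^T = 2^3 = 8 here.
from itertools import product

states = ["H", "C"]
start = {"H": 0.8, "C": 0.2}
end = {"H": 0.5, "C": 0.5}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def total_likelihood(obs):
    total = 0.0
    for q in product(states, repeat=len(obs)):   # all N^T sequences
        p = start[q[0]] * emit[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[q[t - 1]][q[t]] * emit[q[t]][obs[t]]
        total += p * end[q[-1]]                  # end-transition factor
    return total

print(total_likelihood([3, 1, 3]))  # ~0.013132, matching the forward result
```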
COMPUTING LIKELIHOOD: FORWARD ALGORITHM

 The forward algorithm is a kind of dynamic programming
algorithm, i.e., an algorithm that uses a table to store
intermediate values as it builds up the probability of the
observation sequence.
 The forward algorithm computes the observation probability
by summing over the probabilities of all possible hidden state
paths that could generate the observation sequence, but it does
so efficiently by implicitly folding each of these paths into a
single forward trellis.
COMPUTING LIKELIHOOD: FORWARD ALGORITHM

 Each cell of the forward algorithm trellis, αt(j), represents the
probability of being in state j after seeing the first t
observations, given the automaton λ.
 The value of each cell αt(j) is computed by summing over the
probabilities of every path that could lead us to this cell.
 Formally, each cell expresses the following probability:

αt(j) = P(o1, o2 . . . ot , qt = j | λ)

which is computed as:

αt(j) = Σ_{i=1}^{N} αt−1(i) aij bj(ot)
COMPUTING LIKELIHOOD: FORWARD ALGORITHM

 The three factors that are multiplied in Eq. 6.15 in extending
the previous paths to compute the forward probability at time t
are:
 αt−1(i): the previous forward path probability from the
previous time step
 aij: the transition probability from previous state qi to current
state qj
 bj(ot): the state observation likelihood of the observation
symbol ot given the current state j
COMPUTING LIKELIHOOD: FORWARD ALGORITHM

 The forward algorithm uses the following three steps:

1) Initialization:
α1(j) = a0j bj(o1);  1 ≤ j ≤ N

2) Recursion (since states 0 and F are non-emitting):
αt(j) = Σ_{i=1}^{N} αt−1(i) aij bj(ot);  1 ≤ j ≤ N, 1 < t ≤ T

3) Termination:
P(O|λ) = Σ_{i=1}^{N} αT(i) aiF
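The three steps above can be sketched directly in Python. The parameters are the ice-cream HMM from the worked example on the next slide (end factor 0.5 taken from that slide's termination step):

```python
# Forward algorithm: initialization, recursion, termination, in O(N^2 * T).
states = ["H", "C"]
start = {"H": 0.8, "C": 0.2}           # a_0j
end = {"H": 0.5, "C": 0.5}             # a_iF
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    # Initialization: alpha_1(j) = a_0j * b_j(o_1)
    alpha = [{j: start[j] * emit[j][obs[0]] for j in states}]
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * trans[i][j] * emit[j][obs[t]]
                   for i in states)
            for j in states
        })
    # Termination: P(O|lambda) = sum_i alpha_T(i) * a_iF
    return sum(alpha[-1][i] * end[i] for i in states)

print(forward([3, 1, 3]))  # ~0.013132
```

Unlike the brute-force sum over N^T sequences, each of the T columns here costs only N^2 multiplications.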
COMPUTING LIKELIHOOD: FORWARD ALGORITHM
COMPUTING LIKELIHOOD: FORWARD ALGORITHM
Observations: o1 = 3, o2 = 1, o3 = 3

t = 1:
α1(C) = P(C|start) × P(3|C) = 0.2 × 0.1 = 0.02
α1(H) = P(H|start) × P(3|H) = 0.8 × 0.4 = 0.32

t = 2:
α2(C) = α1(C) P(C|C) P(1|C) + α1(H) P(C|H) P(1|C)
      = 0.02 × 0.6 × 0.5 + 0.32 × 0.3 × 0.5 = 0.054
α2(H) = α1(C) P(H|C) P(1|H) + α1(H) P(H|H) P(1|H)
      = 0.02 × 0.4 × 0.2 + 0.32 × 0.7 × 0.2 = 0.0464

t = 3:
α3(C) = α2(C) P(C|C) P(3|C) + α2(H) P(C|H) P(3|C)
      = 0.054 × 0.6 × 0.1 + 0.0464 × 0.3 × 0.1 = 0.00463
α3(H) = α2(C) P(H|C) P(3|H) + α2(H) P(H|H) P(3|H)
      = 0.054 × 0.4 × 0.4 + 0.0464 × 0.7 × 0.4 = 0.02163

Termination:
P(O) = α3(C) P(end|C) + α3(H) P(end|H)
     = 0.00463 × 0.5 + 0.02163 × 0.5 = 0.01313
DECODING: THE VITERBI ALGORITHM
 For any model, such as an HMM, that contains hidden variables, the task of
determining which sequence of variables is the underlying source
of some sequence of observations is called the decoding task.
 In the ice cream domain, given a sequence of ice cream
observations 3 1 3 and an HMM, the task of the decoder is to
find the best hidden weather sequence.
 The most common decoding algorithm for HMMs is the Viterbi
algorithm. Like the forward algorithm, Viterbi is a kind of
dynamic programming, and makes use of a dynamic
programming trellis (network/grid).
 Each cell of the Viterbi trellis, vt(j), represents the probability that
the HMM is in state j after seeing the first t observations and
passing through the most probable state sequence q0, q1, ..., qt−1,
given the automaton λ.
DECODING: THE VITERBI ALGORITHM
 The value of each cell vt(j) is computed by recursively taking the most
probable path that could lead us to this cell. For a given state j at
time t, the value vt(j) is computed as:

vt(j) = max_{i=1}^{N} vt−1(i) aij bj(ot)

 The three factors that are multiplied in Eq. 6.20 for extending the
previous paths to compute the Viterbi probability at time t are:
 vt−1(i): the previous Viterbi path probability from the previous time
step
 aij: the transition probability from previous state i to current state j
 bj(ot): the state observation likelihood of the observation symbol ot
given the current state j
DECODING: THE VITERBI ALGORITHM

 The Viterbi algorithm is identical to the forward algorithm


except that it takes the max over the previous path
probabilities where the forward algorithm takes the sum.
 The Viterbi algorithm has one component that the forward
algorithm doesn’t have: backpointers.
 This is because while the forward algorithm needs to produce
an observation likelihood, the Viterbi algorithm must produce
a probability and also the most likely state sequence.
 We compute this best state sequence by keeping track of the
path of hidden states that led to each state, and then at the end
tracing back the best path to the beginning (the Viterbi
backtrace).
DECODING: THE VITERBI ALGORITHM
 A formal definition of the Viterbi recursion is as follows:

1) Initialization:
v1(j) = a0j bj(o1);  1 ≤ j ≤ N
bt1(j) = 0

2) Recursion (recall that states 0 and qF are non-emitting):
vt(j) = max_{i=1}^{N} vt−1(i) aij bj(ot);  1 ≤ j ≤ N, 1 < t ≤ T
btt(j) = argmax_{i=1}^{N} vt−1(i) aij bj(ot);  1 ≤ j ≤ N, 1 < t ≤ T

3) Termination:
The best score: max_{i=1}^{N} vT(i) × aiF
DECODING: THE VITERBI ALGORITHM
DECODING: THE VITERBI ALGORITHM
Observations: o1 = 3, o2 = 1, o3 = 3

t = 1:
v1(C) = P(C|start) × P(3|C) = 0.2 × 0.1 = 0.02
v1(H) = P(H|start) × P(3|H) = 0.8 × 0.4 = 0.32

t = 2:
v2(C) = max(v1(C) P(C|C) P(1|C), v1(H) P(C|H) P(1|C))
      = max(0.02 × 0.6 × 0.5, 0.32 × 0.3 × 0.5) = 0.048    (from H)
v2(H) = max(v1(C) P(H|C) P(1|H), v1(H) P(H|H) P(1|H))
      = max(0.02 × 0.4 × 0.2, 0.32 × 0.7 × 0.2) = 0.0448   (from H)

t = 3:
v3(C) = max(v2(C) P(C|C) P(3|C), v2(H) P(C|H) P(3|C))
      = max(0.048 × 0.6 × 0.1, 0.0448 × 0.3 × 0.1) = 0.00288   (from C)
v3(H) = max(v2(C) P(H|C) P(3|H), v2(H) P(H|H) P(3|H))
      = max(0.048 × 0.4 × 0.4, 0.0448 × 0.7 × 0.4) = 0.012544  (from H)

Termination:
best score = max(v3(C) P(end|C), v3(H) P(end|H))
           = max(0.00288 × 0.5, 0.012544 × 0.5)
           = max(0.00144, 0.006272) = 0.006272

The winning final state is H, whose backpointers lead to H at t = 2 and H at
t = 1, so the most probable hidden state sequence is Hot Hot Hot.
HMM-BASED PART-OF-SPEECH TAGGING
 A Hidden Markov Model can also be used to determine the part-
of-speech tag for every word in an input sentence.
 In a POS tagging system, the words of the sentence are the
observed sequence and the tags for the words are the hidden
states.
 Given the transition probabilities (probabilities of transitions
among part-of-speech tags) and emission probabilities (the probability
of a particular word being emitted by a part of speech), determining
the most probable tag sequence is a decoding problem that
can be solved using Viterbi decoding.
HMM-BASED PART-OF-SPEECH TAGGING CONTD...
 Suppose there are T words (the observed sequence) in the sentence and
N POS tags (the hidden states). The following three steps of Viterbi
decoding determine the most probable tag sequence and its probability:

1) Initialization:
v1(j) = a0j bj(o1);  1 ≤ j ≤ N
bt1(j) = 0

2) Recursion (recall that states 0 and qF are non-emitting):
vt(j) = max_{i=1}^{N} vt−1(i) aij bj(ot);  1 ≤ j ≤ N, 1 < t ≤ T
btt(j) = argmax_{i=1}^{N} vt−1(i) aij bj(ot);  1 ≤ j ≤ N, 1 < t ≤ T

3) Termination:
The best score: max_{i=1}^{N} vT(i) × aiF
HMM BASED POS TAGGING - EXAMPLE
 Consider the HMM defined by the following two tables. Table I gives the aij
probabilities, the transition probabilities between hidden states (i.e., part-of-
speech tags). Table II gives the bi(ot) probabilities, the observation
likelihoods of words given tags. The values are computed from the 87-tag Brown
corpus without smoothing.

VB TO NN PPSS
<s> .019 .0043 .041 .067
VB .0038 .035 .047 .0070
TO .83 0 .00047 0
NN .0040 .016 .087 .0045
PPSS .23 .00079 .0012 .00014

Table I (The symbol <s> is the start-of-sentence symbol, the rows are labeled with
conditioning event)
HMM BASED POS TAGGING - EXAMPLE
Table 2:

I want to race
VB 0 .0093 0 .00012
TO 0 0 .99 0
NN 0 .000054 0 .00057
PPSS .37 0 0 0

For the given HMM, compute the most probable tag sequence for the
sentence “I want to race”.
The tags VB, TO, NN, and PPSS represent verb (base form), infinitive marker
(to), common noun, and nominative pronoun, respectively.
t = 1 (I):
v1(VB)   = 0.019 × 0 = 0
v1(TO)   = 0.0043 × 0 = 0
v1(NN)   = 0.041 × 0 = 0
v1(PPSS) = 0.067 × 0.37 = 0.02479

t = 2 (want): only PPSS has a non-zero v1, so each max has one non-zero term
v2(VB)   = 0.02479 × 0.23 × 0.0093 ≈ 0.000053
v2(TO)   = 0.02479 × 0.00079 × 0 = 0
v2(NN)   = 0.02479 × 0.0012 × 0.000054 ≈ 1.6 × 10^−9
v2(PPSS) = 0.02479 × 0.00014 × 0 = 0

t = 3 (to):
v3(VB)   = max(0.000053 × 0.0038, 1.6×10^−9 × 0.0040) × 0 = 0
v3(TO)   = max(0.000053 × 0.035, 1.6×10^−9 × 0.016) × 0.99 ≈ 0.0000018
v3(NN)   = max(0.000053 × 0.047, 1.6×10^−9 × 0.087) × 0 = 0
v3(PPSS) = max(0.000053 × 0.0070, 1.6×10^−9 × 0.0045) × 0 = 0

t = 4 (race): only TO has a non-zero v3
v4(VB)   = 0.0000018 × 0.83 × 0.00012 ≈ 1.8 × 10^−10
v4(TO)   = 0.0000018 × 0 × 0 = 0
v4(NN)   = 0.0000018 × 0.00047 × 0.00057 ≈ 4.9 × 10^−13
v4(PPSS) = 0.0000018 × 0 × 0 = 0

Therefore, the tag sequence for the sentence I want to race is:
PPSS VB TO VB
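The example above can be checked mechanically with the same Viterbi routine, reading the probabilities straight out of Tables I and II (no end state is given in this example, so termination simply takes the max over the final column):

```python
# Viterbi POS tagging using Tables I and II of the example directly.
tags = ["VB", "TO", "NN", "PPSS"]
start = {"VB": 0.019, "TO": 0.0043, "NN": 0.041, "PPSS": 0.067}  # <s> row
trans = {  # Table I: trans[i][j] = P(tag j | tag i)
    "VB":   {"VB": 0.0038, "TO": 0.035,   "NN": 0.047,   "PPSS": 0.0070},
    "TO":   {"VB": 0.83,   "TO": 0.0,     "NN": 0.00047, "PPSS": 0.0},
    "NN":   {"VB": 0.0040, "TO": 0.016,   "NN": 0.087,   "PPSS": 0.0045},
    "PPSS": {"VB": 0.23,   "TO": 0.00079, "NN": 0.0012,  "PPSS": 0.00014},
}
emit = {  # Table II: emit[tag][word] = P(word | tag)
    "VB":   {"I": 0.0,  "want": 0.0093,   "to": 0.0,  "race": 0.00012},
    "TO":   {"I": 0.0,  "want": 0.0,      "to": 0.99, "race": 0.0},
    "NN":   {"I": 0.0,  "want": 0.000054, "to": 0.0,  "race": 0.00057},
    "PPSS": {"I": 0.37, "want": 0.0,      "to": 0.0,  "race": 0.0},
}

def viterbi_tags(words):
    v = [{j: start[j] * emit[j][words[0]] for j in tags}]
    back = [{}]
    for t in range(1, len(words)):
        v.append({})
        back.append({})
        for j in tags:
            best = max(tags, key=lambda i: v[t - 1][i] * trans[i][j])
            v[t][j] = v[t - 1][best] * trans[best][j] * emit[j][words[t]]
            back[t][j] = best
    last = max(tags, key=lambda i: v[-1][i])   # no end state in this example
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

print(viterbi_tags(["I", "want", "to", "race"]))  # ['PPSS', 'VB', 'TO', 'VB']
```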
