HIDDEN MARKOV MODELS (HMMS)
Prabhleen Juneja
Thapar Institute of Engineering & Technology
INTRODUCTION
Hidden Markov Models (HMMs) are sequence classifiers. A sequence
classifier or sequence labeler is a model whose job is to assign a label or
class to each unit in a sequence.
Given a sequence of units (words, letters, morphemes, sentences, whatever),
the job of an HMM is to compute a probability distribution over possible label
sequences and choose the best one.
The finite-state transducer for morphological analysis is a kind of non-
probabilistic sequence classifier, for example transducing from sequences of
words to sequences of morphemes.
HMMs extend this notion by being probabilistic sequence classifiers.
Sequence labeling tasks come up throughout speech and language
processing. For instance, in part-of-speech tagging, each word in a sequence
has to be assigned a part-of-speech tag.
Besides part-of-speech tagging, applications of these sequence models include
tasks like speech recognition, grapheme-to-phoneme conversion, named
entity recognition, and information extraction.
MARKOV CHAINS
The Hidden Markov Model is one of the most important machine
learning models in speech and language processing.
The Hidden Markov Model is a descendant of the Markov chain,
sometimes called the observed Markov model.
Markov chains and Hidden Markov Models are both extensions of
finite automata.
A Markov chain is a special case of a weighted automaton. A
weighted finite-state automaton is a simple augmentation of the finite
automaton in which each arc is associated with a probability, indicating
how likely that path is to be taken. The probability on all the arcs
leaving a node must sum to 1.
Because it cannot represent inherently ambiguous problems (being
finite-state), a Markov chain is only useful for assigning probabilities to
unambiguous sequences.
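As a small sketch of this idea, the snippet below scores a weather sequence with a Markov chain. The transition probabilities are illustrative values chosen for the example (they are not taken from any figure in these slides); note that each row sums to 1, as required for the arcs leaving a state.

```python
# Minimal Markov chain sketch with illustrative probabilities.
# States: HOT, COLD, RAINY; the "start" row is the initial distribution.
transitions = {
    "start": {"HOT": 0.5, "COLD": 0.3, "RAINY": 0.2},
    "HOT":   {"HOT": 0.6, "COLD": 0.2, "RAINY": 0.2},
    "COLD":  {"HOT": 0.3, "COLD": 0.5, "RAINY": 0.2},
    "RAINY": {"HOT": 0.2, "COLD": 0.3, "RAINY": 0.5},
}

def chain_probability(states):
    """P(q1...qn) = P(q1|start) * product of P(qi|qi-1)."""
    prob = 1.0
    prev = "start"
    for s in states:
        prob *= transitions[prev][s]
        prev = s
    return prob

print(chain_probability(["HOT", "HOT", "COLD"]))  # 0.5 * 0.6 * 0.2 = 0.06
```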
MARKOV CHAINS CONTD……
Fig. a shows a Markov chain for assigning a probability to a sequence of
weather events, where the vocabulary consists of HOT, COLD, and RAINY.
Fig. b shows another simple example of a Markov chain for assigning a
probability to a sequence of words w1...wn. This Markov chain in fact
represents a bigram language model.
MARKOV CHAINS CONTD……
A Markov chain is a kind of probabilistic graphical model: a way of
representing probabilistic assumptions in a graph. A Markov chain is specified
by the following components:
Q = q1 q2 . . . qN        a set of N states
A = a01 a02 . . . an1 . . . ann        a transition probability matrix A,
each aij representing the probability of moving from
state i to state j, such that

∑_{j=1}^{N} aij = 1 for all i
First, the probability of a particular state qi depends only on the
previous state (Markov Assumption):

P(qi|q1 . . . qi−1) = P(qi|qi−1)

Second, the probability of an output observation oi is dependent
only on the state that produced the observation, qi, and not on
any other states or any other observations
(Output Independence Assumption):

P(oi|q1 . . . qi, o1 . . . oi−1) = P(oi|qi)
Since we don’t know what the hidden state (weather) sequence was, we
need to compute the probability of the ice-cream observation sequence 3 1 3
by summing over all possible weather sequences, weighted by their probability.
The joint probability of being in a particular weather sequence Q and
generating a particular sequence O of ice-cream events is given by:
P(O, Q) = P(O|Q) × P(Q) = ∏_{i=1}^{n} P(oi|qi) × ∏_{i=1}^{n} P(qi|qi−1)
The computation of the joint probability of ice-cream observation 3 1 3
and one possible hidden state sequence hot hot cold is as follows:
P(3 1 3, hot hot cold) = P(hot|start) × P(hot|hot) × P(cold|hot)
× P(3|hot) × P(1|hot) × P(3|cold)
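This product can be checked in code. The parameter values below are the ones recoverable from the worked Viterbi numbers later in these slides (e.g. P(hot|start) = 0.8, P(cold|hot) = 0.3, P(3|cold) = 0.1); P(hot|hot) = 0.7 and P(1|hot) = 0.2 are inferred from those numbers, so treat the exact figures as assumptions.

```python
# Joint probability P(O, Q) = prod P(oi|qi) * prod P(qi|qi-1)
# for the ice-cream HMM. Parameter values inferred from the worked
# example on these slides (assumptions, not given explicitly).
A = {("start", "hot"): 0.8, ("start", "cold"): 0.2,
     ("hot", "hot"): 0.7, ("hot", "cold"): 0.3,
     ("cold", "hot"): 0.4, ("cold", "cold"): 0.6}
B = {("hot", 1): 0.2, ("hot", 3): 0.4,
     ("cold", 1): 0.5, ("cold", 3): 0.1}

def joint(obs, states):
    """Multiply one transition and one emission per time step."""
    prob = 1.0
    prev = "start"
    for o, q in zip(obs, states):
        prob *= A[(prev, q)] * B[(q, o)]
        prev = q
    return prob

p = joint([3, 1, 3], ["hot", "hot", "cold"])
print(p)  # 0.8*0.4 * 0.7*0.2 * 0.3*0.1 = 0.001344
```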
PROBLEM I: COMPUTING LIKELIHOOD CONTD….
αt(j) = P(o1, o2 . . . ot, qt = j | λ)
which is computed as:
αt(j) = ∑_{i=1}^{N} αt−1(i) aij bj(ot)
COMPUTING LIKELIHOOD: FORWARD ALGORITHM
1) Initialization:
α1(j) = a0j bj(o1),  1 ≤ j ≤ N
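The initialization together with the earlier recursion gives the full forward pass, sketched below under the same assumed ice-cream parameters (P(hot|hot) = 0.7 and P(1|hot) = 0.2 are inferred values, not given on these slides).

```python
# Forward algorithm sketch: alpha[t][j] = P(o1..ot, qt = j | lambda).
# Parameters follow the ice-cream example; some values are inferred
# from the worked numbers, so treat them as assumptions.
states = ["hot", "cold"]
pi = {"hot": 0.8, "cold": 0.2}                           # a0j
A = {"hot": {"hot": 0.7, "cold": 0.3},
     "cold": {"hot": 0.4, "cold": 0.6}}                  # aij
B = {"hot": {1: 0.2, 3: 0.4}, "cold": {1: 0.5, 3: 0.1}}  # bj(ot)

def forward(obs):
    # Initialization: alpha1(j) = a0j * bj(o1)
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    # Recursion: alphat(j) = sum_i alphat-1(i) * aij * bj(ot)
    for o in obs[1:]:
        alpha.append({j: sum(alpha[-1][i] * A[i][j] for i in states) * B[j][o]
                      for j in states})
    return alpha

alpha = forward([3, 1, 3])
# Total likelihood P(3 1 3): sum of the final forward column.
print(sum(alpha[-1].values()))
```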
vt(j) = max_{i=1}^{N} vt−1(i) aij bj(ot)
The three factors that are multiplied in the recursion above for extending
the previous paths to compute the Viterbi probability at time t are:
vt−1(i): the previous Viterbi path probability from the previous time
step
aij: the transition probability from previous state i to current state j
bj(ot): the state observation likelihood of the observation symbol ot
given the current state j
3) Termination:
The best score: max_{i=1}^{N} vT(i) × ai,F
DECODING: THE VITERBI ALGORITHM
Observations: O1 = 3, O2 = 1, O3 = 3

COLD row of the Viterbi trellis:
v1(C) = P(C|start) P(3|C) = 0.2 × 0.1 = 0.02
v2(C) = max(v1(C) P(C|C) P(1|C), v1(H) P(C|H) P(1|C))
      = max(0.02 × 0.6 × 0.5, 0.32 × 0.3 × 0.5) = 0.048
v3(C) = max(v2(C) P(C|C) P(3|C), v2(H) P(C|H) P(3|C))
      = max(0.048 × 0.6 × 0.1, 0.0448 × 0.3 × 0.1) = 0.00288
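The trellis values above can be reproduced with a short Viterbi sketch. As before, the parameters are those recoverable from the worked numbers; P(hot|hot) = 0.7 and P(1|hot) = 0.2 are inferred from v2(H) = 0.0448, so treat them as assumptions.

```python
# Viterbi sketch: vt(j) = max_i vt-1(i) * aij * bj(ot), with backpointers.
states = ["hot", "cold"]
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.7, "cold": 0.3},
     "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {1: 0.2, 3: 0.4}, "cold": {1: 0.5, 3: 0.1}}

def viterbi(obs):
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    back = []
    for o in obs[1:]:
        col, bp = {}, {}
        for j in states:
            best = max(states, key=lambda i: v[-1][i] * A[i][j])
            bp[j] = best
            col[j] = v[-1][best] * A[best][j] * B[j][o]
        v.append(col)
        back.append(bp)
    # Backtrace from the best final state.
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return v, list(reversed(path))

v, path = viterbi([3, 1, 3])
print(round(v[2]["cold"], 5))  # 0.00288, matching the trellis
```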
HMM BASED POS TAGGING - EXAMPLE
Consider the HMM defined by the following two tables. Table I gives the aij
probabilities, the transition probabilities between hidden states (i.e. part-of-
speech tags). Table II gives the bi(ot) probabilities, the observation
likelihoods of words given tags. The values are computed from the 87-tag Brown
corpus without smoothing.
        VB      TO      NN      PPSS
<s>     .019    .0043   .041    .067
VB      .0038   .035    .047    .0070
TO      .83     0       .00047  0
NN      .0040   .016    .087    .0045
PPSS    .23     .00079  .0012   .00014
Table I (The symbol <s> is the start-of-sentence symbol; rows are labeled
with the conditioning event.)
HMM BASED POS TAGGING - EXAMPLE
Table II:
        I       want    to      race
VB      0       .0093   0       .00012
TO      0       0       .99     0
NN      0       .000054 0       .00057
PPSS    .37     0       0       0
For the given HMM, compute the most probable tag sequence for the
sentence "I want to race".
The tags VB, TO, NN, PPSS represent verb (base form), infinitive marker
(to), common noun, and nominative personal pronoun respectively.
Observations: O1 = I, O2 = want, O3 = to, O4 = race
Therefore, the tag sequence for the sentence I want to race is:
PPSS VB TO VB
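This answer can be verified by running Viterbi over Tables I and II. The sketch below simply copies the probabilities from the two tables above:

```python
# Viterbi over the POS-tagging HMM given by Tables I and II.
tags = ["VB", "TO", "NN", "PPSS"]
pi = {"VB": .019, "TO": .0043, "NN": .041, "PPSS": .067}       # P(tag|<s>)
A = {"VB":   {"VB": .0038, "TO": .035,   "NN": .047,   "PPSS": .0070},
     "TO":   {"VB": .83,   "TO": 0,      "NN": .00047, "PPSS": 0},
     "NN":   {"VB": .0040, "TO": .016,   "NN": .087,   "PPSS": .0045},
     "PPSS": {"VB": .23,   "TO": .00079, "NN": .0012,  "PPSS": .00014}}
B = {"VB":   {"I": 0,   "want": .0093,   "to": 0,   "race": .00012},
     "TO":   {"I": 0,   "want": 0,       "to": .99, "race": 0},
     "NN":   {"I": 0,   "want": .000054, "to": 0,   "race": .00057},
     "PPSS": {"I": .37, "want": 0,       "to": 0,   "race": 0}}

def viterbi(words):
    v = [{t: pi[t] * B[t][words[0]] for t in tags}]
    back = []
    for w in words[1:]:
        col, bp = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: v[-1][s] * A[s][t])
            bp[t] = prev
            col[t] = v[-1][prev] * A[prev][t] * B[t][w]
        v.append(col)
        back.append(bp)
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["I", "want", "to", "race"]))  # ['PPSS', 'VB', 'TO', 'VB']
```

The zero entries in the tables prune most paths immediately: "I" can only be PPSS, and "to" can only be TO, so the search effectively reduces to choosing tags for "want" and "race".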