Profile HMM
Acknowledgements:
M.Sc. students Beatrice Miron, Oana Ratoi, Diana Popovici
1.
PLAN
1. From profiles to Profile HMMs
2. Setting the parameters of a profile HMM;
the optimal (MAP) model construction
3. Basic algorithms for profile HMMs
4. Profile HMM training from unaligned sequences:
Getting the model and the multiple alignment simultaneously
5. Profile HMM variants for non-global alignments
6. Weighting the training sequences
2.
Example

A multiple alignment of five toy DNA sequences. Columns marked X are treated as match columns, columns marked . as insertions:

    bat     A  G  -  -  -  C
    rat     A  -  A  G  -  C
    cat     A  G  -  A  A  -
    gnat    -  -  A  A  A  C
    goat    A  G  -  -  -  C
            X  X  .  .  .  X

Observed frequencies (out of 5 sequences) in the three match columns:

           1     2     3
    A     4/5    0     0
    C      0     0    4/5
    G      0    3/5    0
    T      0     0     0
    -     1/5   2/5   1/5
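The fractions in the table can be recomputed directly from the alignment. The short Python snippet below is not part of the slides; the data layout is an assumption made for the illustration. It prints the non-zero frequencies of each match column:

from fractions import Fraction

alignment = {
    "bat":  "AG---C",
    "rat":  "A-AG-C",
    "cat":  "AG-AA-",
    "gnat": "--AAAC",
    "goat": "AG---C",
}
match_columns = [0, 1, 5]          # the columns marked with X (0-based)

for state, col in enumerate(match_columns, start=1):
    symbols = [seq[col] for seq in alignment.values()]
    freqs = {s: Fraction(symbols.count(s), len(symbols)) for s in "ACGT-"}
    print(state, {s: str(f) for s, f in freqs.items() if f})
# -> 1 {'A': '4/5', '-': '1/5'}
# -> 2 {'G': '3/5', '-': '2/5'}
# -> 3 {'C': '4/5', '-': '1/5'}

Note that these are raw column frequencies, gaps included; the emission probabilities of the model are estimated from the residue counts alone, as discussed in the following slides.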
3.
Building up a solution
At first sight, not taking into account gaps:
[Figure: the ungapped model, a chain of match states from Begin through M_1, ..., M_j, ... to End.]
4.
[Figure: two Begin ... M_j ... End diagrams, extending the basic match-state chain.]
5.
[Figure: the model extended with insert states I_j alongside the match states (Begin, M_j, I_j, End).]
6.
Does it work?

The example alignment again, with the marked (X) columns assigned to match states 1, 2, 3:

    bat     A  G  -  -  -  C
    rat     A  -  A  G  -  C
    cat     A  G  -  A  A  -
    gnat    -  -  A  A  A  C
    goat    A  G  -  -  -  C
            X  X  .  .  .  X
            1  2           3

[Figure: the resulting profile HMM, Begin ... End, with the dotted columns handled by the insert state between match states 2 and 3.]
7.
[Figure: recalled from pairwise alignment: an HMM with states X and Y emitting residues with probabilities q_x and q_y, between Begin and End.]
8.
However, remember...
An example of the state assignments for global alignment using the affine gap model:
[Figure: an aligned sequence pair with its column-by-column state labels, including runs of I_x and I_y states.]
9.
Consequence
It shouldn't be difficult to rewrite the basic HMM algorithms for profile HMMs!
10.
e_{M_j}(a) = c_{ja} / Σ_{a'} c_{ja'}

where c_{ja} is the observed count of residue a in column j of the multiple alignment.
11.
[Figure: estimating the parameters of the profile HMM from the example alignment: Begin, the match states M_1, M_2, M_3 (with their insert and delete states), and End, annotated with the match emissions, the insert emissions, and the nine kinds of state transitions: M->M, M->D, M->I, I->M, I->D, I->I, D->M, D->D, D->I.]

With pseudocounts, the match emission probabilities become

    e_{M_j}(a) = (c_{ja} + A q_a) / (Σ_{a'} c_{ja'} + A)

where q_a is a background distribution over residues and A is the total weight given to the pseudocounts. For other solutions (e.g. Dirichlet mixtures, substitution matrix mixtures, estimation based on an ancestor), see Durbin et al.,
1998, Section 6.5.
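To make the pseudocount formula concrete, here is a small Python sketch (not from the slides) that recomputes the match-state emission probabilities for the toy alignment, using the simplest choice A q_a = 1 for every residue a (add-one smoothing); the data layout and the function name match_emissions are assumptions made for this illustration.

ALPHABET = "ACGT"
alignment = {                      # the same toy alignment as before
    "bat":  "AG---C",
    "rat":  "A-AG-C",
    "cat":  "AG-AA-",
    "gnat": "--AAAC",
    "goat": "AG---C",
}
match_columns = [0, 1, 5]          # the columns marked X (0-based)

def match_emissions(alignment, match_columns):
    """e_Mj(a) = (c_ja + 1) / (sum_a' c_ja' + |alphabet|): add-one pseudocounts."""
    result = []
    for col in match_columns:
        counts = {a: 1 for a in ALPHABET}          # the pseudocounts A*q_a = 1
        for seq in alignment.values():
            if seq[col] != "-":                    # gaps contribute no emission
                counts[seq[col]] += 1
        total = sum(counts.values())
        result.append({a: c / total for a, c in counts.items()})
    return result

for j, e in enumerate(match_emissions(alignment, match_columns), start=1):
    print(f"M{j}", {a: round(p, 3) for a, p in e.items()})
# -> M1 {'A': 0.625, 'C': 0.125, 'G': 0.125, 'T': 0.125}, and so on

With A = 0 this reduces to the maximum-likelihood estimate of the previous slide, which for this toy alignment is degenerate: each match column would give probability 1 to its single observed residue and 0 to all others.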
12.
13.
14.
15.
M_j — the log probability contribution of the match-state symbol emissions in column j;
L_{i,j} — the log probability contribution of the insert-state symbol emissions for columns i+1, ..., j-1 (for j - i > 1).

Traceback:
from j = σ_{L+1}, while j > 0:
    mark column j as a match column; set j = σ_j.

Complexity:
O(L) in memory and O(L^2) in time for an alignment of L columns...
with some care in implementation!

Note: λ is a penalty used to favour models with fewer match states. In Bayesian terms, λ is the log of the prior probability of marking each column. It implies a simple but adequate exponentially decreasing prior distribution over model lengths.
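These quantities enter the recursion of the MAP model construction of Durbin et al. (1998), S_j = max_{0 <= i < j} [ S_i + T_{i,j} + M_j + L_{i+1,j-1} + λ ], where T_{i,j} is the log probability contribution of the state transitions between marked columns i and j, and σ_j records the maximising i. The Python sketch below implements that dynamic program under the assumption that M, T and L have already been precomputed as arrays; the array layout and the function name are illustrative choices, not part of the lecture.

import math

def map_match_columns(M, T, I, lam, L):
    """Choose which of the L alignment columns become match states.
    M[j] (1..L): log prob contribution of match emissions in column j;
    T[i][j]: log prob contribution of the state transitions between marked
             columns i and j (0 and L+1 are dummy Begin/End columns);
    I[i][j]: log prob contribution of insert emissions in columns i+1..j-1;
    lam: the per-column penalty (log prior of marking a column)."""
    S = [-math.inf] * (L + 2)      # S[j]: best log prob given column j is marked
    sigma = [0] * (L + 2)          # traceback pointers
    S[0] = 0.0
    M = [0.0] + list(M) + [0.0]    # pad so that M[0] = M[L+1] = 0
    for j in range(1, L + 2):
        best_i, best = 0, -math.inf
        for i in range(j):
            cand = S[i] + T[i][j] + I[i][j]
            if cand > best:
                best_i, best = i, cand
        S[j] = best + M[j] + lam
        sigma[j] = best_i
    marked = []                     # traceback, as described on the slide
    j = sigma[L + 1]
    while j > 0:
        marked.append(j)
        j = sigma[j]
    return sorted(marked)
# O(L^2) time, as stated above; the O(L) memory bound requires computing T and I
# incrementally instead of storing them as full matrices.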
16.
17.
Recursion:

v_{M_j}(i) = e_{M_j}(x_i) max { v_{M_{j-1}}(i-1) a_{M_{j-1} M_j},
                                v_{I_{j-1}}(i-1) a_{I_{j-1} M_j},
                                v_{D_{j-1}}(i-1) a_{D_{j-1} M_j} }

v_{I_j}(i) = e_{I_j}(x_i) max { v_{M_j}(i-1) a_{M_j I_j},
                                v_{I_j}(i-1) a_{I_j I_j},
                                v_{D_j}(i-1) a_{D_j I_j} }

and similarly, with no emission factor, for the (silent) delete states:

v_{D_j}(i) = max { v_{M_{j-1}}(i) a_{M_{j-1} D_j},
                   v_{I_{j-1}}(i) a_{I_{j-1} D_j},
                   v_{D_{j-1}}(i) a_{D_{j-1} D_j} }

Termination:
the final score is v_{M_{L+1}}(n), calculated using the top recursion relation.
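Below is a minimal Python sketch of this Viterbi recursion (an illustration, not code from the lecture), working directly in probability space and treating Begin as M_0 and End as M_{L+1}; the dictionary-based encoding of the emission and transition parameters is an assumption made here. In practice the computation is carried out with log scores (cf. the next slide) to avoid numerical underflow.

def viterbi_profile(x, L, eM, eI, a):
    """Probability of the best Begin -> End path for sequence x through a
    profile HMM with L match states.
    eM[j][c]: match emissions for 1 <= j <= L (index 0 unused);
    eI[j][c]: insert emissions for 0 <= j <= L (I_j sits after M_j);
    a[((s, j), (t, k))]: transition probabilities, e.g. a[(('M', 1), ('M', 2))]."""
    n = len(x)

    def tr(s, j, t, k):
        # transition probability; 0 for transitions absent from the model
        return a.get(((s, j), (t, k)), 0.0)

    vM = [[0.0] * (n + 1) for _ in range(L + 1)]   # vM[j][i]; M_0 plays Begin
    vI = [[0.0] * (n + 1) for _ in range(L + 1)]
    vD = [[0.0] * (n + 1) for _ in range(L + 1)]   # D_0 unused
    vM[0][0] = 1.0
    for i in range(n + 1):
        c = x[i - 1] if i > 0 else None            # x_i in the slide's notation
        for j in range(L + 1):
            if j > 0:
                # silent delete state: stays at the same i as its predecessors
                vD[j][i] = max(vM[j-1][i] * tr('M', j-1, 'D', j),
                               vI[j-1][i] * tr('I', j-1, 'D', j),
                               vD[j-1][i] * tr('D', j-1, 'D', j))
            if i > 0:
                if j > 0:
                    vM[j][i] = eM[j][c] * max(vM[j-1][i-1] * tr('M', j-1, 'M', j),
                                              vI[j-1][i-1] * tr('I', j-1, 'M', j),
                                              vD[j-1][i-1] * tr('D', j-1, 'M', j))
                vI[j][i] = eI[j][c] * max(vM[j][i-1] * tr('M', j, 'I', j),
                                          vI[j][i-1] * tr('I', j, 'I', j),
                                          vD[j][i-1] * tr('D', j, 'I', j))
    # termination: step into End (= M_{L+1}), which emits nothing
    return max(vM[L][n] * tr('M', L, 'M', L+1),
               vI[L][n] * tr('I', L, 'M', L+1),
               vD[L][n] * tr('D', L, 'M', L+1))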
18.
Recursion: the same relations as on the previous slide, rewritten with log scores V instead of probabilities v.

Termination:
the final score is V_{M_{L+1}}(n), calculated using the top recursion relation.
19.
20.
Termination (of the forward algorithm):
f_{M_{L+1}}(n+1) = f_{M_L}(n) a_{M_L M_{L+1}} + f_{I_L}(n) a_{I_L M_{L+1}} + f_{D_L}(n) a_{D_L M_{L+1}}
21.
Recursion (of the backward algorithm):

b_{M_j}(i) = b_{M_{j+1}}(i+1) a_{M_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b_{I_j}(i+1) a_{M_j I_j} e_{I_j}(x_{i+1}) + b_{D_{j+1}}(i) a_{M_j D_{j+1}}

b_{I_j}(i) = b_{M_{j+1}}(i+1) a_{I_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b_{I_j}(i+1) a_{I_j I_j} e_{I_j}(x_{i+1}) + b_{D_{j+1}}(i) a_{I_j D_{j+1}}

b_{D_j}(i) = b_{M_{j+1}}(i+1) a_{D_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b_{I_j}(i+1) a_{D_j I_j} e_{I_j}(x_{i+1}) + b_{D_{j+1}}(i) a_{D_j D_{j+1}}
22.
23.
24.
25.
26.
Model surgery
After training a model, we can analyse the alignment it produces:

- From the counts estimated by the forward-backward procedure we can see how much a certain transition is used by the training sequences.
- The usage of a match state is the sum of the counts for all letters emitted in that state.
- If a certain match state is used by fewer than half of the given sequences, the corresponding module (triplet of match, insert, delete states) should be deleted.
- Similarly, if more than half (or some other predefined fraction) of the sequences use the transitions into a certain insert state, it should be expanded into some number of new modules (usually the average number of insertions).
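A toy sketch of this surgery step (illustrative only; the input format, names and thresholds are assumptions, not the lecture's procedure verbatim): given the expected usage counts collected by forward-backward training, it returns a list of proposed architecture edits.

def model_surgery(match_usage, insert_usage, avg_insert_len, n_seqs, frac=0.5):
    """match_usage[j-1]: expected number of sequences using match state j;
    insert_usage[j]: expected number of sequences entering insert state j;
    avg_insert_len[j]: average number of residues such a sequence emits there."""
    plan = []
    for j, used in enumerate(match_usage, start=1):
        if used < frac * n_seqs:               # under-used match state
            plan.append(("delete module", j))
    for j, used in enumerate(insert_usage):
        if used > frac * n_seqs:               # over-used insert state
            plan.append(("add modules after match", j, round(avg_insert_len[j])))
    return plan                                 # apply the edits, then re-train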
27.
Local multiple alignment

[Figure: the profile HMM variant for local multiple alignment (Begin ... End).]
28.
29.
For the rare case when the first residue might be missing:

[Figure: the corresponding variant of the model (Begin ... End).]
30.
[Figure: a further variant of the model (Begin ... End).]
31.
[Figure: a phylogenetic tree over four sequences, with leaf edges t_1 = 2, t_2 = 2, t_3 = 5, t_4 = 8 and internal edges t_5 = 3 (above node 5, which joins leaves 1 and 2) and t_6 = 3 (above node 6, which joins node 5 and leaf 3); node 7, the root, joins node 6 and leaf 4.]
32.
[Figure: the same tree viewed as an electrical network: the edge lengths act as resistances, V_5, V_6, V_7 are the voltages at the internal nodes, and the currents I_1, ..., I_4 through the leaf edges are taken as the sequence weights.]

Equations:
V_5 = 2 I_1 = 2 I_2
V_6 = 2 I_1 + 3 (I_1 + I_2) = 5 I_3
V_7 = 8 I_4 = 5 I_3 + 3 (I_1 + I_2 + I_3)

Result:
I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47
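These ratios can be checked by treating the tree as an electrical network, as the voltages and currents above suggest: a unit current enters at the root, and at each internal node it divides between the branches in inverse proportion to their total resistance down to the leaves. The sketch below reproduces I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47; the tree encoding and function names are assumptions made for this illustration.

def resistance(tree):
    """Effective resistance from this node down to the leaves."""
    if isinstance(tree, str):                    # a leaf: nothing below it
        return 0.0
    return 1.0 / sum(1.0 / (t + resistance(sub)) for t, sub in tree)

def leaf_currents(tree, current=1.0, weights=None):
    """Split the incoming current over the subtrees; leaves collect their share."""
    if weights is None:
        weights = {}
    if isinstance(tree, str):
        weights[tree] = current
        return weights
    conductances = [1.0 / (t + resistance(sub)) for t, sub in tree]
    total = sum(conductances)
    for (t, sub), g in zip(tree, conductances):
        leaf_currents(sub, current * g / total, weights)
    return weights

# the example tree as lists of (edge length, subtree) pairs
node5 = [(2, "1"), (2, "2")]
node6 = [(3, node5), (5, "3")]
root7 = [(3, node6), (8, "4")]
print(leaf_currents(root7))   # currents proportional to 20 : 20 : 32 : 47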
33.
Then, initialising each leaf weight with the length of its own edge and moving up the tree, at each internal node n the edge length t_n is shared among the leaves below n in proportion to their current weights:

    Δw_i = t_n w_i / Σ_{leaves k below n} w_k

[Figure: the same tree as before, with t_1 = 2, t_2 = 2, t_3 = 5, t_4 = 8, t_5 = 3, t_6 = 3.]

So, at node 5:
w_1 = w_2 = 2 + 3/2 = 3.5

At node 6:
w_1 = w_2 = 3.5 + 3 * 3.5/12,   w_3 = 5 + 3 * 5/12

Result:
w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64
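The same numbers can be reproduced with a short bottom-up pass over the tree; the encoding below (each node is a pair of its upward edge length and either a leaf name or a list of children) is an assumption made for the illustration.

def accumulate(node, weights):
    """Post-order pass; returns the leaves below node and updates their weights."""
    t, below = node
    if isinstance(below, str):            # a leaf: weight starts as its edge length
        weights[below] = float(t)
        return {below}
    leaves = set()
    for child in below:
        leaves |= accumulate(child, weights)
    total = sum(weights[k] for k in leaves)
    for k in leaves:                       # share this node's edge t_n among them
        weights[k] += t * weights[k] / total
    return leaves

weights = {}
node5 = (3, [(2, "1"), (2, "2")])
node6 = (3, [node5, (5, "3")])
root7 = (0, [node6, (8, "4")])            # the root has no edge above it
accumulate(root7, weights)
print(weights)   # {'1': 4.375, '2': 4.375, '3': 6.25, '4': 8.0}
# i.e. w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64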
34.
35.
Examples
(including the use of weights in the computation of the profile HMM parameters):