3S03 OnLineText
Elementary Sequence Analysis
Department of Biology
McMaster University
Hamilton, Ontario
L8S 4K1
edited by Brian Golding, Dick Morton and Wilfried Haerty, August 2017
These notes are in Adobe Acrobat format (they are available upon request in other formats) and they can be obtained from
the website http://helix.biology.mcmaster.ca/courses.html. Some of the programs that you will be using in this course and
which will be run locally can be found at http://evol.mcmaster.ca/p3S03.html.
The “blue text” designates links within this document while the “red text” designates links outside of this document.
Clicking on the latter should activate your web browser and load the appropriate page into your browser. If these do not
work please check your Acrobat reader setup. The web links are accurate to the best of our knowledge but the web changes
quickly and we cannot guarantee that they are still accurate. The links designated next to the JAVA logo require that
JAVA be installed on your computer.
These notes are used in Biology 3S03. The purpose of this course is to introduce students to the basics of bioinformatics and to give them
the opportunity to learn to manipulate and analyze DNA/protein sequences. Of necessity, only some of the simpler algorithms will
be examined.
The course will hopefully cover . . .
The formal part of the course will consist of two approximately one-hour lectures each week. Weekly assignments will be provided
to practice and explore the lecture material. In addition there will be an optional tutorial to help students with these assignments or other
problems. These assignments will be 40% of your grade and three in-class quizzes will make up the remainder.
1 Preliminaries
  1.1 Resources
    1.1.1 Electronic Resources
    1.1.2 Textbooks
    1.1.3 Journal sources
  1.2 Biological preliminaries
    1.2.1 Some notes on terminology
    1.2.2 Letter Codes for Sequences
3 Genomics
  3.1 Where the data comes from
  3.2 How DNA is sequenced
4 Databases
  4.1 Introduction
  4.2 N.C.B.I.
  4.3 E.M.B.L.
  4.4 D.D.B.J.
  4.5 SwissProt
  4.6 Organization of the entries
  4.7 Other Major Databases
  4.8 Remote Database Entry retrieval
    4.8.1 Entrez
    4.8.2 NCBI retrieve
    4.8.3 EMBL get
    4.8.4 Others
  4.9 Reliability
Chapter 1
Preliminaries
1.1 Resources
There are many resources that one can make use of to study bioinformatics and they are becoming increasingly available
to the general public. These notes are my attempt at a small contribution toward this growing body of ‘on-line’ literature,
software, data and knowledge.
Please note that bioinformatics is inherently a multi-disciplinary field making use of biological, mathematical, statistical
and computer science knowledge. As such any resources available for any of these disciplines will be of use in
bioinformatics. The more skilled you are in any one of these areas, the better off you will be. But you should have a basic
minimum knowledge from each of these fields to study bioinformatics. There is a growing body of information available
that is specific to bioinformatics.
1.1.1 Electronic Resources
You should be aware that there are many other valuable online resources that are available to you. As these come and go too
quickly over time, I have stopped listing them. You can find some from lists at bioinformatics.org or bioinformatics.ca.
There are many software packages that provide you with access to a collection of programs that deal with bioinformatics.
For example, if you have cash, the famous MatLab software suite provides a toolbox for bioinformatics. For those with
less cash, there are interesting projects – BioLinux, BioKnoppix (apparently discontinued, but last updated June
2013), Vigyaan – that provide you with a bootable CD image. Simply burn the CD (it is free) and then boot from the
CD. This provides a free computer system with lots of bioinformatic, biomolecular software at your fingertips (nothing to
install, nothing to change on your computer; simply remove the CD and reboot when done). There are many other software
sources that will be explored in this course (and provided through the links of these notes). For our purposes some of the
software that will be discussed below has been provided for you at http://evol.mcmaster.ca/p3S03.html. There are other
sites that provide, as a free service, servers on which to run programs. For example check out Mobyle.
1.1.2 Textbooks
There are now an enormous number of books available that deal with sequence analysis and bioinformatics in biology. A
selection of just a few that have been published in the last few years follows,
and there is even now a book from the popular “dummy” series.
In addition to these there are many texts on evolution, on DNA and on proteins that have useful chapters and sections on
sequence analysis.
1.1.3 Journal sources
There are also many journals that regularly publish work in bioinformatics, including
• Briefings in Bioinformatics
• BMC Bioinformatics
• Genome Biology
• Genome Research
• Genomics
• In Silico Biology
• J. Mathematical Biology
• Mathematical Biosciences
• Systematic Biology
and in the medical sciences there are many (!) more, including
1.2 Biological preliminaries
1.2.1 Some notes on terminology
There are some terms that will be used here that are commonly abused. Unfortunately, I too will use some terms that are
not precise so you should be aware of the proper definitions (the following are modified from Futuyma 1986, Evolutionary
Biology, Sinauer Assoc.).
Homology Contrary to some statements in other bioinformatic texts, homology and similarity are not the same thing.
Traits from two different species or taxa are said to be similar if they have some resemblance to one another.
Homology means a great deal more. Two traits from different species or taxa are homologous if they are derived
(with or without modification) from a common ancestor.
In general when working with sequences, one assumes homology if one finds excessive similarity between the two
sequences. However, you should be aware that this is an inference that should be consciously made.
Example: The traditional example is that of the wings of birds and bats. Their wings are similar in that they enable
flight, have the same name and have similar aerodynamic constraints but they are not homologous. They are not
homologous because the common ancestor of both birds and bats did not have wings, rather wings evolved within
each group separately.
Mutations A mutation is an error in the replication of a nucleotide sequence. It may encompass one or many nucleotides
and in complicated situations may involve disjoint nucleotides. They can be caused by internal errors of metabolism
or by external agents such as radiation.
Substitutions Mutations are not substitutions. Substitutions are differences in two sequences (generally the descendant
from the ancestral) caused originally by mutations but which have been acted on by selection.
Example: Because substitutions have been exposed to selection, the frequency of occurrence of individual substitu-
tions and mutations are quite different. In general substitutions at the second position of a codon are (almost always)
much less frequent than those in the third codon position. This is because a change at the second codon position will
alter the amino acid encoded but this is not always the case for changes at the third codon position. By contrast, we
expect mutations to occur equally frequently at each of the codon positions.
Replacements The term replacement is suggested for use when describing differences that are observed between amino
acid sequences.
1.2.2 Letter Codes for Sequences
To store a large amount of data on a computer it would be quite inefficient to store the amino acids as “Glutamic acid” or to
store ambiguous nucleotides as “A or G”. For this reason there are standard codes to represent amino acids and nucleotides.
Both of these are one letter codes and can be stored on electronic media with reasonable efficiency.
Amino acids have in the past often been designated by a three letter code. This three letter code is not suitable for electronic
media and is now largely obsolete. The standard one letter amino acid codes are shown in Table 1.1. Also commonly in
use are B to represent either Aspartic acid or Asparagine and Z to represent either Glutamic acid or Glutamine.
There are also standard one letter codes to represent nucleotides. While most people are familiar with the simple codes of
A, C, G, T, and U there are more extensive codes to include ambiguities in the nucleotides. The extended one letter code
for nucleotides is given in Table 1.2. The complete generality of this code is seldom used. More common is the use of only
part of the extended code, a small subset of which is shown below the table.
Table 1.2: The extended one letter code for nucleotides.

Meaning               Bases              Code   Binary code
Adenine               A                  A      10001000
Guanine               G                  G      01001000
Cytosine              C                  C      00101000
Thymine               T                  T      00011000
Purine                G or A             R      11000000
Pyrimidine            T or C             Y      00110000
Amino                 A or C             M      10100000
Keto                  G or T             K      01010000
Strong (3 H-bonds)    G or C             S      01100000
Weak (2 H-bonds)      A or T             W      10010000
Not G                 A or C or T        H      10110000
Not A                 G or T or C        B      01110000
Not T                 G or C or A        V      11100000
Not C                 G or A or T        D      11010000
Any                   G or C or T or A   N      11110000
Gap                   -                  -      00000100
Unknown               ?                  X      00000010
R A or G
Y T or C
N A,T,C or G
X unknown
Some programs prefer to store RNA codes rather than DNA codes. In general T and U can often be taken as synonyms.
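For example, since T and U can be treated as synonyms, converting an RNA sequence into its DNA equivalent is a one-line job with the standard tr utility (the sequence here is just an illustration):

echo 'AUGGCCUAA' | tr 'Uu' 'Tt'

This prints ATGGCCTAA.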
Chapter 2
Bioinformatics is about the manipulation of biological data and how to turn that data stream into biological knowledge.
Due to the large amounts of data, this cannot be done by hand, but fortunately the explosion of biological data streams in the
last decades has been matched by a revolution in computer technology. In this section I give the briefest of introductions
to the UNIX operating system. Part of the philosophy of the UNIX system is to provide the user with basic tools that
accomplish one abstracted and atomized task well rather than providing a polished piece of software that accomplishes a
single end result. In research, it is useful to have tools that can be combined in order to construct a new result rather
than a software suite that accomplishes only what its original designer envisioned. As a result, UNIX is an operating
system that is said to “wear well”, meaning that the more you learn the more you can do, the more you can accomplish and
the more there is to learn.
My purpose in this chapter is to quickly introduce a novice to UNIX computers to such a level that they can perform useful
work. For this course ‘work’ is being defined as an ability to enter, to search out, but mostly to manipulate and to analyze
sequences and produce information that is biologically relevant.
In my opinion, however, by learning with the graphical interface you lose much of the power of UNIX because, again, in
that case the interface to the operating system does it all for you. You don’t learn how to do it yourself and when you want
to do something more than what the interface offers — how would one do this? Is it even possible? The answer is, of
course, yes and that is why I will have you painfully suffer through a command line interface.
In UNIX the operating system is designed to give you tools and make it “easy” for you to design your own tools to do any
desired job. This means that you must learn something about the operating system and the many tools that are available. In
addition knowing how to use just one tool is seldom sufficient to accomplish complicated tasks. Here I can provide only a
brief introduction to the most important concepts and a few commands to get you started.
When you find my ramblings insufficient you can find more information in any of a thousand books on UNIX. One
I can recommend for beginners is “UNIX for the Impatient” by P.W. Abrahams and B.R. Larson, 1992/1997 (Addison
Wesley). There are also many introductory packages available ‘on line’. You might wish to explore Edinburgh’s UNIX
help for beginners or CERN’s UNIX users guide, UNIX Survival, or UNIX Resources.
UNIX based workstations are built by several companies including APPLE, IBM, Hewlett Packard, Sun Microsystems,
Silicon Graphics, and others. In addition there are free versions of UNIX available for a large variety of computers.
Each of these has a slightly different flavour of UNIX but the minor subset of commands that we will entertain here is
constant across platforms. UNIX is an old operating system (dating from approximately 1969) with many capabilities. Along with
these capabilities come idiosyncrasies that are often historical.
Whenever you attempt to access a UNIX computer you will be prompted with a request for your user identification
(userID) and your password. UNIX has, from the start, been a multiuser computer system and it uses the userID to keep
individual users separate and the passwords to provide a first level of security. The prompt to sign on will usually be
a request for USERID:, LOGIN:, USERNAME:, and so on. Your password will not be echoed back to the screen as you
type it, for obvious security reasons. Passwords should be eight or more characters long, should include some numbers or
symbols, and should never be a word that is found in a searchable dictionary or database.
Upon successful access, the computer may display what kind of computer you are on, it may display when and where you
last logged in from, and usually will check if you have any new mail. It will then present you with a prompt and await
commands. The prompt is customizable and can include such things as the computer’s name, command numbers, date,
and so on.
To exit the computer simply type exit from the prompt. Again this is customizable to be anything you desire.
File Structure:
Files are organized hierarchically. At the base of the file system is a root directory simply referenced by ‘/’. Underneath
this will be other files that can be of several different types. Under each file that is identified as a “directory”, other
subdirectories are possible. Different levels of directories, subdirectories, sub-subdirectories, etc. are separated by a ‘/’.
The organization of files on a UNIX system is somewhat standardized but each vendor will do it with their own unique
variations. On a typical LINUX machine the top directory contains the files . . .
bin
boot
dev
etc
home        - where your personal files will be located
lib
lost+found
misc
mnt         - traditionally where filesystems for diskettes, cdroms, flash disks etc. will be placed
net
opt
proc
root
sbin
tmp         - files needed momentarily by the system or programs
usr         - under this subdirectory system-wide programs unique to each installation are normally placed
var
All of these files are subdirectories (often also called folders). Most of these subdirectories contain files that are used by
the operating system and most of these are files that you should never have to be concerned with. The user directories
(where you can put your files) are generally placed under the subdirectory /home. So your home directory and where you
will automatically be placed when you enter the machine will be /home/yourid. This is the standard location for all
LINUX computers. On other machines it may be somewhere else. For example, on a Silicon Graphics computer home
directories are usually /usr/people/yourid.
The other important directory that you should know about is /usr/local. This is the location where many files unique
to a particular machine are normally placed. For example, in this location my machine has special files to do sequence
analysis, phylogenetic analysis, etc. that are useful to every user. So rather than locating them in just one user’s files, i.e.
/home/yourid, they are stored in a central location that all can access — /usr/local. You may want to look there
to see some of the programs installed on your local machine. The binaries (the actual executable files) for many of these
programs are often stored in /usr/local/bin.
File Names:
Information is stored in separate files each with unique identification names. Names and extensions are arbitrary and have
no reasonable limit to length or characters. The extension may be used to indicate the type of information contained in
the file. Although not a requirement, a fortran file will generally have an ‘.f’ extension, a pascal file will have a ‘.p’
extension, and so on. Any character is permitted and even the period ‘.’ is just another character in most file names and is
not treated in any special way. Hence a filename such as “test.dat.obj” is quite acceptable. You can have a blank space as
part of a filename as in “test dat obj” but this becomes quite confusing for some programs (and people) and hence is not a
recommended practice.
The full name of a file will be something like /usr/local/test.f. Note that a forward slash is used to designate
different subdirectories. Here usr is a primary (root or top) level subdirectory, local is a subdirectory within this and
test.f would be a file in the local subdirectory. There are no different version numbers of a file; if a backup copy
is desired it must have a different name.
UNIX operating systems have always been case sensitive. The file /usr/local/Test.f is therefore a different file
from that listed above. The difference is that a ‘T’ and a ‘t’ are treated as entirely different characters. This applies not
only to file names but also to commands. In general, the default will be lower case. However, if some book or example
indicates that one or more letters are upper case, then this case must be copied exactly to achieve the desired result.
File Types:
There are several types of files. The most common types are textual, binary, directory, and symbolically linked files. The
latter are files that simply point to other files. There are other types but these are the ones most commonly encountered.
Paths:
Individual files can be specified in several different ways. An absolute name can be given such as /usr/bin/ls. Or a
relative address could be used such as ../../ls which means to go up two subdirectory levels and find a file/program
called ls. If you simply type ls the computer will search for a file/program called ls in the current directory and, if it is not
found, will search through a specified set of directories (you can change this list).
Your Files:
You can create files in your own home directory, but you (usually) cannot create files in another person's home directory.
This is dependent on the extent of the privileges you have been given. A ‘long’ listing of a file’s characteristics would look
something like . . .
-rwxr-xr-x 1 brian user 133 Sep 13 1996 fnd
-rw-r--r-- 1 brian user 2564 Dec 23 15:04 fumarate.pdb
drwxr-xr-x 4 brian user 2560 Nov 26 13:55 gde96
-rw-r--r-- 1 brian user 13181 Sep 15 1996 genodb.html
drwxr-xr-x 2 brian user 512 Nov 12 14:14 genome.sites
-rw------- 1 brian user 3797 Jul 15 1996 hummingbird.tech
The first letter (here either d or -) indicates the type of file (here subdirectories or textual/binary files), the next three letters
give the permissions that the file’s owner has to read, write or execute that file (in the case of a directory, execute is the
ability to enter or view the subdirectory). The next three letters give the read, write, and execute permissions of anyone in the
same user group, and the next three letters give the permissions of anyone on the computer. So in this example only the
owner is given permission to change these files, but anybody can read/enter all files/directories except for the last one. The
next number gives the number of links this file has. This is followed by the owner of these files, brian, and the group to
which this user belongs, user. The size of each file, the date of last modification and the name of the file are shown.
Historically UNIX has been a very open system and hence the default permissions are such that any user can usually
read the files of all other users. In the past this has created some security problems but these particular problems are
comparatively easy to fix. You have the ability to change this default behaviour or to change the permissions of any file.
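For example, to remove the read permission for both the group and all other users on one of your files (the file name here is a placeholder), you could type

chmod go-r myfile

and a subsequent long listing of a file that started as -rw-r--r-- would then show -rw-------.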
What’s in a name:
Since I recommend learning the command line, it might terrify people that they are now expected to type out in full
(and since these are computers, to type without error) a long file name that might be full of jargon. This is not necessary.
Filename completion is a feature that most programs offer. For the command line this is a feature that is accomplished by a
program called the “shell” (to be further discussed below). The shell on most UNIX machines usually sets up the ‘tab’ key
to try to complete the filename and, if it is not unique, to finish it as far as possible and then present the remaining possibilities.
Using this feature it is seldom necessary to type more than 2 – 3 characters of the filename.
In addition there are other ways to play with file names. An asterisk is used to match a sequence of zero or more characters.
So a filename of the form
“a*.b”
will match all filenames that begin with ‘a’ and end with ‘.b’. A ‘?’ denotes a single character but not two characters,
nor zero characters. A collection or a range of characters can be matched by enclosing the range in square brackets. So a
filename of the form
“numbered files.[1-9]”
will match the files “numbered files.1, numbered files.2, . . . , numbered files.9”. The construct ‘[fa2F-Z]’ would match the
listed characters, ‘f’, ‘a’, ‘2’, as well as any character from ‘F’ to ‘Z’ (remember, case sensitive). Or, if you have a contrary
personality, the construct ‘[^fa2F-Z]’ would match anything but this set of characters. Finally a selection of names can be
obtained by enclosing the selection in curly braces. The construct
‘afile.{log,aux}’
will match the files “afile.log” and “afile.aux”.
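To see these patterns in action, a short session might look like the following (the file names are invented for illustration):

touch afile.log afile.aux afile.dvi
ls afile.{log,aux}

Here the ls command lists only afile.log and afile.aux and leaves afile.dvi unmatched.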
2.1.3 Commands
Commands have a general structure that consists of the command name, followed by arguments. If the argument begins
with a ‘-’ it is called a flag and is used to modify the behaviour of the command. Most commonly used commands are
placed in subdirectories that are searched by default and hence their complete location does not have to be specified. A
typical program is run by typing

cmd -flag argument

The name of the program is cmd and must be somewhere that the system normally looks for program names or alternatively
the path must be specified. The flags (which may or may not be present) will alter the behaviour of the program while the
arguments (which may or may not be present) might include files of information to be read from or written to by the
program. A few of the most basic, general commands follow.

The top ten commands

ls            list file names
pwd           show present working directory
cd            change directory
mv            move/rename a file
rm            remove a file
cp            copy a file
cat           show file content
more          show partial file contents
mkdir/rmdir   make/remove directory
man           show manual pages
ls - list files:
This command will give a listing of files in the current directory or in the directory supplied as an argument. As mentioned
above, to find particular files an asterisk acts as a wild card. A

ls a*.f

will list all files that begin with ‘a’ and end with ‘.f’. A command such as

ls -l

will give a ‘long’ listing of the files (such as the one shown earlier), while

ls -t

will sort the listing according to date, and so on. There are many other flags and the flags can be combined to achieve many
different responses from the same command.
cd - change directory:
Change to a subdirectory given as an argument. If no argument is given it will change to your ‘home’ directory. A command
cd ..
will move up one directory. A tilde is a useful character to identify home directories. The command

cd ∼brian

will change to the home directory of the user brian, while

more ∼/filename

will display a file from your own home directory no matter where you currently are.

mv - move files:
This command will move a file to a new location. For example,

mv fnd ..

will delete the file fnd from the current directory and place it one level higher. This command can also be used to rename
files -

mv fnd dnf

rm - remove files:
This command will remove a file. For example,

rm fnd

will simply delete the file. Be very careful with the use of asterisks and the rm command, since

rm *

will delete every file in the current directory (and there is no way to undelete them).
cp - copy files:
This command will copy files. It requires two arguments and the first named file is copied to the second file. If the second
file is a directory the file is copied into the named directory with the same filename. E.g.
cp fnd dnf
cp fnd gde96/
more - view files:
This command will display the contents of a file one screen at a time. The next screen is obtained by pressing
the space bar. A single line is advanced by an ‘enter’ key or n lines by typing the number n followed by the space bar. For
simple text files, the ‘b’ key will move backwards a screen.
lpr - print:
The lpr command will send a file (argument) to a printer. In general most UNIX machines are set up for postscript but
individual printers can be set to accept other types of input. Indeed many modern printers will switch ‘on the fly’ to match
the input they are receiving. Postscript is a graphics language that describes the structure of a figure (e.g. a circle of width x)
rather than individual points (e.g. actual bit mapped points). In this way it is independent of the resolution of any viewer,
simply providing instructions on how to display a figure at the viewer’s maximum resolution. You can recognize if a file
is postscript by either the filename extension (*.ps, *.eps or rarely, *.epsi) or by its contents. A postscript file will
usually begin with something like
%!PS-Adobe-2.0
%%Creator: WiX PSCRIPT
%%Title: ramermap1.cdr FROM CorelDRAW!
statusdict begin 0 setjobtimeout end
statusdict begin statusdict /jobname (ramermap1.cdr FROM CorelDRAW!) put end
{}stopped pop
{statusdict /lettertray get exec
....
If the file is not in postscript and the default printer on your system is an old printer that expects postscript, then you must
translate the file first. A common (and free) program to do this is a2ps. This program takes a file (supplied as an argument)
and changes it to postscript (along with many other abilities), and then automatically pipes it to lpr.
For both lpr and a2ps (and many other programs) the -Pprintername flag will direct the output to the particular printer
chosen. For example, the command

a2ps -Pps filename

will translate the file filename to postscript and pipe the output to the printer named ps.
2.1.4 Help
All of the above commands have many other abilities. To find out about these abilities there are manual pages stored on
the computer. Typing
man cat
will generate a manual page that describes this command and then passes this page to the more viewer.
If you have a graphic interface you can also run xman which has more capabilities. There is a move afoot to replace the
man programs with a similar but more advanced program called info (but I still prefer the old system). You will also often
find files under the directory /usr/doc or /usr/share/doc (along with the directory /usr/share/doc/HOWTO
which is particularly useful for beginners).
In addition you can search an index of manual pages with man -k word. All manual pages that are considered relevant
to word will be listed. If you want to know more about the man command, type
man man
(of course).
2.1.5 Redirection
To UNIX, the keyboard and the screen are just different types of input/output streams. If desired you can redefine these.
For example, you can redirect output of a command such that it is not put onto the screen but rather put into a file. For
example the command
ls -aF > myfiles

instructs the computer to put the output of the command into a new file to be called myfiles. So, for example,

cat fnd > dnf

accomplishes the same result as the command cp fnd dnf. In general, a ‘>’ redirects the output of a command into a
file, while a ‘<’ takes the input of a command from a file.
You can also specify a “pipe”, symbolized by ‘|’, which will take the output of one command and use it as input for another
command. It is kind of like a ‘>’ immediately followed by a ‘<’, but without the intermediate file. For example,

ls /usr/lib | more

This will take the file listing of subdirectory /usr/lib and give that information to the more command.
2.1.6 Shells
A shell is a command interpreter that will be run on all UNIX computers. You can think of the shell as a layer of software
through which all of your commands are passed before being processed. Again there are many different shells. The popular
ones are the ‘sh’ Bourne shell, ‘bash’ Bourne again shell, ‘ksh’ Korn shell, ‘csh’ C shell, and the ‘tcsh’ shell. The latter is
the default shell run on the computer you will be using (though the bash shell is generally recommended).
The tcsh has several nice capabilities (many shared by the other shells). One of these is filename completion. If you
type the beginning of a filename and then type ‘tab’, the computer will finish this filename. If the request is ambiguous the
computer will finish the filename as far as possible and then beep. Typing <ctrl>d will display matching filenames up
to that point.
This shell also keeps a numbered history of your last commands. You can step through them using the ‘up’/‘down’ arrow
keys. The commands can then be repeated or edited. For example, if you type
ls
cat fnd
and then type the ‘up’ arrow twice, you will be returned to the ls command. Alternately, you could rerun the command
by typing just
!l
An exclamation mark followed by a string will repeat the last command beginning with that string. An exclamation mark
followed by a number will rerun that numbered command. The command history will give a numbered list of your past
commands.
This shell also permits the creation of aliases. Aliases can be set up as follows; type

alias dir 'ls -aF'

(The ‘a’ flag shows hidden files and the ‘F’ flag adds a ‘/’ to the end of a directory, a ‘*’ to the end of a binary, and a ‘@’ to the
end of a link. Be careful as these are not actually part of the file name.) Now whenever you type dir, the shell will execute
the

ls -aF

command. Aliases can be bypassed by preceding them with a backslash. Thus

\dir

will return dir to its original definition and ignore the alias (in our case dir is not a defined command and you will get
an error message).
The shell also reads a number of hidden ‘dot’ files (files whose names begin with a period) that customize your environment.
Two of these files are .cshrc and .login. These files are read (and the commands inside executed) every time you start
up a csh (or tcsh) shell and every time you login to get onto the computer. The file .cshrc contains many aliases and you
can edit this file and add your own. Your default path to search for files and commands can also be defined in this file.
Many programs may define a .xxxrc file. They use this file to read and store variables that will be used in the programme.
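As an illustration of the sort of thing a .cshrc might contain (these particular lines are examples, not the defaults on any course machine), consider:

alias dir 'ls -aF'
alias rm 'rm -i'
set path = ($path ~/bin)

The first two lines define aliases (the second makes rm ask for confirmation before each deletion) and the third adds your own ~/bin directory to the list of directories searched for commands.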
To run a command in the background, append an ‘&’ to the end of the command. For example,

ls &
will run ls in the background and present you with another prompt. (But hopefully on my machines the ls should be
done before you get a chance to type in anything else).
Another way to do this, particularly if interactive commands are required only at the beginning of a program, is to type
<ctrl>z when the process has started its work. This will suspend the job (the job is not killed but nor is it active). To
restart the job type “fg” (mnemonic foreground). To put the job into the background type “bg” (mnemonic background).
To check on the jobs that you have running use a ps command. This will list the processes that the computer is currently
working on. To cancel a job use
kill pid
where “pid” is the number associated with the job according to the ps command. Alternatively
kill %n
where ‘n’ is the number of the job given to you when you typed ‘&’. Finally, to kill a program that is currently executing
(assuming it will still accept input from the keyboard), enter <ctrl>c.
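A short session demonstrating this job control, using sleep as a stand-in for a long-running analysis, might be

sleep 300 &
ps
kill %1

The first command starts a five-minute timer in the background, ps lists it among your processes, and kill %1 terminates job number 1.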
2.1.9 Utilities
UNIX has many standard utilities that are very useful but I can only talk about a few here. Perhaps the most used is the
search utility that will find text in a file or files. There is a family of “grep” commands that perform these searches. The
command

grep -i Frank /usr/*/address.bok

will search all subdirectories under /usr that have a file named address.bok and will search inside these
files for the text Frank. The flag -i causes the search to be done in a case insensitive fashion. As an example, to see only
the process that the machine is running for you rather than all processes, type
ps | grep yourid
Depending on what you wish to do, there are also egrep and fgrep variants of this utility. Some other commonly used
utilities are sort, cut, paste, diff, and tr. For information on these see the man pages.
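As a small example of how these utilities combine through pipes (the file name seqs.fasta is invented), the following pipeline counts how often each nucleotide occurs in a FASTA file:

grep -v '^>' seqs.fasta | tr -d '\n' | fold -w1 | sort | uniq -c

grep removes the header lines, tr deletes the line breaks, fold splits the sequence into one character per line, and sort together with uniq -c tallies each character.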
2.1.10 Editors
There are many editors available both for free and commercially. If you have used pine, you have used the pico editor.
The most common and ubiquitous editor is EMACS. EMACS is available on most UNIX computers but, in the past, it has
been rather picky about the terminals it will talk with.
An older, more basic editor that works from almost any terminal is called vi. This editor was designed to work without the aid of a
mouse and to permit easy mapping of keyboards to accommodate multiple hardware manufacturers.
This editor is invoked by typing
vi filename
Again, if the file does not exist then it will be created by this command. There are two modes to the basic editor – command
mode and insert mode. In the former mode, everything typed from the keyboard is treated as a command while in the latter
mode, everything typed from the keyboard is added to the file. To change from insert mode to command mode use the
‘escape’ key. There are several ways to change from command mode to insert mode. To insert text after the cursor hit the
‘i’ key while in command mode. To append text to the end of a line use ‘a’ while in command mode. Use the ‘x’ key
to delete characters under the cursor. In command mode ‘dd’ will delete entire lines, so do not idly type keys while in
command mode. The arrow keys can be used to move around (albeit slowly). This editor has many commands and many
capabilities (see Table 2.1 or see a book on UNIX for more). To exit the editor, move to command mode by hitting ‘escape’
and then type ‘ZZ’ (note upper case!). Your work is automatically saved but a backup of the state of the old file is not
generally made.
On my computer vi is again aliased and in reality typing vi will invoke vim instead. This is a modernized version of vi
which is actively being developed (2008). This project has included mouse support if you have proper terminal definitions
for mouse standards (e.g. an xterm interface). It also has a graphic interface started by gv (another alias, actually gvim).
This update includes many features, the best being easy customization and simple programming abilities. To find out more
check out the vim web site, type
:help topic
inside the editor (note the preceding colon), or examine the free online documentation “vimbook-OPL.pdf” (follow the
link from http://www.vim.org/docs.php), or the book Hacking Vim (a book that supports orphans in Uganda as part of the
vim project), or the commercial book Learning the vi and Vim editors.
2.2.1 ssh
The programs ssh and scp are programs that you should use in preference to telnet and ftp. These programs are
replacements for the older commands rsh and rcp, where the more logical ‘r’ stood for remote (hence to open a shell on
a remote computer – rsh, or to do a remote copy – rcp). The change of the ‘r’ to an ‘s’ stands for secure, and the difference
between rsh and ssh is that the information is encrypted before it is sent across the internet (including encryption of any
transmitted username and password) and then decrypted ‘on the fly’ at the remote computer location. The encryption is
different each time a connection is made and is difficult (but not impossible) to crack. To use ssh simply type
ssh remotehost
and you are off. You might see some information about the nature of the encryption, about exchange of keys and so on.
For scp the commands are the same as cp except that you can use a ‘:’ to separate file names from machine names. For
example (the machine names here are placeholders),

scp george@california:stuff/filename1 frank@newyork:

will copy a file named filename1 in subdirectory stuff under the home directory (default) of user george on a
machine in California to user frank on a machine in New York even if you are a third user sitting on a machine in Canada
(of course you will have to have passwords to all three accounts; george’s, frank’s and the password for the machine in
Canada). See the man pages for further information on these protocols.
There are other programs of this sort, but they can be saved for later.
2.2.2 Mail
I have been stressing in this course the utility of a command line interface to UNIX. There are of course, therefore, character
based interfaces to e-mail. The two most popular character based interfaces are pine and mutt. Each is easy to use and,
more importantly, you can with ease include them in programs that you write (so, for example, mail to Frank the results
of the program and send the ancillary data generated to Susan; or send off formatted emails to specific people based on the
program results). These programs have some simple properties that make them very easy to use. As just one example, to
send a mail message with hundreds of attachments (and annoy your friends) simply enter a command such as (using mutt;
the address and directory are placeholders)

mutt -a directory/* -- user@remotemachine
This command will send an email to a particular user on a remote machine and will attach to this email all of the files in
the directory ‘directory’.
2.3 Scripts-Languages
Even though UNIX provides a wide variety of efficient tools to accomplish generalized basic tasks, these tools cannot solve all
problems. Hence a bioinformatician must learn a programming and/or a scripting language.
All UNIX systems come with a variety of scripting languages and a variety of programming languages (usually pre-
installed). The simplest of these is usually the scripting language of the shell itself. Beginning a file with the line
#!/bin/sh
(or csh, or bash, or zsh, etc.) and changing its permissions to be executable (with the command ‘chmod u+x
file’) will execute each of the commands that follow in the file according to that shell (the experts recommend scripting
in the sh shell or a modern shell rather than csh). If the line is missing, the current shell will be used. Any command and
any series of commands can be put in these files. More capabilities are offered by the sed stream editor and more still by
the awk programming language.
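As a minimal sketch of such a script (the name countseqs and the task are invented examples), the following counts the sequences in a FASTA file by counting its header lines:

#!/bin/sh
# countseqs: print the number of sequences in a FASTA file
# usage: ./countseqs myfile.fasta
grep -c '^>' "$1"

After saving these lines in a file called countseqs and making it executable with chmod u+x countseqs, typing ./countseqs myfile.fasta will print the count.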
Most systems will also include more extensive programming languages. Since UNIX is built upon C, almost all machines
include the C programming language (and more recently, the C++ programming language). Other popular computer
programming languages include java, perl, PHP and python. Some computers may include fortran (f77), pascal
(pc), tcl/tk and there are many others. I would recommend that students learn at least one scripting language for quick
and simple tasks and learn at least one compiled language for more computationally intensive tasks. In my group we
currently use PERL for rapid scripts and we use C for intensive things.
For each of these you can obtain limited information from the man pages. But to actually learn how to use them, you
should find some books or examine instructional web pages (indeed the first prize ever offered for a web-based course
went to a site teaching C++ at MIT).
Detailed instructions on learning a computer language are beyond the scope of these notes. To learn PERL I would recom-
mend, Developing Bioinformatics Computer Skills by C.Gibas and P.Jambeck (2001) and Beginning Perl for Bioinformat-
ics by J.Tisdall (2001). To learn PYTHON I would recommend, a general text that includes PYTHON, Practical Computing
for Biologists by S.H.D.Haddock and C.W.Dunn (2010) or the encyclopedic Learning Python by M. Lutz (2013).
With respect to bioinformatics and sequence analysis there are important resources of which you should be aware.
There are libraries of subroutines and objects (bits of computer language code) that you can incorporate into your
own programs. Many of these libraries are publicly and freely available for all to use. An extremely useful collection
of code, the perl library, can be found at www.bioperl.org/wiki/Main_Page. Java libraries can be found at
biojava.org/wiki/Main_Page, a C++ library can be found at http://kimura.univ-montp2.fr/BioPP and a suite of programs
based on this library called Bio++ can be found at http://home.gna.org/bppsuite/, the Phylogenetic Analysis Library
(PAL), an effort led by Dr. A. Drummond and Dr. K. Strimmer, can be found at www.cebl.auckland.ac.nz/pal-project/,
and a collection of algorithm libraries for bioinformatics (AliBio) designed to be fast and efficient can be found at
http://www.bioinformatics.org/project/?group_id=173.
If you would like to try out a more complete UNIX experience without giving up WINDOWS, you can download LINUX
operating systems for free and install them on your computer with the option of leaving the computer dual-bootable. What
this means is that when the computer is started, it will ask whether you wish to begin WINDOWS or LINUX and then
launch whichever operating system is chosen. This type of installation will automatically partition your drive and hence,
as always, it is a good idea to back up your files.
Free versions of LINUX are available at . . . ,
Ubuntu www.ubuntu.com
Debian www.debian.org
SUSE www.suse.com/linux/
Mageia www.mageia.com
(these are different flavours of LINUX by different groups or companies; there are others in addition to these). The
companies also offer support but this usually is no longer free.
If you are unsure or would just like to see what UNIX things are like, try KNOPPIX or another “live” distribution. This is a form
of LINUX that runs from a DVD or from a memory stick. To start KNOPPIX or a live distribution simply insert the DVD
or memory stick into your computer and restart the machine. These systems do not alter the hard drive of the computer
and simply run the entire operating system off the removable device (with the result that it can be a little slow). To test
drive this system, make sure when you download the operating system that you burn a ‘bootable’ image of the DVD
and make sure to set your computer to boot from the device. Simple instructions for doing both of these are on the KNOPPIX
web site at www.knoppix.org or at the sites listed above.
I am happy to recommend a DVD, memory stick or full installation of BioLinux. This is a customized LINUX distribution
that comes with hundreds of bioinformatic tools pre-installed. BioLinux can be obtained from
http://envgen.nox.ac.uk/biolinux.html out of NERC England.
Chapter 3
Genomics
In the last decade there has been a data explosion in the biological sciences. These new fields have been termed the ’omics. The
most relevant to this course is genomics, which I will briefly explore in this section. But beware: there are many others that
are of relevance to this course, and many of the techniques are relevant to all of the ’omics. Other fields that we will not
have the time to explore include proteomics, transcriptomics, metabolomics, pharmacogenomics, toxicogenomics and so
on. All of these fields share the same characteristic of generating enough data that a simple hands-on approach by a single
researcher is not adequate.
Figure 3.1: The Maxam-Gilbert method of sequencing DNA. The black bars indicate what would be seen in an autoradiogram
of the lanes (G, A+G, T+C, C) from a sequencing gel. Shown on the right is the inference of the corresponding DNA sequence.
Smaller DNA fragments move more quickly through the gel, while larger fragments are retarded and will not move as far.
Electrophoretic methods are sensitive enough to discern the difference in length between DNA
molecules that differ by a single nucleotide.
Figure 3.2: The normal process of DNA replication. Only one chain of the sequence is diagrammed (the template strand is
not shown). The polymerase catalyzes the addition of a nucleotide triphosphate (bottom left) to the growing strand, leading
to the larger molecule shown on the right.
Figure 3.3: A dideoxynucleotide triphosphate. This nucleotide will be incorporated into a growing sequence strand but
because it lacks a 3′-OH, this nucleotide will block further addition.
Figure 3.4: The Sanger method of sequencing DNA. The black bars indicate what would be seen in an autoradiogram of
the four dideoxynucleotide lanes (G, A, T, C) from a sequencing gel. Shown on the right is the inference of the corresponding
DNA sequence.
This method requires a primer for DNA synthesis and then interrupts this process at points corresponding to the linear sequence of
nucleotides. The method was developed by F. Sanger & colleagues and makes use of dideoxyribonucleotides which will
be incorporated into a replicating molecule at random positions (F.Sanger, S.Nicklen, A.R.Coulson, 1977, Proc. Natl.
Acad. Sci. 74:5463). The dideoxynucleotides lack the 3′ OH on the sugar. The diagram in Figure 3.2 shows a cartoon of
the normal process of DNA synthesis. DNA nucleotides are normally 2′-deoxynucleotides and have an OH group at the 3′
carbon. With the addition of a nucleotide triphosphate, a polymerase will catalyze a reaction (indicated by the red arrow in
Figure 3.2) where the OH is exchanged for a bond with the phosphate group of the next nucleotide in order (according to the
complementary strand, which is not shown in the diagram). Sanger’s method makes use of 2′,3′-dideoxynucleotide triphosphates
(Figure 3.3), and since the 3′ carbon is the point where the next nucleotide attaches via the formation of a phosphate bond
(“O - P - O”), the polymerase will stall at the point of addition of the dideoxynucleotide. And even if the polymerase still
has proof-reading activity, it will not rapidly excise the dideoxynucleotide because the corresponding bases are correctly
hydrogen bonded. Again, four individual reactions each containing one of the four dideoxynucleotides can be constructed
and the sequence can again be inferred. In this case, the radioactive label can be attached to the primer.
The Sanger method therefore creates a collection of DNA fragments that are blocked at random points by these dideoxynucleotides.
Like the Maxam-Gilbert method it too has four reaction mixtures that are each run in a different lane of a gel.
The method originally required fairly large volumes and the dangerous use of radioactive labels. Cloning DNA fragments
to generate sufficient raw material of a single DNA molecule was difficult. Reading the resulting autoradiograms became
a tiresome task that many a graduate student has complained about.
More recent improvements have overcome many of these problems. First the chemistry has become more standardized
and reaction volumes have become smaller. The PCR (polymerase chain reaction) was able, in most cases, to replace any
requirement for cloning by generating large quantities of a template. Instead of a radioactive probe attached to primers,
fluorescent probes are used. Using four different fluorescent colours, you can combine the reactions into a single lane on
a gel. You can shrink the size of the lane to a capillary. Then as the DNA fragments migrate within the electrophoretic
field, the fluorescent probes can be excited by a laser and their emitted light can be detected and automatically measured
by a photometer. The intensity is measured as the run proceeds and is automatically stored into a computer. An example
of a sequence chromatograph is shown in Figure 3.5-3.7 (this chromatograph comes from the bacterium Sinorhizobium
meliloti). The resulting bases can be inferred by a computer program and automatically analyzed.
As with most human endeavors the process of DNA sequencing is not 100% accurate. The beginning of a sequence run (or
trace) is usually too poor to permit inference of the DNA sequence. Also as the mixture of DNA fragments is run for an
extended period of time, the electrophoretic resolution of the fragments becomes poor and identical fragments will migrate
to different distances in the gel. This causes the trace for each nucleotide to spread out and become broader. This itself is
not a problem but as the height of the chromatograph peaks shrink and as their overlaps become more extensive, the ability
to determine which nucleotide is followed by which becomes more difficult.
In addition, a poor trace can result from many different factors. For example, if there is a repetitive region being sequenced,
the polymerase might stutter as it goes through the region. Alternatively there might be more than one template being
sequenced. In either case, the trace will contain more than one sequence superimposed and it will be impossible to
correctly call the sequence (but under good conditions, base substitution polymorphisms can be detected).
Compression is a common phenomenon in DNA sequencing. When two (or more) guanine nucleotides appear in a row
in the sequence, these bases will stack together and appear much closer electrophoretically than would a mixture
of other nucleotides. Since base calling makes use of the separation between peaks, it can be fooled into calling a single
base with a wide peak rather than two bases with peaks pushed together. For all of these reasons it
is usually necessary to sequence the same segment of DNA from the opposite direction to ensure that the nucleotides have
been correctly determined.
To deal with these errors, the software that makes base calls also tries to estimate the probability of errors in these calls. The
most common way to measure errors is to use a so-called Phred score, Q. This score is named after the software package
of the same name written by Phil Green. This was originally done using a series of lookup tables that were hard-coded
into the software. These tables made use of several characteristics of the appearance of the trace and what the trace files
for sequences with known errors looked like. Today most manufacturers of sequencing hardware will include software that
estimates the error rate for their particular machines. The quality scores are usually expressed as a Phred score even though
the method of calculation might be quite different.
The Phred score gives an estimate of the probability of an error, e. The two are related by

Q = -10 log10(e)

so that a
10% error gives Q = 10,
1% error, Q = 20,
0.1% error, Q = 30,
0.01% error, Q = 40,
. . . and so if the probability of an error (a base mis-call) at a particular site is 20% (a very questionable base call) then
the Phred score would be Q = 6.99. The Phred score is used to assess sequence quality, to recognize and perhaps remove
low-quality sequence in automated programs, to aid the joining of overlapping reads (particularly important since the ends of
reads often contain more errors), and in the determination of accurate consensus sequences.
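As a quick check of this relationship (the awk one-liner is just an illustration; awk's log function is the natural logarithm, hence the division by log(10)), the conversion from an error probability to a Phred score can be done at the command line:

awk 'BEGIN { e = 0.20; printf "error = %g  Q = %.2f\n", e, -10 * log(e) / log(10) }'

For a 20% error probability this prints Q = 6.99, matching the value above.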
The circles at the ends of the reads indicate that vector sequence has been trimmed from the end of the reads. The bar
to the left indicates the progress of the sequencing for this region of the consensus. The yellow strip on each side of the bar
indicates good coverage in both forward and reverse directions. The blue colour indicates limited coverage in one direction
and the black colour indicates that there is no sequence in that direction. The red strips in the bar indicate an unresolved
disagreement between the reads for a particular base. Note that although there are many coloured boxes on the individual
reads indicating disagreements between the reads, these are generally resolved by multiple reads and result in only a few
red bars.
As the sequences for the genome accumulate, a consensus among individual reads is found by computer. This consensus
grows in size as new reads are made and as they overlap in their sequence. It is a time consuming process to take each read
and determine if and how it might overlap with the other reads. Intelligent algorithms have been developed to carry out
this process.
As the reads are put together, the consensus sequence will grow in length. These growing chunks of sequence are called
“contigs” (contiguous regions of sequence). An example of contigs is shown in Figure 3.13. The individual reads are
shown on the right of the figure. The blue arrows show a contiguous overlapping consensus sequence, with the largest
region at the top moving down to smaller regions and with singleton reads at the bottom. Previous contigs joined together
in this analysis are shown by the black arrows to the left of the blue contigs.
One would hope that with enough reads the contigs will be joined into a single sequence that would represent the entire
chromosome. However, at some point, there are diminishing returns and it is more efficient to target a particular gap
between contigs to join them together. This can be done by taking the sequence at the end of a contig and making
sequencing primers that would extend beyond the limit of the contigs. Sometimes other more devious measures have to be
applied to fill these gaps. Sometimes they simply cannot be filled. This is the case for many eukaryotic sequences. The
centromere of many eukaryotes consists of short sequences repeated up to a million times. There is no reason to sequence
through these (ignoring the difficulties of actually doing so) and hence they are intentionally left as gaps in the sequence.
The next step in most genomic sequencing projects is to figure out (at least in a preliminary sense) what the sequence does.
That is, where are the genes, and where are structural features such as repeats, signal sequences and so on? In prokaryotes
this is comparatively easy since their genes are contiguous along the sequence and are without internal gaps. In eukaryotes,
the genes are interrupted by the presence of introns and the individual exons of genes may be separated by long distances.
Even with prokaryotes, however, there are no flags sitting in the DNA stating that this is a gene. Some of the methods of
Figure 3.14: Potential coding regions must be found — in this case using a hidden Markov model method called FrameD
Figure 3.16: The MmeI restriction enzyme creates staggered cuts at a distance from the recognition site
hybridization of the patient’s DNA to these oligos is quantified. Some methods make use of a gain of hybridization
signal to an oligo containing sequences known to cause the disease. Other methods make use of the loss of a hybridization
signal to perfect-match oligos. The tricks here involve construction of a large number of oligos and the subsequent scanning
of the degree of hybridization to each oligo. The ultimate goal of this methodology would be to create a “universal array”
that contains all possible oligonucleotides. Although quantifying the presence of all possible oligos does not permit the
determination of a new genome sequence, it can be used to determine the sequence of a variant of a known sequence.
Theoretically at least, Pe’er et al. 2002 PNAS 99:15492 have shown that an array consisting of just 8-mers (8⁴ = 4096) is
sufficient to resequence targets of more than 2 kb (as will be seen below, an array this size is easily achieved).
Still other methods of resequencing being explored make use of primer extension reactions on perfect-match oligos. These
oligos are arrayed on a surface (e.g. see section 3.11.1) and sequencing is performed on this surface. The dideoxyribonucleoside
triphosphates are added such that each is labelled with a different fluorescent dye, and fluorescence microscopy is then used
to assign the identity of target nucleotides extended from the 3′ end of the oligo (Pastinen et al. 1997 Genome
Res 7:606).
Another method being explored makes use of developments in mass spectrometry: matrix-assisted laser desorption
ionization time-of-flight mass spectrometry (MALDI-TOF MS) combined with methods to ionize macromolecules such as
electrospray ionization. Creating ions of macromolecules has normally been difficult, but advances in laser technology and
ionization methods have made this possible for fragments of DNA. The advantage of a mass spectrometry method is that it
is highly repeatable and consistently accurate. This is particularly useful with DNA fragments that are difficult to sequence
through gel electrophoresis, and in fact it can be used to sequence RNA molecules (for a review see Edwards et al. 2005
Mutation Research 573:3). This method also has the ability to resequence small genomes and could be useful in clinical
applications (Tost and Gut 2005 Clin Biochem 38:335).
To resequence large genomes a method has been developed by Shendure et al. 2005 Science DOI: 10.1126/science.1117389
that can (in principle) handle an entire bacterial genome. Their method begins by size-selecting randomly sheared 1 kb
fragments from the genome. These are ligated to a universal linker under conditions that will result in both ends of the 1 kb
fragments being ligated to the ends of the linker (creating circular molecules).
The linker contains an MmeI restriction site at each end. MmeI is a restriction enzyme that recognizes the sequence 5′-
TCCRAC-3′ and then creates a staggered cut 20 bases in the 3′ direction on the 5′-3′ (upper) strand and 18 bases away
in the 3′ direction on the 3′-5′ (lower) strand (see Figure 3.16). Cutting the circular construct with this enzyme creates a
molecule that contains the linker with 18 bp of genomic sequence at each end. Universal amplification/sequencing primers
are then added to each end. Hence, this results in 2 × 18 bp of genomic DNA flanked and separated by universal primers
that are used for amplification/sequencing. These two pairs of 18 bp are approximately 1 kb apart in the original genome.
These primers are used to amplify this construct. The construct is attached to a 1 µm bead (to learn about bead technologies
see www.lifetechnologies.com, the company’s brochure or, less informative, their video for a quick introduction
to surface-activated beads). The amplification is done using ePCR – the “e” standing for emulsion PCR. Emulsion PCR is
standard PCR but done in an oil–water emulsion such that each bead is likely to occupy a single water droplet. All amplified
fragments will then attach to the bead, resulting in a bead that has many copies of a single fragment.
They then use an unusual method for determining the sequence of these short fragments. They wish to avoid the cost of
acrylamide sequencing. Instead they use oligos that have specific fluorescent bases at different positions (for details see
their paper). Using these they can determine the sequence of the first 6 bp and the last 7 bp (13 bp) of each of the two tags.
A computer then maps these small fragments onto an already known genome.
As a demonstration of this technology they resequenced an evolved strain of E. coli for SNPs. They collected 30 Mb after
60 hours of instrument time (2.5 days). This technique is good for resequencing of bacteria. It will need to be enhanced to
Figure 3.17: To accomplish pyrosequencing, templates are attached to beads in individual wells (A) and surrounded by
smaller beads with attached enzymes (B; these figures are from Margulies et al. 2005 Nature 437:326)
permit eukaryotic resequencing because the 1 kb distance is not long enough and the sequence determined is too short to
correctly place some repeated elements. This methodology has been largely overtaken by those listed below.
One exciting method to sequence DNA de novo was developed and patented by the company 454 Life Sciences
Corporation (owned by Roche). This method was originally described in the article Margulies et al. 2005 Nature 437:326.
They make use of a method that can detect the pyrophosphate released when a nucleotide triphosphate is added to a
growing chain (Figure 3.2). They use the enzyme sulfurylase to catalyze the conversion of PPi to ATP. The concentration
of ATP is then sensed making use of the firefly’s luciferase enzyme. The amount of light produced is measured by a sensitive
CCD (charge-coupled device) camera and should be in direct relation to the amount of PPi released and hence to the ATP
concentration.
The next trick that they use is to amplify individual fragments from a genome. They do this by randomly shearing the
genome into fragments. Fragments are then covalently ligated to a four-nucleotide marker/primer fragment. Each fragment
is then bound to a single bead by ensuring an excess bead concentration. A PCR reaction to amplify random fragments
using the ligated primers is then performed, but again it is an ePCR done in an oil/reaction-mixture emulsion such that each
bead will uniquely occupy a single droplet. The result is that only one fragment is amplified per droplet and all the amplified
copies become attached to a single bead.
The beads are placed in a matrix containing wells that can each hold only a single 28-µm bead (Figure 3.17A). The
matrix is 60mm × 60mm (a square approximately equal to the size of the small side of a credit card) and should contain
approximately 1.6 million wells. Smaller beads are added that carry immobilized enzymes required for sequencing and
required for the generation of fluorescence (Figure 3.17B).
In successive waves the matrix/slide is washed with a solution of a single nucleotide triphosphate, then a wash solution,
followed by the next nucleotide triphosphate and so on. During each wash the fluorescence of each well is measured and sent
to a computer. The computer quantifies the level of fluorescence and calls the number of nucleotides of that particular type
added in that well. By quickly washing the matrix/slide and measuring the addition of the next nucleotide triphosphate, the
technique can carry out shotgun sequencing of an entire genome.
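The per-well base-calling logic can be sketched roughly as follows (a simplified toy version of my own; the real software fits signal distributions rather than simply rounding, and the flow order shown is only illustrative):

def call_bases(flow_values, flow_order="TACG"):
    """Convert one series of per-flow light intensities into a base call.
    Each intensity is rounded to an integer homopolymer length."""
    seq = []
    for i, signal in enumerate(flow_values):
        n = int(round(signal))          # e.g. a signal of 2.9 -> three bases
        seq.append(flow_order[i % len(flow_order)] * n)
    return "".join(seq)

# Flows cycle T, A, C, G; a signal near 3 on the A flow means the well's
# template directed three consecutive additions of A.
print(call_bases([1.05, 2.93, 0.1, 1.0]))   # "TAAAG"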
In the Margulies et al. 2005 Nature 437:326 article, the authors demonstrate the technique by resequencing the genome of
Mycoplasma genitalium. Their run through the instrument took 243 minutes for 42 cycles of reads/washes. The total read
lengths after these 42 cycles were on average 108 bp (multiple bases can be added per cycle, e.g. if there are three
A’s in a row in the template). This run generated over 47 million bases of good-quality reads. Thus it took just four hours to
sequence the entire genome (neglecting gap closure). Indeed the authors state that they repeated the whole process eight
times, yielding a 320-fold coverage of the genome.
Another method, called SOLiD, is from Applied Biosystems (ABI) and starts at the same point as the Roche/454 system
with emulsion-based PCR. The beads however are covalently bound to a glass slide; approximately 100 million of them.
The sequences to be determined are ligated to two adapters, one at each end. The beads carry the complement of one adapter
and in this way the sequences are hybridized (hydrogen-bonded) to the beads for the emulsion PCR.
Then instead of pyrosequencing it uses ligation sequencing. A primer that has homology to the adapter at the 5′ end
of the sequence is used. Then a mixture of oligomers, each with a linked fluorescent dye, is added. The oligomers differ
importantly at their 3′ end and in the colour of the fluorescent dye at the 5′ end. The oligomers are eight bases long.
After the first two bases, the next three are degenerate (mixtures of all nucleotides at these sites), followed by another three
‘universal’ bases. Each oligomer competes for hybridization to the sequence via the first five bases and is then
ligated to the primer. There are four dyes for the sixteen possible dinucleotides at the 3′ end, so the oligomers are
redundantly labelled (AA, CC, GG, TT are one colour; AC, CA, GT, TG another colour; AG, CT, GA, TC another; and
AT, CG, GC, TA another). After the fluorescence is measured, the end of the oligomer carrying the dye is cleaved off along
with the three ‘universal’ bases, leaving a new site for potential ligation. Another ligation cycle is then performed,
then fluorescent detection, then cleavage and so on. The number of cycles determines the read length, with seven cycles
yielding a 35 bp read length.
After the ligation cycles, the primer is removed and a new primer is added that is one base shorter (n − 1). The whole
process is repeated. Then this primer is removed and an n − 2 primer is used, then an n − 3, and then an n − 4 primer
(this process is called primer reset). Using this process each base is queried more than once, with each base being read by
two overlapping dinucleotides starting from two different primer resets.
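The decoding of colour space back to bases can be illustrated with a short sketch (my own illustration; the colour groups are the four listed above, and the read’s first base must be known from the primer):

# Two-base colour code: colour 0 = {AA,CC,GG,TT}, 1 = {AC,CA,GT,TG},
# 2 = {AG,GA,CT,TC}, 3 = {AT,TA,CG,GC}.
CODE = {
    0: {"A": "A", "C": "C", "G": "G", "T": "T"},
    1: {"A": "C", "C": "A", "G": "T", "T": "G"},
    2: {"A": "G", "G": "A", "C": "T", "T": "C"},
    3: {"A": "T", "T": "A", "C": "G", "G": "C"},
}

def decode_colorspace(first_base, colors):
    """Decode a colour-space read given its known first base: each colour
    determines the next base from the previous one."""
    seq = [first_base]
    for c in colors:
        seq.append(CODE[c][seq[-1]])
    return "".join(seq)

# A known 'T' followed by colours 2, 0, 1 decodes as T -> C -> C -> A
print(decode_colorspace("T", [2, 0, 1]))   # "TCCA"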
In total this yields 20 gigabases of data per run. The method is very suitable for re-sequencing projects. The technique is
also amenable to improvement with more ligation cycles (50 bp reads being done) and more beads per slide.
In addition to the 454 and SOLiD sequencing methods, another promising method has been developed for
resequencing genomes. This method, patented by Solexa (also known as “sequencing by synthesis” and now owned by
Illumina), is based on parallel sequencing of small DNA fragments bound to a solid surface. In comparison to the 454
sequencing protocol, which involves the successive use of each nucleotide independently, the Illumina protocol uses all
four nucleotides at the same time. These nucleotides are known as terminator nucleotides (see Turcatti et al. 2008, Nucleic
Acids Research 36(4): e25). A fluorescent dye linked to the 3′-OH end prevents the incorporation of a second nucleotide
during a cycle. After washing away the unincorporated nucleotides, the fluorescence of each incorporated nucleotide is
detected, the dyes are cleaved, and a new cycle is started.
The Illumina sequencing process includes two different steps taking place on a slide: the amplification of the DNA
fragments and their sequencing. The genomic DNA is first fragmented and end-repaired (in order to obtain blunt-ended
DNA), and adapters are then ligated at both ends (Figure 3.19). Single-stranded DNA fragments are then bound to the flow
cell and amplified on the cell using solid-phase bridge amplification. This process leads to clusters containing up to 1000
copies of a DNA fragment. A cell can contain up to 10 million clusters per square centimeter. The sequencing step is
performed by adding the four terminator nucleotides followed by the detection of their incorporation in each cluster. This
technique allows about 30 million reads of 35 bases each (about 1 Gb in total) within about 90 hours from sample preparation
to data collection. This technology is used not only in genome sequencing and resequencing but also in barcoding, gene
expression and small RNA identification; it can also be combined with chromatin immunoprecipitation (ChIP) analyses.
The HiSeq 2000 version of the Illumina machine was announced in January 2010 and boasts up to 200 Gb per run (for
this quantity each run takes 8 days), 2 × 100 bp read lengths, or up to 25 Gb per day, with two billion paired-end reads per
run. They claim that “in a single run, sequence two human genomes at 30x coverage for less than $10,000 (USD) per genome,
or perform 200 gene expression profiles for less than $200 per sample”. Meanwhile, ... Roche has extended their read lengths
up to 1000 bp with a modal length of 766 bp ... reaching the same useful lengths as traditional Sanger sequencing but with
amazing throughput.
Aside from the technical differences between the 454 and the Illumina Solexa techniques, the major differences between
these two sequencing techniques are in their outputs. Longer reads are obtained with the 454 than with the Illumina
Solexa technique (old versions: 250 bases vs 35 bases; new versions: 1000 bp vs 100 bp respectively) while the Illumina
Solexa protocol yields a larger number of reads. These second-generation sequencing techniques have considerable
advantages over the traditional sequencing techniques as they are faster and produce more data, and they have therefore
led to a huge decrease in the cost of sequencing.
In order to sequence complete genomes it is necessary to map the sequences onto the genome. That is, the physical location
of any one read must be determined. Unfortunately, most genomes contain sequences that are highly repetitive. The same
sequence might be present in multiple physical locations around the genome. In the case of the human genome, there
is extensive redundant repetition, with the same sequence (up to 5000 bp or more) dispersed around the genome in
millions of copies. If you have a 700 bp sequence from one of these repeats, how do you know which of the millions of
locations this sequence read came from?
The trick to sequence these regions is to create what are known as paired reads. For traditional Sanger sequencing, the
genomic DNA could be sheared to known fragment lengths, say 10 kb (although any other length is feasible). The fragments
are run on a gel and then a region corresponding to 10 kb is cut from the gel. The DNA is then eluted from the gel. This
eliminates fragments shorter than 10 kb and eliminates fragments that are longer than 10 kb. These fragments are then
cloned into a sequencing vector. Then sequencing primers are added that read out from the vector into the cloned 10 kb
genomic fragment. The trick is to add primers that read in from both ends of the cloned fragment. Although Sanger
sequencing will not read 10 kb, when sequenced from each end, two reads are obtained and it is known that these reads are
approximately 10 kb apart in the genome. An even harder trick is to get the assembly software to account for this paired
information.
Obviously this problem becomes more difficult with the shorter reads that are generally obtained from second generation
sequencing technologies. With shorter reads many more repeats become problematic. Long reads could anchor short repeats
such as micro-satellites that would confound short reads. Illumina generates paired reads in two different fashions; they
distinguish between what they call “paired end” reads and “mate pair” reads (Figure 3.20). In either case the software is
told the approximate separation of the reads and it takes care of the hard job of assembling these sequence reads.
Figure 3.20: Paired end sequencing: Adapters (A1 and A2) with sequencing primer sites (SP1 and SP2) are ligated onto
DNA fragments (in the mate-pair protocol the 5′ ends are biotinylated). Template clusters are formed on the flow cell by
bridge amplification and then sequenced (modified from http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn)
Both Roche and SOLiD have similar tricks to generate paired-end reads.
Pacific Biosciences has developed a method to follow the progress of a polymerase on a single molecule. They fix the
polymerase in place and then use fluorescent dyes attached to the phosphates of the nucleotides. As the polymerase
attaches the next nucleotide there is a high residency time for the fluorescent dye in a microwell. This is detected by the
sequencer and recorded. The result is a recording of the progress of a single polymerase as it replicates a single template.
Pacific Biosciences notes that they can achieve very long read lengths in excess of 10,000 base pairs (though these are rarer
than shorter read lengths), that the synthesis is very rapid (multiple bases (1-3) per second) and of course the whole process
is massively multiplexed in parallel (80,000 reactions). The method does suffer from a high error rate, but this can be largely
corrected by multiple reads of the same template.
In 2009 a different approach was used in a paper from Helicos Biosciences. It involves re-sequencing: the paper reports in
Nature Biotechnology the re-sequencing of a human genome (that of Stephen Quake, the founder of Helicos) for an estimated
$50,000, taking approximately four data collection runs and one operator. The methodology achieved a 28× coverage with
an error rate estimated as 1/20,000.
The Helicos sequencing method worked by splitting the DNA into single strands and breaking the strands into small
fragments that on average are 32 (24 to 70) nucleotides in length. Their methodology does not involve amplification but
rather depends on sequence from single molecules (Braslavsky et al., 2003). The fragments are affixed onto a glass slide.
On each of those tethered strands a new strand is synthesized by sequencing-by-synthesis with fluorescently labelled
nucleotides. The fluorescence generated is captured by a microscope and monitored for each of the billion DNA fragments.
A computer then matches the billions of 32-nucleotide fragments to the completed human genomes already known. Their
data indicated 2.8 million SNPs and 752 regions of copy number variation (CNV) for this one genome.
A different method is used by Complete Genomics Inc. On September 9, 2009 they announced:
MOUNTAIN VIEW, Calif. Sept. 9, 2009 Complete Genomics Inc., a third-generation human genome
sequencing company, announced today that it has sequenced, analyzed and delivered 14 human genomes to
customers since March 2009. Considering that fewer than 20 genomes have been sequenced and published in
the world to date, this is a significant advance both for Complete Genomics and medical research.
Their methodology is again a ligation approach to sequencing and uses DNA nanoballs with fluorescence. As of October
1, 2010 they had announced 400 complete human genomes. The company now reports being able to sequence 400 human
genomes per month (Aug 2012) for costs as low as $5,000 for 40× coverage.
In a recent publication (July 2012) Complete Genomics accomplished accurate whole-genome sequences and, more
importantly, haplotyping from just 10-20 human cells. They accomplished the haplotyping by sequencing from multiple
highly diluted libraries. Their “long fragment read (LFR)” technology includes reads from single DNA fragments of 10 to
1000 kb in length. Because of the dilution and multiple reads from duplicate libraries from a single individual, it is possible
to identify the phase of even new mutations. They claim an error rate of 1 in 10 million bases; sufficiently accurate for
clinical applications involving new mutations.
During polymerization an H⁺ ion is released and is detected by a sensor under each well.
coupled to the nanopore. As the exonuclease cleaves individual bases from a DNA strand, the cleaved bases can pass
through the pore. In the pore the bases momentarily interact with a cyclodextrin molecule engineered into the nanopore.
When they do so, they locally disrupt the electrical potential of the bilayer. The disruption can be detected by a chip and
is characteristic of each base. Further, the company notes that modified bases (e.g. methylated cytosine) can be detected
directly. The technology permits long reads to be obtained, with the longest reported to date (Aug 2017) being a 950 kb
read.
Their MinION flow cell now (May 2017) comes in a new version called Flongle (flow cell dongle) for clinical diagnostic
applications. The development of the Flongle will also help the company’s work on the SmidgION, a miniature sequencer
with small flow cells that is powered by a mobile phone. The company said the newest MinION will produce more than
20 gigabases of data in 48 hours (in the company’s hands). The company reports raw reads with greater than 95 percent
accuracy, operating at a speed of 450 bp per second. The company has also announced the launch of the GridION X5, a
desktop nanopore sequencer that can run up to five flow cells at a time and will have an initial output of up to 100 gigabases
of sequence data per 48-hour run. The platform comes with a local high-performance compute cluster that allows for
real-time basecalling and data analysis.
Still others are developing DNA sequencing straight from electron microscopy, and others are developing peptide
sequencing from mass spectrometry. Still others are developing new sequencing technologies with hybridization methods
(e.g. GnuBIO) and so on.
There are problems with each of these methods, but they are in early stages and each method is being actively improved.
Together they hold the promise that in a few years/decades you will go to your doctor’s office, they will take a pin-prick
of blood, and your complete genetic profile will be determined within hours. The goal of these companies is to reach a
methodology that enables human-genome-level sequencing for just $1000. Others, such as RainDance Technologies
founder David Weitz, claim to be developing methodologies using microfluidics that would “sequence a human genome
30 times for $30.” (posted 7th June 2010).
With such cheap genomes readily available, the bottleneck has become our ability to understand and interpret the mounds
of data that the technology reveals. This is a very big and tangible bottleneck even with current technology.
isolated from an individual can be hybridized to this array and non-matching DNA is washed away. The hybridized bits
are then eluted off the array and sequenced. For example, the company NimbleGen makes available SeqCap EZ Exome
Libraries, a solution-based capture method, while Affymetrix offers exome arrays.
An alternative method is to use an oligo library to which streptavidin beads have been attached. Again the hybridization is
done with sample DNA, and then the beads are captured and the hybridized DNA washed off.
Exome sequencing has been used to discover the mutations responsible for Mendelian diseases (Ng et al. 2010, Nat Genet
42:30) and for clinical diagnosis. Subsequently it has been used, for example, to discover the mutation that causes familial
autosomal dominant chronic candidiasis in humans, to determine the cause of hereditary progeroid syndrome, and to find
an Alu insertion causing retinitis pigmentosa in humans.
A major limitation of exome sequencing for disease discovery is that it analyzes only a small portion of the genome. For
the most part, splice variants, expression variants and many copy number variants are lost.
The idea behind Restriction site Associated DNA (RAD; Baird et al. 2008) tag sequencing is to use a genetic marker
associated with a restriction site. Hence the genome can be analyzed with reproducible markers that harness the massive
abilities of modern sequencing without having to sequence the entire genome. The RAD tags are the sequences that flank
each restriction enzyme site throughout a genome. How much sequencing you wish to do will influence which restriction
enzyme(s) are chosen; an in silico digest of a reference genome gives a quick estimate, as sketched below.
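A minimal sketch of such an in silico digest (my own illustration; the toy genome is invented, while GAATTC and CCTGCAGG are the standard recognition sites of EcoRI and SbfI):

def count_rad_tags(genome, site):
    """Each occurrence of the recognition site yields two flanking RAD tags."""
    count, pos = 0, genome.find(site)
    while pos != -1:
        count += 1
        pos = genome.find(site, pos + 1)
    return 2 * count

toy_genome = "TTGAATTCAAGGCCTGCAGGTTGAATTCGG"   # invented sequence
for name, site in [("EcoRI", "GAATTC"), ("SbfI", "CCTGCAGG")]:
    print(name, count_rad_tags(toy_genome, site), "tags")
# EcoRI 4 tags, SbfI 2 tags: the rarer cutter yields far fewer regions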
To accomplish this method requires isolation of the DNA containing the particular restriction sites. This can be done via
columns, beads, or by ligating Illumina adapters straight onto the restricted DNA. With multiple barcode linkers it is
possible to do RAD-tag sequencing for a large number of samples simultaneously.
This method was improved by Peterson et al. (2012) to use double restriction digests (ddRADseq). The digests are via
a rare cutter and a common cutter. The former’s cuts are used as the associated restriction sites for sequencing, while the
latter enzyme is used to avoid random shearing and to provide consistent fragmentation of the genome. In addition, they
used robotized size selection of the resulting fragments, again to ensure consistent and reproducible fragments. By tuning
the enzymes used and the size selection, fragments from hundreds or from millions of regions genome-wide can be analyzed.
Perhaps the most common use of RAD tags is to search for SNPs among multiple individuals. They are also commonly
used to evaluate genome wide levels of divergence and polymorphism, and for QTL mapping.
3.10.3 BAsE-seq
BAsE-Seq stands for “Barcode-directed Assembly for Extra-long Sequences” (Hong et al. 2014). The motivation for this
method is to determine the haplotypes of viral genomes using short-read technology. While there are long-read technologies,
they are currently more error prone and more expensive. While short reads can find SNPs that may be important clinically,
the phase of such a SNP relative to other SNPs might affect pathogenicity.
So the trick to getting this method to work is to barcode a particular genome with a 20 bp random sequence, amplify the
genome so that you have many copies of the genome with the same barcode, and then associate the barcode with different
parts of the genome.
With the random barcode in place, a biotin label is attached to one end of the amplified fragments and acts as a block to an
exonuclease. The exonuclease is added in such a way as to create a uniform distribution of genome fragments all containing
the same barcode. The individual molecules are then circularized via intra-molecular ligation. This construct now has the
barcode next to the location where it was originally inserted, but in the other direction the barcode is next to a (hopefully)
random part of the genome generated from wherever the exonuclease stopped processing. By sequencing these molecules
the resultant short reads will each have a barcode that identifies the original viral genome.
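Computationally, the first analysis step is simply to bin the reads by barcode before assembling each haplotype; a minimal sketch (my own, assuming for illustration that the barcode forms the first 20 bases of each read):

from collections import defaultdict

def group_by_barcode(reads, barcode_len=20):
    """Bin reads by their leading barcode; each bin then derives from a
    single original viral genome and can be assembled on its own."""
    bins = defaultdict(list)
    for read in reads:
        bins[read[:barcode_len]].append(read[barcode_len:])
    return bins

reads = ["ACGTACGTACGTACGTACGT" + "TTGGCCAA",   # barcode + genomic fragment
         "ACGTACGTACGTACGTACGT" + "GGCCAATT"]
for barcode, fragments in group_by_barcode(reads).items():
    print(barcode, len(fragments), "reads")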
Hong et al. (2014) applied this method to a clinical sample of hepatitis B virus and obtained over 9,000 viral haplotypes.
This gives an unprecedented view of viral population structure during chronic infection.
3.10.4 RNA-seq
RNA-seq refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about
the genes being expressed in a cell at any one time.
There are many ways to isolate the mRNA from a cell; these often involve ‘kits’ and are often customized for the
particular application and sequencing platform. In general the first step is to isolate RNA away from DNA. Usually rRNA is
a large fraction of the RNA in a cell, so these molecules are removed via probe hybridization. The remaining RNA is reverse
transcribed into cDNA (copy DNA). The cDNA is then sequenced.
Next-generation sequencing technologies permit deep coverage and base-level resolution. Even comparatively rare
messages can be detected. RNA-seq provides researchers with efficient ways to measure transcriptome data experimentally,
allowing them to get precise information on how different alleles of a gene are expressed, to detect post-transcriptional
modifications, to identify gene fusions and even to detect individual cell differences in transcript production. It avoids the
biases and uncertainties of hybridization inherent in microarrays and, unlike microarrays, provides absolute numbers with
which to estimate transcript levels.
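As one concrete example of such estimates, read counts are commonly converted to transcripts per million (TPM), which corrects for transcript length; a short sketch (gene names and numbers invented):

def tpm(counts, lengths_kb):
    """Transcripts per million: normalize counts by transcript length,
    then scale so the values sum to one million."""
    rates = {g: counts[g] / lengths_kb[g] for g in counts}
    total = sum(rates.values())
    return {g: rates[g] / total * 1e6 for g in rates}

counts = {"geneA": 500, "geneB": 500}          # hypothetical read counts
lengths_kb = {"geneA": 1.0, "geneB": 5.0}      # transcript lengths in kb
print(tpm(counts, lengths_kb))
# geneA ~833333, geneB ~166667: equal counts, but geneB is five times
# longer, so it must have been present at a lower concentration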
3.10.5 BS-seq
Bisulfite sequencing makes use of the chemical bisulfite to alter DNA sequence in a fashion that depends on its state of
methylation. Most mammals methylate their DNA as a way to control levels of transcription (usually as repression). In
mammals we tend to methylate the 5 position of cytosine, preferentially at the dinucleotide CpG.
It has been found that spontaneous deamination of cytosine occurs frequently and results in a uracil residue in the DNA.
Repair enzymes recognize that uracil should not be present and tend to repair the aberrant base. If, however, the cytosine
base is methylated, then the deamination product is thymine and the result is a C to T mutation. Thus CpG dinucleotides
are retained in places where the DNA is hypomethylated, and this results in CpG islands (regions of high CpG frequency)
that can be used to indicate the presence of highly expressed genes. In addition to this role in modulating gene expression,
methylation is also used as an epigenetic marker indicating which gene copy is paternal and which is maternal. Because of
these roles, there is an interest in determining methylation patterns.
Treatment of single-stranded DNA with bisulfite will deaminate the cytosines to uracil. If this reaction is carried out to
completion, all unmethylated cytosines will be converted into uracil residues. In order to sequence the DNA, the genome
is first sheared to short fragments and then Illumina adapters are ligated onto the fragments. These are then treated with
bisulfite and bridge amplified. The result is a read with T at sites that have C in untreated controls. The methylated C
residues will stay as Cs in both treatment and control conditions.
This then yields the methylation status of individual cytosine residues, yielding single-nucleotide resolution information
about the methylation status of a segment of DNA.
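The core of the downstream analysis is a simple comparison of treated and untreated (or reference) sequences; a toy sketch of my own (real pipelines must also handle strands, SNPs and incomplete conversion):

def call_methylation(reference, converted):
    """For each C in the reference, a C in the bisulfite-treated read
    implies methylation; a T implies the C was unmethylated."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted)):
        if ref == "C":
            calls[i] = "methylated" if obs == "C" else "unmethylated"
    return calls

print(call_methylation("ACGTTCGA", "ACGTTTGA"))
# {1: 'methylated', 5: 'unmethylated'}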
Interestingly, your epigenome varies with age, varies with tissue, is altered by environmental factors, and may show
changes in response to disease. The mapping and understanding of methylation and other epigenetic markers will help
us to understand how aging, tissues and diseases relate to these markers. Specific patterns of methylation are indicative of
specific cancer types and could have diagnostic and treatment value.
3.10.5.1 TAB-seq
Methylation is not the only epigenetic marker in DNA. By applying similar techniques these others too can be discovered,
sequenced and mapped. As just one example, consider TAB-sequencing, introduced by Yu et al. (2012). They are interested
in 5-hydroxymethylcytosine (5hmC). This base modification is necessary for normal mammalian development and in
embryonic stem cell regulation. It is, however, resistant to deamination by bisulfite treatment and hence cannot be
discriminated from simple 5-methylcytosine (5mC) by bisulfite sequencing.
In order to distinguish them, Yu et al. make use of two features. (1) They use TET proteins, which oxidize 5mC to 5hmC
and then to 5-carboxylcytosine (5caC); when 5mC is changed to 5caC, it will be deaminated by bisulfite treatment. (2)
The addition of glucose to 5hmC makes beta-glucosyl-5-hydroxymethylcytosine (5gmC), and the latter is resistant to
oxidation by TET proteins. Hence treatment with normal bisulfite sequencing finds all 5mC and 5hmC sites, while a second
treatment with TAB-sequencing will find all 5hmC sites (as these will be protected by 5gmC) but not the 5mC sites.
Using this technique, they show that levels of modified 5hmC bases are high (while levels of 5mC modifications are low)
near, but not on, transcription factor binding sites. Additionally they found some other patterns of modifications whose
significance is still uncertain.
3.10.5.2 NOMe-seq
NOMe-Seq is a single-molecule technique that looks at both nucleosome positions and DNA methylation. The assay
combines BS-seq, to measure methylation patterns, with a second enzyme, M.CviPI GpC methyltransferase, which
methylates GpC dinucleotides but only if the enzyme can gain access to the GpC site. If the site is covered by a
nucleosome or other DNA-binding molecules, the methyltransferase will not function.
Combining this with BS-seq enables the patterns of methylation at CpG sites and at GpC sites to be inferred, and hence
allows the determination of CpG island promoters as well as nucleosome positioning. Kelly et al. (2012) use this technique
to map nucleosome position around CTCF regions (an insulator that binds the consensus sequence CCGCGNGGNGGCAG
and whose binding is disrupted by CpG methylation). They show an anti-correlation between CpG methylation and
nucleosome occupancy. They provide “genome wide evidence that expressed non-CpG island promoters are nucleosome-depleted.”
3.10.7 ChIP-seq
ChIP (Chromatin Immunoprecipitation) is a technique by which the specific DNA sequences that bind to proteins can be
determined. This includes transcription factors, enhancers, even modified histones. This method identifies which DNA
sequences control expression and regulation of other, diverse genes. In the ChIP procedure, cells are treated with a reversible
cross-linking agent. The effect of this agent is to bind proteins tightly but reversibly to the chromosomal DNA
where they would normally bind. The DNA is then purified and broken into smaller chunks by digestion or shearing.
Antibodies (either general or specific) are used to precipitate any protein-DNA complexes that contain their target antigen.
After an immunoprecipitation step, unbound DNA fragments are washed away. At this point the crosslinking is reversed
and the bound DNA fragments are released.
The fragments can then be analyzed on a microarray chip (ChIP-on-chip) or via next-generation sequencing. The latter
has the advantage that many biases in microarray hybridization are eliminated and far more sensitive results can be
obtained; the sequences can then be analyzed to determine the DNA sequences that the proteins were bound to.
3.10.7.1 CLIP-seq
CLIP-seq (aka HITS-CLIP) is related to ChIP-seq and is a similar method to analyze RNA molecules associated with
proteins instead of DNA molecules. Here UV crosslinking is used to bind RNA to the protein that it is associated with
in the cell. Following DNase treatment, immunoprecipitation is used to pull down the RNA/protein complexes, and then the
RNA is reverse transcribed to DNA and sequenced. This method can be used with an antibody to Argonaute to identify
all microRNA targets. In general, the method provides transcriptome-wide RNA-binding protein maps.
3.10.9 Hi-C
In line with the use of sequencing to discover other aspects of biology, it can also be used to determine the 3-dimensional
architecture of the chromosomes within a cell. This was done in a Science paper by Lieberman-Aiden et al. (2009).
This method takes native chromosomal DNA and adds a chemical cross-linker (formaldehyde) to cross-link strands of DNA
that are physically close together. In this way the physical proximity of two strands is recorded and preserved. Then
the DNA is cut with a restriction enzyme, and the ends are repaired and marked with biotin. Then a ligase is added to the
mixture under very dilute conditions, which will favour self-ligation. The DNA is purified with proteases, sheared to
the appropriate size for sequencing, and the biotin-associated DNA is pulled down onto streptavidin beads. The beads are
isolated, the DNA eluted, adapters ligated, and the DNA sequenced. This process is diagrammed in Figure 3.22.
The end result is that pieces of DNA from two different strands that were physically near each other in the nucleus are
now available in a combination suitable for sequencing. Massively parallel NGS sequencing permits this to be done in a
genome-wide (or in this case a nucleus-wide) context.
The chromosomal origins of the two fragments are identified bioinformatically, and then the frequency of contacts between
chromosomes can be determined and a correlation matrix constructed; a sketch of this counting step is given below.
Examples of the results are shown in Figure 3.23.
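The counting step itself is straightforward; a minimal sketch (my own; the bin size and input format are invented for illustration):

import numpy as np

def contact_matrix(pair_positions, genome_length, bin_size=1_000_000):
    """Count Hi-C contacts between genomic bins.
    pair_positions: iterable of (pos1, pos2) mapped coordinates."""
    n = genome_length // bin_size + 1
    m = np.zeros((n, n), dtype=int)
    for p1, p2 in pair_positions:
        i, j = p1 // bin_size, p2 // bin_size
        if i == j:
            m[i, j] += 1
        else:
            m[i, j] += 1
            m[j, i] += 1          # keep the matrix symmetric
    return m

pairs = [(1_500_000, 1_700_000), (1_600_000, 9_200_000)]   # toy read pairs
print(contact_matrix(pairs, genome_length=10_000_000))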
They find that intrachromosome contacts are more frequent than interchromosome contacts and that the former often occur
in blocks or domains.
Figure 3.22: An overview of how Hi-C maps are constructed (from Lieberman-Aiden et al. 2009 Science 326:289)
Figure 3.23: An example of a Hi-C chromosome interaction map. (C) An intra-chromosome map of chromosome 14.
(B) An inter-chromosome map (red: above expectation, blue: below expectation; both from Lieberman-Aiden et al. 2009
Science 326:289)
More recently, Dixon et al. 2012 showed that these domains correlate with regions that constrain the spread of
heterochromatin. Their boundaries are often enriched for the insulator protein CTCF, housekeeping genes, and
SINEs. Lieberman-Aiden et al.’s results further suggest, for example, that the small human chromosomes tend to cluster
in the nucleus (Figure 3.23). In addition, they determined that the chromosomes within the nucleus tend to have a fractal
structure rather than a random one, and that this helps the chromosomes to dissociate in a knot-free fashion.
Many new opportunities exist in the application of this technique. Beware, however, that this is a relatively new technique;
it has significant technical challenges (completely glossed over here) and provides new bioinformatic challenges in
determining the 3D structure of the DNA.
And there are many other ‘seq’ methods including cel-seq, scRNA-seq, STAT-seq, DMS-seq, Frag-seq, Drop-seq, DroNC-
seq and so on – how good is your imagination?
3.11.1 Microarrays
A microarray is the placement of tens of thousands (sometimes hundreds of thousands) of molecular samples into a small
array. The goal of most microarray experiments is to analyze gene expression levels. This generally involves only the
transcriptome level, and hence any discrepancies caused by differential translation of transcripts are not taken into account.
There are several types of microarray and the variety grows each year, so I can only illustrate one example here: the glass
slide microarray with affixed cDNA sequences. The concept is to use a silica-coated slide (the same size as a standard
laboratory glass slide; 25 mm by 76 mm). With this single slide it is possible to monitor the expression of hundreds of
thousands of genes.
The first, and often most difficult, step is to collect samples of genes from an organism. Total genomic sequencing projects
provide this information, but do not usually provide suitable sequences specific to each gene. Another way to get the
sequences is by collecting the mRNA from an organism (let’s say humans) and using a reverse transcriptase (an enzyme
similar to a polymerase, but instead of copying DNA to DNA it copies RNA to DNA) to convert the mRNA into a DNA
copy (cDNA). Each of these cDNAs is then cloned and amplified to make thousands of copies of each one (usually via a
PCR reaction).
These DNA copies of individual genes are then spotted onto a glass slide that is coated with a polylysine solution (or
an aldehyde coating, or a variety of other coatings) to which the DNA will adhere. The spotting is accomplished using
microscopic pins (or microliter ink jets, from the technology developed around commercial ink-jet printers) that are
precisely positioned (robotically) on the slide. At the same time the identity of each individual spot is recorded along with
its coordinates on the slide.
The next step is to collect mRNA from a tissue of interest in which the gene expression levels will be measured. Using
microarray technology, absolute expression levels are difficult to determine, but relative expression levels can be estimated.
So a typical example would be to take mRNA from two different tissues, say a normal somatic human skin cell and
cells from a cancerous tumor. cDNA is constructed from each mRNA sample and a fluorescent probe is attached, with a
different fluorescent probe for each sample. The cDNAs are hybridized to the microarray slide in such a way
that homologous cDNA molecules will attach and bind to the corresponding spot on the slide (see Figure 3.24).
Figure 3.24: cDNA samples are labelled with red and green fluorescent probes, spotted onto the slide with a spotting pin,
and read out by laser excitation.
The fluorescence of each spot is measured by exposing the slide to two different lasers. Each laser emits a specific light
frequency that will excite only the corresponding fluorescent probe causing it to emit photons which can be captured by a
photometer.
By comparing the relative amounts of each fluorescent probe (red and green colours are traditionally used to visualize the
fluorescence) a measure of the gene expression level for each gene can be obtained. In our example, when the two images are
combined, a yellow spot indicates genes that are expressed in equal concentration in cancer and normal tissues. A red spot
indicates a gene that has been turned on in a cancer cell and a green spot indicates a gene that has been turned off in a cancer
cell. The total absence of a spot indicates that the gene is turned off in both tissues (and doesn’t attract non-specific binding)
(see Figures 3.25 - 3.26). To aid the analysis of so many genes, a common statistical practice is to cluster genes with similar
patterns of expression in a hierarchical fashion (Eisen et al. 1998 PNAS 95:14863-14868). Then the scientist can immediately
discover which genes are coordinately expressed. An example of such clustering is shown in Figure 3.27. Here the authors
have clustered about 5000 genes across the top of the microarray and, on the left, have clustered 98 breast cancer tumors. This
indicates that different breast cancer tumors have different collections of genes up/down-regulated. The authors, van ’t Veer
et al. (2002 Nature 415:530), were able to show that these tumors responded differently to chemotherapy. This difference
is undoubtedly due to this differential gene expression.
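Such hierarchical clustering can be sketched in a few lines with SciPy (the log-ratio matrix here is invented; rows are genes, columns are conditions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows = genes, columns = conditions; values are log2(red/green) ratios.
log_ratios = np.array([[ 2.1,  1.8,  2.0],
                       [-1.5, -1.7, -1.2],
                       [ 2.0,  1.9,  2.2],
                       [-1.4, -1.6, -1.1]])

# Cluster genes by the similarity (correlation) of their expression profiles.
tree = linkage(log_ratios, method="average", metric="correlation")
print(fcluster(tree, t=2, criterion="maxclust"))   # e.g. [1 2 1 2]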
In the end, with a single afternoon’s work (after a perhaps more substantial preparation time), a single laboratory could
generate more than half a million data points relating to the expression of hundreds of thousands of genes under different
conditions. This creates a large analysis task for the bioinformatician.
This short note does not scratch the surface of this technology. There are many types of surfaces to coat the slides. Many
types of ways to spot the cDNAs. Many ways to select the DNA to be spotted. Many ways to hybridize and label the
cDNA. Many ways to analyze the resulting data. On top of all that, as stated earlier, there are many other kinds of DNA
microarrays, of protein microarrays, of microbeads, and so on.
However, the use of microarrays strictly to examine gene expression has passed its peak. Microarrays have always had
problems with a naturally high variance in the estimated expression levels. This is because hybridization between
molecules is a complicated process and many things beyond the number of transcripts can influence it. Second, the
alternative hybridizations always forced the method to be strictly comparative. The use of new sequencing technologies
Figure 3.27: An example of genes clustered on the basis of their gene expression patterns (from van ’t Veer et al. 2002
Nature 415:530)
is now being used as a much more powerful alternative. Millions of transcripts can be quickly and cheaply detected by direct
sequencing and with less bias (but still perhaps an unexplored level of bias). Papers such as that by Marioni et al. 2008
suggest that cDNA sequencing from a single Illumina lane is comparable to the data of a microarray but additionally
permits detection of low-expressed genes, splice variants and novel transcripts.
There are, however, other applications of microarray technology, and currently they are used extensively as “capture arrays”.
The idea behind these is to place an extensive collection of manufactured oligos onto an array. Sample DNA can then be
sheared and hybridized onto this array. DNA that does not match (hybridize to) the oligos on the array is washed away.
After elution, the sequences that bound to the array can be analyzed in isolation from any contaminating DNA. In this way
the DNA of a single bacterium can be obtained from a metagenomic sample that might contain billions of other genomes.
Another application of microarray technology is through the use of protein arrays. Again the idea is to start with a DNA (or
RNA) array and pass proteins over the array. Using this technique you can discover which proteins bind to which segments
of DNA/RNA (or perhaps the interest is in which proteins don’t bind). Either way, the specificity of protein binding can be
investigated.
Mass spectrometry is another area that has seen rapid advances and which provides a massive amount of information very
quickly. A basic mass spectrometer will take a sample of some chemical/compound and then ionize it, creating a gaseous
spray that is injected into a vacuum chamber which separates the ions on the basis of their mass and charge. A detector
will then measure how long it takes the ions (time of flight) to reach it – a quantity dependent on their mass and charge.
Hence, mass spectrometry is a technique designed to measure the mass-to-charge ratio of a compound. It can do this
incredibly accurately, with mass accuracy better than 1 part per million.
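In idealized form the physics behind this is simple (a sketch that ignores the acceleration region, focusing optics and detector response): an ion of mass m and charge q accelerated through a potential U acquires kinetic energy qU, so

qU = (1/2) m v², with v = L/t, and hence m/q = 2 U t² / L²

where L is the length of the flight tube and t is the measured time of flight. Heavier, or less highly charged, ions therefore arrive later.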
Figure 3.29: Example of literature mining, linking diseases (e.g. cystic fibrosis, breast cancer, prion disease) to genes and
proteins (e.g. CFTR, p53, PrP) (modified from R. Feldman et al. Biosilico 1:69-80 2003)
This basic technique has several variants. MALDI-TOF is a technique that uses matrix-assisted laser desorption ionization
(time-of-flight) to create the initial ionization. These units have become relatively inexpensive and their descendants will
become common biology laboratory instruments. MS-MS is a tandem mass spectrometer which adds another chamber in
front of the detector, with nitrogen or argon gas to collide with the ions and break them into constituent pieces. LC-MS
combines liquid chromatography with mass spectrometry, while GC-MS combines gas chromatography with it, all to yield
further discriminating power. The latest technique is FT-ICR MS (Fourier transform ion cyclotron resonance). This makes
use of the fast Fourier transform to build the spectra from an analysis of the complete sample (rather than by combining
individual curves) and uses an induced resonance frequency of the ions to aid detection. This instrument is capable of
measuring the mass/charge of both large (protein) and small (metabolite) compounds without extensive preliminary sample
preparation.
While these methods are generally used to identify compounds, it is also possible to use them to identify protein
modifications and even to sequence a protein. This is done by purifying a protein and adding it to these instruments, which
will break it into constituent pieces. Since many intermediates of these fragments will also be present, and since the
mass/charge of the amino acids is known, a computer can quickly come up with a protein sequence that, when fragmented,
would yield the observed peaks.
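The matching logic can be sketched as follows (my own toy version using monoisotopic residue masses for a handful of amino acids; a real search engine also considers fragment-ion series and modifications):

# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
           "L": 113.08406, "K": 128.09496, "E": 129.04259, "F": 147.06841}
WATER = 18.01056   # added once per peptide (terminal H and OH)

def peptide_mass(seq):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE[aa] for aa in seq) + WATER

def matches(observed, seq, tol=0.01):
    """Does a candidate sequence explain an observed neutral mass?"""
    return abs(observed - peptide_mass(seq)) <= tol

print(round(peptide_mass("GASV"), 4))   # 332.1696
print(matches(332.17, "GASV"))          # True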
Using these techniques it is possible to determine the presence of any collection of proteins within a complex cellular tissue
(this requires knowledge of the pieces first – i.e. a genome sequence), and hence to gather a snapshot over time of a cell’s
entire proteome. Through multiple time-course analyses it is possible to determine the changing concentrations of all these
proteins. And it is possible to examine the metabolome of a cell – the concentration of smaller constituent molecules
such as sugars and alcohols. An example of such a mass spectrum is shown in Figure 3.28. Here the metabolites from the
leaves of a plant grown under high-salt conditions are compared to the metabolites from the leaves of a plant grown
under normal salt conditions. There are obvious differences that can be utilized to determine how the plant has responded
to this salt stress.
Again, there are a large number of peaks in such chromatograms, representing a large number of compounds, influ-
enced/controlled by proteins and genes. Fertile ground for a bioinformatician.
Obviously this is yet another rich source of information. There are too many other technological advances to discuss them
all here. Most of them, however, require a scientist with a bent toward quantification and data analysis.
Chapter 4
Databases
4.1 Introduction
Molecular biology has undergone amazing advances in the last twenty years. We can now sequence DNA and proteins in
almost any laboratory in the country. Indeed it is sometimes even given to undergraduate students as a laboratory exercise.
Most universities don’t do this, but that is because of the use of radioisotopes rather than the difficulty of the technique. The
ability to rapidly and easily sequence DNA has also led to a shift in the way that science is now done. With automated,
massively parallel sequencing machines we would no longer give undergraduates an exercise to sequence something, as
this is the realm of robots and not humans.
It has become easier to simply sequence the gene (sometimes a whole genome), as more preliminary information can often
be gained this way than by carrying out a sophisticated and well-thought-out experiment. These advances have led to
the establishment of genome projects. In the past there were large-scale projects to set up laboratories to sequence DNA
in an efficient way rather than having each laboratory do the sequencing in house. The largest of these projects was the
human genome project, to which the United States government alone committed $3,000,000,000.00 (that is three billion!).
Other governments of the world also supported their own projects. Additionally, more organisms than just humans are
being examined and many have been sequenced. Within the last half decade, the cost of sequencing and the cost of the
sequencers have fallen so drastically that genome projects can be done by individual laboratories. Indeed, the estimated cost
to sequence a single human genome is approaching $1000 or less.
Computer technology has also undergone an amazing advance in the last twenty years. It is now unusual for scientists not
to have an extremely fast, multi-processor computer on their desk. Additionally, though somewhat contrary to popular use,
these computers are not simply fancy typewriters. They have many capabilities beyond word processing and can deal with
a large amount of information. They are also capable of doing analyses that are beyond the computational ability of any
scientist.
One of the other major advances in computer technology has been in connectivity. Computers are now connected to
networks that permit access to other computers all over the world. This of course allows students to chat with one another
but it also means that the data generated by these sequencing machines (and other biological instruments) are available to
anyone anywhere in the world. This permits anyone with a computer to access databases of all kinds - if they know how.
The purpose of this section is to provide you an entry point to this knowledge.
None of the genome projects, nor most of the other projects that create databases, would have been funded if their research
was kept private. Indeed an openness about research results has been a long standing principle that has guided science. It
is oft quoted that “the experiment is not finished until it has been published”. Publication has been the traditional method
of permitting worldwide access to research results. However, the retrieval of this information can be a labour intensive
practice that required great skill in the days of paper publications. Here, I am mainly referring to simple factual data rather
than experiments that require interpretation. To accumulate this factual data and to make use of it is often difficult. With
computer databases, however, this data is as accessible to you as it is to the expert that compiled it. You can bring the data
directly to your desktop in its entirety, cut/paste the pieces you want, and analyze it according to your fancy. An article by
Elementary Sequence Analysis
60 edited by Brian Golding, Dick Morton and Wilfried Haerty August 2017
W. Gilbert in NATURE suggests that this combination of advances will lead to a shift in the way science will be done in
the future.
The steady conversion of new techniques into purchasable kits and the accumulation of nucleotide sequence data in the
electronic data banks leads one practitioner to cry, “Molecular biology is dead - Long live molecular biology!”.

There is a malaise in biology. The growing excitement about the genome project is marred by a worry that something is
wrong - a tension in the minds of many biologists reflected in the frequent declaration that sequencing is boring, and yet
everyone is sequencing. What can be happening? Our paradigm is changing.

Molecular biology, from which has sprung the attitude that the best approach is to identify a relevant region of DNA, a
gene, and then to clone and sequence it before proceeding, is now the underpinning of all biological science. Biology has
been transformed by the ability to make genes and then the gene products to order. Developmental biology now looks first
for a gene to specify a form in the embryo. Cellular biology looks to the gene to specify a structural element. And medicine
looks to genes to yield the body’s proteins or to trace causes for illnesses. Evolutionary questions - from the origin of life
to the speciation of birds - are all traced by patterns on DNA molecules. Ecology characterizes natural populations by
amplifying their DNA. The social habits of lions, the wanderings of turtles and the migrations of human populations leave
patterns on their DNA. Legal issues of life or death can turn on DNA fingerprints.

And now the genome project contemplates working out the complete DNA pattern and listing every one of the genes that
characterize all of the model species that biologists study - ourselves even included.

At the same time, all of these experimental processes - cloning, amplifying and sequencing DNA - have become cook-book
techniques. One looks up a recipe in the Maniatis book, or sometimes simply buys a kit and follows the instructions in the
inserted instructional leaflet. Scientists write letters bemoaning the fact that students no longer understand how their
experiments really work. What has been the point of their education?

The questions of science always lie in what is not yet known. Although our techniques determine what questions we can
study, they are not themselves the goal. The march of science devises ever newer and more powerful techniques. Widely
used techniques begin as breakthroughs in a single laboratory, move to being used by many researchers, then by
technicians, then to being taught in undergraduate courses and then to being supplied as purchased services - or, in their
turn, superseded.

Fifteen years ago, nobody could work out DNA sequences; today every molecular scientist does so and, five years from
now, it will all be purchased from an outside supplier. Just this happened with restriction enzymes. In 1970, each of my
graduate students had to make restriction enzymes in order to work with DNA molecules; by 1976 the enzymes were all
purchased and today no graduate student knows how to make them. Once one had to synthesize triphosphates to do
experiments; still earlier, of course, one blew one’s own glassware.

Yet in the current paradigm, the attack on the problems of biology is viewed as being solely experimental. The ’correct’
approach is to identify a gene by some direct experimental procedure - determined by some property of its product or
otherwise related to its phenotype - to clone it, to sequence it, to make its product and to continue to work experimentally
so as to seek an understanding of its function.

The new paradigm, now emerging, is that all the ’genes’ will be known (in the sense of being resident in databases
available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist
will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis. The actual
biology will continue to be done as “small science” - depending on individual insight and inspiration to produce new
knowledge - but the reagents that the scientist uses will include a knowledge of the primary sequence of the organism,
together with a list of all previous deductions from that sequence.

How quickly will this happen? It is happening today: the databases now contain enough information to affect the
interpretations of almost every sequence. If a new sequence has no match in the databases as they are, a week later a still
new sequence will match it. For 15 years, the DNA databases have grown by 60 per cent a year, a factor of ten every five
years. The human genome project will continue and accelerate this rate of increase. Thus I expect that sequence data for
all of the model organisms and half of the total knowledge of the human organism will be available in five to seven years,
and all of it by the end of the decade.

To use this flood of knowledge, which will pour across the computer networks of the world, biologists not only must
become computer-literate, but also change their approach to the problem of understanding life.

The next tenfold increase in the amount of information in the databases will divide the world into haves and have-nots,
unless each of us connects to that information and learns how to sift through it for the parts we need. This is not more
difficult than knowing how to access the scientific literature as it is at present, for even that skill involves more than a
traditional reading of the printed page, but today involves a search by computer.

We must hook our individual computers into the worldwide network that gives us access to daily changes in the database
and also makes immediate our communications with each other. The programs that display and analyze the material for
us must be improved - and we must learn how to use them more effectively. Like the purchased kits, they will make our
life easier, but also like the kits we must understand enough of how they work to use them effectively.

The view that the genome project is breaking the rice bowl of the individual biologist confuses the pattern of experiments
done today with the essential questions of the science. Many of those who complain about the genome project are really
manifesting fears of technological unemployment. Their hard-won PhDs seem suddenly to be valueless because they think
of themselves as being trained to a single marketable skill, for a particular way of doing experiments. But this is not the
meaning of their education. Their doctorates should be testimonials that they had solved a novel problem, and in so doing
had learned the general ability to find whatever new or old techniques were needed; a skill that transcends any particular
problem.
To indicate how far this shift has occurred, a famous author had the temerity to publish an article in Cell with the title
“Sequence first. Ask questions later”.
There is now a new concept of public data. Everyone who desires access can retrieve this data. This includes not only
scientists and medical practitioners but also private companies and members of the general public. The data are also raising
a large number of ethical problems that have not yet been fully considered.
These advances have combined to create a new field of science. This is called bioinformatics (along with its relative –
medical informatics). It is, basically, a mixture of computer science, mathematics, and biology. It combines aspects from
all three fields to study the methods and the problems associated with the task of bringing information to a researcher,
sorting this mass of information in a meaningful way, and then analyzing it.
Our concern in this section will be focused on the databases of relevance to molecular biology. However, you should be
aware that this is but the tip of the iceberg – there are many databases of many natures. There are other biological databases
such as some of a biochemical nature, one on enzyme kinetics, some of a more general nature, and some just plain weird.
The major databases for molecular biology are centered around the molecular sequence databases. The genome projects
supplying these databases promise to yield the greatest mass of data that biology has ever seen. The human genome alone
covers 3 billion nucleotides. In February of 2001, the completion of the human genome draft sequence was jointly
announced by the private company Celera Genomics and the publicly funded Human Genome Project (Figure 4.1). This
represents an enormous accomplishment and will probably represent the biggest achievement since the discovery of the
structure of DNA.

But a single human genome (and a mosaic of several individuals at that) was only the beginning. By June 2012 over 1000
distinct human individuals had been completely sequenced as part of a single large project, the 1000 human genomes
project. But this is just 1000 genomes. In 2016, a single article published in PNAS reported the “Deep sequencing of
10,000 human genomes”; just one paper.

Figure 4.1: The completion of a first draft of the human genome was announced in February of 2001
And there is no reason to stop at humans. There are many other eukaryotes whose genomes have been sequenced and even
more in the pipeline. Currently in the public domain there are many more whole eukaryotic genome shotgun projects
nearing completion. These genomes include fish, nematodes, insects, birds, mice, rats, cows, plants, and many
more to come.
This mass of data also presents many problems – how do you store all of this information, how do you access it, and how
do you move it? The rate of accumulation of sequence data is growing exponentially. This is partly because the
technology to carry out DNA sequencing has rapidly advanced. Today, almost the entire job can be carried out by robots
– from an input of tissue, the robots can automatically extract the DNA, amplify regions of interest, and prepare sequence
cocktails. These are then loaded onto the gels of automatic sequencing machines. In the case of Sanger sequencing, the
machines will run the gels and a laser will scan the gels and calculate the DNA sequence; in the case of the other machines,
similar automated processes occur with sequence reads automatically entered into computer files. Finally, the sequence
is often automatically passed on to computer clusters for preliminary analysis and these computers might automatically
assemble the fragments, search/compile databases, or perform other analyses. In addition, as the cost goes down, the
number of laboratories that routinely sequence DNA has increased.
The result of this increased activity is shown in Figure 4.2. Some of this data is annotated but since 2004 EMBL has
included in its data releases nucleotides of mostly unannotated whole genome shotgun data. Over 692 billion of the
nucleotides in the database come from this and other raw data sources. The current (June 2016) official EMBL release 128
yields over 1700 billion nucleotides and is the data that is plotted in Figure 4.2. The rate of growth of the data has been close
to exponential for many years. Obviously an exponential growth cannot continue (physical laws prevent this). However, I
have stated this every year since I (BG) first taught this course in 1990. I think that finally (after many years) I have been
proven correct and physical laws have taken hold (hmmmm, maybe not; comparable figures from other years are 1993,
1995, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016,
2017). (How can one tell if it is exponential? Well, a simple way is to see if the data form a straight line when plotted on a
log scale. Take a look at a log plot and see if you think that these data are linear.) Perhaps a definite slowdown in the rate
within EMBL can be seen, but growth is still exponential at a somewhat slower pace.
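As a sketch of this straight-line check, one can fit a line to the log-transformed database sizes and inspect the fit. The
year/size pairs below are invented for illustration only; they are not actual EMBL release figures.

import numpy as np

# Invented (year, nucleotides) pairs -- illustrative only.
years = np.array([1995, 2000, 2005, 2010, 2015])
nucleotides = np.array([3.5e8, 9.0e9, 1.0e11, 2.8e11, 1.7e12])

# If growth is exponential, log(nucleotides) is linear in time.
slope, intercept = np.polyfit(years, np.log10(nucleotides), 1)
print(f"doubling time ~ {np.log10(2.0) / slope:.2f} years")

# A correlation near 1 on the log scale supports the straight-line
# (i.e. exponential) description.
r = np.corrcoef(years, np.log10(nucleotides))[0, 1]
print(f"correlation on the log scale: r = {r:.3f}")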
The slowdown is, however, not so much due to the mundane reason that such things are physically impossible. No indeed,
the advances in sequencing technology have more than kept pace, as evidenced by the previous chapter. Rather, it is
because the cost of sequencing has been reduced and a new type of DNA sequencing project is emerging to become more
and more common. Resequencing an already known genome was an unthinkable cost just a few years ago. But by July
2008, 36 strains of Saccharomyces cerevisiae had been completely sequenced, as had fifteen strains of mice and three
nematodes. Such resequencing data is often not entered into the databases and indeed, for some bacterial genomes, the
resequencing data from thousands of genomes might even be considered too cheap to be worth the effort of finishing the
data and trying to get it submitted.
More large sequence projects beyond the human genome are in progress. The sequencing of the entire genomes of 1000
humans has been completed and the data is now public record with a devoted NCBI browser. There are now large
(multi-institutional) projects to sequence 10,000 eukaryotic genomes; projects to sequence human centenarians; projects to
sequence the entire human microbiome (all the bacteria, fungi and microbes associated with humans); and a project to
sequence a total of 100,000 bacterial genomes (!!).
We now enter the new realm of population genomics. No longer is it enough to sequence a single genome of a species.
It is now feasible and academically intriguing to sequence all of the genomes of a population or of a community. So,
for example, the NCBI pathogens database (https://www.ncbi.nlm.nih.gov/pathogens/) contains data on the genomes of
bacterial pathogens; for Salmonella alone there are 139,754 entries!
Such data are unlikely to find their way into the databases in their entirety. But regardless, all of this mass of data is open
for analysis and is a rich research field. It is now stored (at NCBI; below) in a separate archive called the Sequence Read
Archive (SRA). On August 9, 2018 it contained 19,469,647,048,078,497 total bases (19 peta-bases). Even the name of
this database has had to change: with the growing lengths of reads produced by newer sequencing technology, the original
name “Short Read Archive” was no longer appropriate. Its growth is shown in Figure 4.3. Note that, on this scale, the
database didn’t exist just a few years ago.
There are three major nucleotide sequence databases. These are EMBL (European Molecular Biology Laboratory), NCBI
(the U.S. National Center for Biotechnology Information) and DDBJ (the DNA Data Bank of Japan). Each of these
databases attempts to collect all of the known nucleic acid (DNA/RNA) sequences. The sequences were originally collected
from published sources and most journals now require submission of the sequences to a database before publication is
permitted. Many sequences are directly deposited into the databases and will not be published in any other form. In
addition to the sequences, the databases also contain many other useful bits of data, including (but not limited to) organism,
tissue, function, and bibliographic information.
All three of these organizations are in electronic contact with each other and exchange sequence information daily. Hence,
you need not worry that a sequence of interest might be present in one database but missing from another (at least in
theory, anyway).
The following sections are intended to give you a flavour of the database contents.
Figure 4.2: The growth of the EMBL nucleotide sequence database, 1980 to 2020.
4.2 N.C.B.I.
The easiest way to explain what is contained in the database is to examine an actual entry from the database. This is shown
below. This example contains the nucleotide sequence of the first exon of the human lung adenocarcinoma (PR310)
c-K-ras oncogene.
Note that the actual sequence information provided at the end of the entry may be, as in this case, only a small fraction of
the total data entry. NCBI organizes its entries onto several lines, each of which begins with a special header. The first
header, and that which always begins the entry, is the LOCUS name. This provides an identifying code word (in uppercase)
to be associated with this entry. It also gives the length of the sequence and the date the sequence was entered or last
modified. The next line contains a short DEFINITION of the sequence that is contained in the entry. An ACCESSION
number is a unique identifier for this data entry. Note that only the accession number will necessarily be constant across
the nucleotide databases (NCBI, EMBL, DDBJ). The accession numbers are unique among these three nucleotide databases
but are not necessarily unique between other databases (e.g. between protein and nucleotide databases). LOCUS names
are variable and can be changed. The LOCUS names are often changed as nomenclature changes or as sequences are
merged into larger entries. The ACCESSION number and the LOCUS name are two character strings that can be easily
used to access and retrieve sequence entries from NCBI.
Following these, come KEYWORDS that identify the particular entry. The SOURCE line describes how the sequence was
cloned/sequenced and the ORGANISM line describes the species/construct from which the sequence originates. Following
this, is a description of REFERENCES that deal with this entry. Note that this would include only the original papers
describing the sequencing and not any other subsequent papers that might analyze the sequence. Multiple references will
be given when different labs have sequenced the same DNA or when different publications describe different parts of
the sequence. Throughout the NCBI database, numbers in square brackets indicate items in the REFERENCE list. The
STANDARD describes any checks on the accuracy of the sequence.
The FEATURES section will describe things such as coding sequence start/stop, leader sequence start/stop, presence of
signal sequences, locations of exons/introns, repeats, polymorphisms, and so on. There may also be comments in this
section that can be useful. It may describe some of the interesting facts that go along with this sequence. This
might include why it was sequenced, how it relates to other sequences in the database, some unusual features of the
sequence, etc.
The BASE COUNT line gives the count of each nucleotide in the sequence. The ORIGIN line gives details of where
the sequence starts relative to restriction sites (or other location markers) that aided the cloning.
Finally the sequence follows in lower case, in groups of 10 and with the number of the first nucleotide given on the left.
The above example entry is a particularly short sequence. This is not the norm for NCBI entries. Most entries contain a
longer sequence as shown in the example below.
This is still a very short entry. Many of the entries originate from complete genomes (or exceptionally long contigs that
when joined together represent a complete genome). These can of course be many millions of nucleotides long.
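Because an entry is organized around these fixed keywords, it can also be read programmatically. The following is a
minimal sketch using the Biopython package (an assumed dependency, not part of these notes); the file name entry.gb is
hypothetical and stands for a local file holding a single entry such as the one above.

from Bio import SeqIO

# Read a single GenBank-format entry from a local file.
record = SeqIO.read("entry.gb", "genbank")

print(record.name)           # the LOCUS name
print(record.id)             # the ACCESSION (with version)
print(record.description)    # the DEFINITION line
for feature in record.features:
    print(feature.type, feature.location)   # the FEATURES table
print(record.seq[:60])       # the first 60 bases from ORIGIN onward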
As of May 2015, there were one hundred and seventy-eight eukaryotic organisms listed as completely sequenced at
http://www.ebi.ac.uk/genomes/eukaryota.html. These include the “lab-rat” yeast Saccharomyces cerevisiae
and other fungi, single celled protists, the nematodes Caenorhabditis elegans (and C. briggsae), the plants Arabidopsis
thaliana and Oryza sativa, twelve complete fruit fly genomes (including the ubiquitous Drosophila melanogaster), the
mosquito Anopheles gambiae, fish (Danio rerio and Tetraodon nigroviridis), the chicken Gallus gallus, mouse Mus mus-
culus, the rat Rattus norvegicus, man’s best friend Canis familiaris, the Rhesus macaque Macaca mulatta, the chimpanzee
Pan troglodytes and Homo sapiens. Many particularly interesting genomes have been done such as that of the duck-billed
platypus (Ornithorhynchus anatinus; Nature 2008 453:175-83). However, this listing is horribly out of date: many
completed genomes are not listed at all and the listing has not been updated since May 2015. The humans maintaining the
database entries can’t keep up with the automated sequencing machines.
The prokaryotic genomes are sequenced more rapidly since they are much smaller. As of August 2015, at least 202
archaeal and 3316 bacterial genomes had been completely sequenced (the first being Haemophilus influenzae in 1995,
coming in at 1,830,140 bp), many more are being sequenced, and many new completely sequenced genomes now don’t
even make it into the databases.
In the past, full genome sequences were reported as major milestones of achievement in journals such as SCIENCE and
NATURE. In the last 1.5 months of 1997, the journal NATURE published five issues. Of these, three issues reported
the completion of three different bacterial genomes (Bacillus subtilis, Archaeoglobus fulgidus, and Borrelia burgdorferi).
Today, only a few genomes are still reported in high profile journals, with others published in more specialized journals
and some simply published online. The rate of genome completion is rapidly speeding up and, as more genomes are
completed, the newsworthiness of a single genome (or even dozens) tends to diminish. But their utility increases with each
one determined.
In addition to the complete genomes, there is a long list of viruses and organelles (e.g. see also the OGMP - organelle
genome megasequencing program) that have been completely sequenced including the CMV DNA virus (300,000 bp), the
Epstein-Barr virus genome (172,282 bp), the AIDS virus (9,737 bp), human mitochondria (16,569 bp), human leukaemia
virus type I (9,032 bp), lambda (48,502 bp), PhiX174 (5,386 bp) and more.
4.3 E.M.B.L.
The same entry as in Example #1 above is also present at EMBL (as they all should be). Its
format is shown in Example #3.
Note that the entry contains the same information but in a slightly different form. In this case, the data is more structured
with defined prefixes at the beginning of every line. This difference can be useful if you wish to write your own code to
analyze some features of this data.
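As a sketch of how those prefixes simplify such code, the short Python reader below groups the lines of one EMBL entry
by their two-letter code and concatenates the sequence lines; the file name entry.embl is hypothetical.

def parse_embl(path):
    """Group EMBL flat-file lines by their two-letter prefix."""
    fields, sequence = {}, []
    with open(path) as handle:
        for line in handle:
            code = line[:2]
            if code == "//":        # end of the entry
                break
            if code == "  ":        # sequence lines begin with blanks
                # keep the bases, drop blanks and coordinate numbers
                sequence.append("".join(c for c in line if c.isalpha()))
            else:
                fields.setdefault(code, []).append(line[5:].rstrip())
    return fields, "".join(sequence)

fields, seq = parse_embl("entry.embl")
print(fields.get("ID"), fields.get("AC"))
print(len(seq), "bases")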
The EMBL databases have moved from Heidelberg, Germany to Hinxton Hall (Cambridge, England), but many of the
people doing protein analysis are still based at Heidelberg and throughout the EMBL organization. For more information
about the EMBL database check out their web site or send e-mail to [email protected].
4.4 D.D.B.J.
For the particular entry chosen in Example #1, the DDBJ format is essentially equivalent (there
are minor differences). It is ...
The DDBJ began in 1986 and is operated with grants from the Japanese Ministry of Education, Science and Culture. For
more information go to their web site.
4.5 SwissProt
These are the three major nucleotide databases, but there are also a large number of protein se-
quence databases. Again many of these databases are very large. For example, release 2018 07
of UniProtKB/Swiss-Prot (18 July 2018) contains 557,992 annotated entries containing a total
of 200,270,360 amino acid residues. There are more than 1,269 entries having proteins larger
than 2500 residues including the absolutely massive human nebulin protein of 6669 amino acids,
human nesprin 1 protein of 8797 amino acids, the Caenorhabditis elegans mesocentin protein of
13100 amino acids and the mouse titin protein of 35213 amino acids (currently the largest protein
in SWISS-PROT). The entries in this database are similar to the nucleotide databases of EMBL.
Two examples are shown below.
Again, various features are on individual lines – the identification line (ID) giving a locus name, the accession number
line (AC), the date of entry (DT), a description of the entry (DE), a line specifying the organism (OS), the organism’s
phylogenetic classification (OC), lines describing the reference number, author and location (RN, RA, RL), the comment
lines (CC), a database reference line (DR) to cross reference the entry to other database entries, the keyword line (KW),
the feature tables (FT) and the sequence header (SQ) giving length in aa, molecular weight and a checking number defined
in A. Bairoch, J. Biochem. 203: 527 (1983).
In addition to these protein databases, there are databases devoted to particular families of pro-
teins and to particular organisms. In addition there are protein databases constructed from trans-
lations of the nucleotide databases – NCBI’s is called GenPept and EMBL’s is termed TrEMBL
(release 2018 07 of UniProtKB/TrEMBL (18 July 2018) combines the translated EMBL nucleotide
database, the Genbank database and Swiss-Prot to yield 120,243,849 sequence entries comprising
40,506,871,635 amino acids). The best access for the SwissProt database is through the UniProtKB
web site (or the ExPASy (Expert Protein Analysis System) web site).
Another protein sequence database of interest is PIR (Protein Information Resource) sponsored by NBRF (National
Biomedical Research Foundation) at Georgetown University. This database of protein sequences is completely cross-
referenced to known nucleic acid sequences, has data on x-ray crystallography and active site determination, and is fully
annotated. The last release of this database was in December 2004 as it is now integrated into UniProt.
In an effort to combine the information in these disparate protein databases, the UniProt database was constructed. It joins
the information contained in Swiss-Prot, TrEMBL, and PIR. UniProt (Universal Protein Resource) claims to be the world’s
most comprehensive catalog of information on proteins. It is divided into three parts: UniProt Knowledgebase (UniProtKB)
is the central access point for extensive curated protein information, including function, classification, and cross-reference.
The UniProt Reference Clusters (UniRef) databases combine closely related sequences into a single record. The UniProt
Archive (UniParc) is meant to be a comprehensive repository for all protein sequences.
Data class    EMBL rel 124  EMBL rel 104  EMBL rel 83  EMBL rel 63  EMBL rel 44  IG rel 63
              Jun 2015      Jul 2010      Aug 2005     Jun 2000     Aug 1995     Jun 1990
Constructed   864785        -             -            -            -            -
ESTs          42497         36234         14544        1641         100          -
GSSs          25423         18322         7155         838          -            -
HTC           629           658           422          -            -            -
HTG           25528         24132         11490        3756         -            -
HTG0          -             -             510          -            -            -
Patents       15272         8191          1450         -            -            -
Standard      72562         -             -            -            -            -
STSs          640           634           492          51           6            -
TSAs          65576         -             -            -            -            -
WGS           1078521       172426        -            -            -            -
Table 4.2: The Most Sequenced Organisms (DDBJ:release 92, March 2013 vs. GB:Rel. 104 1997)
growing groups were fungi sequences. It is amazing to see the growth reported in just a few years.
The top twenty organisms (excluding chloroplast and mitochondrial sequences) as of June 2012 are shown in Table 4.2.
Note that humans have already been done several times over. Also note that bacterial sequences can no longer make it
among the top twenty. Even with multiple sequencing of their genomes, they do not contain the quantity of sequence that
is present in eukaryotes. However, there are three unusual entries that now make it into the top 20: the “marine
metagenome”, “uncultured bacterium” and “unknown”. These are sequences that arise from what is known as
environmental sequencing. In these analyses, DNA is simply isolated from sea water or from soil or from some other
natural environment. It is isolated as raw DNA, without regard to the organism from which it originates, and is simply
sequenced. These sequences can then be built into contigs and whole organisms can be sequenced without ever knowing
what you have actually sequenced! Since some organisms are more dominant than others in sea water, their genomes are
sequenced many times.
Besides the data itself, most of the databases also maintain various index files. These include indices of authors, journals,
organisms, etc., all cross-referenced by accession number or locus name. Again, these can be very helpful when analyzing
the entries of the databases.
bacterial protein synthesis, and that have significantly different sequences or structures from those in humans. But if one
database describes these molecules as being involved in ’translation’, whereas another uses the phrase ’protein synthesis’,
it will be difficult for you - and even harder for a computer - to find functionally equivalent terms”. The Gene Ontology
(GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.
This is quickly becoming a standard that must be used in the annotations of new genomes.
The PDB (Protein Data Bank), sponsored by Rutgers University, contains the 3-D atomic coordinates from x-ray diffraction
or NMR studies. The database also contains secondary and other structural features such as bond connectivity data. The
individual database entries are usually directly suitable for entry into 3-D rendering programs.
The PROSITE database is very useful. It lists the distinct structural motifs in proteins. This
includes amino acid post-translational modifications, topogenic sequences, domains of specific
biological function (e.g. DNA binding domains), enzyme active sites and signature patterns that
are specific to a family or group of proteins. For example, it lists the Kringle domain signature as
(Y,F)-C-R-N-P-D; a triple-looped, disulfide cross-linked region found mostly in serine proteases
and plasma proteins. For more information on this database, send e-mail to the EMBL databases.
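Signature patterns of this kind map naturally onto regular expressions. The sketch below converts a simple PROSITE-style
pattern (written with square-bracket alternatives) into a Python regular expression and scans an invented protein sequence
with it; it handles only the basic pattern elements.

import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regex."""
    regex = ""
    for element in pattern.split("-"):
        if element == "x":
            regex += "."                         # x matches any residue
        elif element.startswith("{"):
            regex += "[^" + element[1:-1] + "]"  # {AB} = anything but A or B
        elif element.startswith("[") or len(element) == 1:
            regex += element                     # [AB] alternatives or a literal
        else:
            raise ValueError("unhandled pattern element: " + element)
    return regex

kringle = prosite_to_regex("[YF]-C-R-N-P-D")     # -> "[YF]CRNPD"
protein = "MKTAYCRNPDGSEQ"                       # invented example sequence
match = re.search(kringle, protein)
print(match.start() if match else "no match")    # prints 4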
Some databases are built on filtered data and are ‘value-added’ derivatives of these basic sequence databases. For example
the COG database holds aligned clusters of orthologous proteins. There are proteins collected from 43 completed genomes
and then compared “all-against-all” to yield 3307 clusters of related proteins (as of August 2002). One of the tools
associated with this database is the COGnitor program, which will assign query proteins to pre-existing clusters (and
hence usefully identify their functional category).
Besides these sorts of databases, there are also databases containing different types of information. The GDB (Genome
Database) contains mapping information of the human genome project. It contains information on the location of genes,
DNA segments, expressed sequence tags (EST’s), clinical phenotypes, polymorphisms and alleles, probes, CEPH reference
family data markers, etc. As part of this database, Victor McKusick’s original Mendelian Inheritance in Man (OMIM)
has been made available as an online, freely available computer database. This database lists clinical disorders or traits
in man, gene names, clinical observations, inheritance patterns, allelic variations, chromosome locations, linkages and so
on. The GDB and OMIM and most of the other molecular biology databases are all cross-linked. These databases are
maintained by the Welch medical library at Johns Hopkins.
Similar to the GDB, is the database for the mouse genome MGI: Mouse Genome Informatics. This
again contains mapping information of much the same sort as the human database. However, it also
includes homologies for mice, humans and 23 other species. Thus, if you are interested in a gene
on chromosome 11 in mice, you can find out where it has been located in some other species (the
references to the papers showing this, what other genes are similarly located and so on).
Pick an organelle and again you will find specialized databases. For just the mitochondria, try the comprehensive MITOP
web site, the Human Mitochondrial Protein database, Mitochondrion db (a database of Mendelian inheritance and
mitochondrial nuclear genes), or MITOMAP, a database for the human mitochondrial genome. If a whole organelle is too
large for your tastes, how about picking something smaller, like a database for just part of the mitochondria, the
hypervariable control region at HvrBase. Or how about the weird and wonderful inteins. Don’t know what inteins are?
Check out that link.
Recently there have been more projects to establish databases for the depo-
sition of microarray gene expression data. The NCBI version of this data
is housed in Gene Expression Omnibus (GEO). GEO is a gene expression
and hybridization array data repository, as well as an online resource for the
retrieval of gene expression data from any organism or artificial source. The
EBI microarray ArrayExpress Database is a similar database to store and
permit the query of microarray experiments.
There are many more databases; I have only given you a taste of some of the major ones. In addition to each of these
major databases, there are databases for each of the organisms that have major genomics projects:
• E. coli,
• Yeast,
• Arabidopsis,
• Mouse,
• Cattle,
• Drosophila,
• Caenorhabditis elegans,
• Fungi,
• Maize,
• Rice,
• Grasses - Gramene,
• HIV,
and so on. So pick your favorite organism and do a search for it – there will be a web site devoted to its genome (so long
as it is not too unusual).
There are many other databases on diverse aspects such as the
• CEPH-Genethon human physical map data and Genethon database provide a connection between the physical map
of the human genome and the genetic/sequence data;
• a human cDNA database;
• genome sequencing centers such as Baylor College of Medicine and the Sanger Center maintain their own databases
on projects they are working on;
• an immunogenetics database Immuno Polymorphism Database (IPD);
• the Database of Expressed Sequence Tags dbEST;
• the Database of Sequence Tagged Sites dbSTS;
• the Eukaryotic Promoter Database EPD;
• a database of 3D-diagrams of proteins;
• BMRB (BioMagResBank) a database of NMR Spectroscopy data;
• CCDC (Cambridge Crystallographic Data Centre);
• HIC-Up (Hetero-compound Information Centre Uppsala), a database of small molecules commonly found associated
with larger molecules when their 3D-structures are determined;
• HIV Drug Database;
• HIV Structural Database;
• Library of Protein Family Cores;
• NDB (Nucleic Acid Database);
• Prolysis: A Protease and Protease Inhibitor Web Server;
• Protein Kinase Resource;
• Protein Motions Database (see Figure 4.4);
• RELIBase;
Figure 4.4: Proteins are not static sequences; they move, and there is a database devoted to this subject. The movement of
the actin protein is shown here. From the Protein Motions Database.
This list goes on and on (and increases each month). Choose what you are interested in ... chances are others are interested
as well and have built a database.
4.8.1 Entrez
The premier method that should be mentioned is probably the one method that you will use more than any other. This is
an NCBI project termed ENTREZ, a program that can search across all of the NCBI databases at once or within any one
of them individually.
One of the unique features of ENTREZ is that it was the first molecular biology database to incorporate links from one
type of data (e.g. nucleotide) to the others (e.g. to proteins via translations, to MEDLINE entries via their MeSH numbers
(NLM’s Medical Subject Headings)). In addition, it incorporates an algorithm that identifies related entries in the databases.
By related, we might mean genes in the same multigene family, articles written about genes that have the same function,
or other proteins that function in the same biochemical pathway. Hence, besides requesting the sequence for something,
you can also find “everything else like this one”. Thus, you can request MEDLINE abstracts of papers that are on similar
or related topics (without any prior knowledge of their existence). Besides these “soft-links” via MeSH numbers, there are
also hard links encoded in the database that relate, for example, the abstract of the paper that reported a sequence to the
sequence entry itself, or a protein entry to its underlying nucleotide sequence.
When using the ENTREZ program, as with any web search engine, you must learn how to limit your search and to perform
(as far as possible) formatted queries. In the web-based form of the ENTREZ program this is done through the “Limits”
tab and the “Preview/Index” tab located just below the query entry box. These tabs permit the search to be restricted to
individual organisms, to particular features and so on. They permit previous queries to be combined with logical operators
such as AND, OR, NOT. The “neighbor” tab will be found among one of the many menu items that make this program a
very powerful search engine. These are best explored through actual use.
The search fields that are available are shown in Table 4.3. These fields can be entered directly into the query search as
“(adh OR mdh) AND Drosophila [ORGN] AND 1000:5000 [SLEN]” for example. The square brackets limit the previous
term to the designated field – in this case search for the word Drosophila only in the organism field (ORGN). But the adh
and mdh terms are searched in all fields by default. The range operator ‘:’ is permissible with the ACCN, MOLWT, and
SLEN fields. The boolean terms are AND, OR, NOT — they must be in upper case and can be combined with brackets
( ‘(’,‘)’ ) to clarify meaning.
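The same kind of fielded query can also be submitted programmatically through NCBI’s E-utilities. A minimal sketch
using Biopython’s Entrez module (assuming Biopython is installed; the e-mail address is a placeholder that NCBI’s usage
policy requires you to replace with your own):

from Bio import Entrez

Entrez.email = "[email protected]"   # placeholder -- use your own address

handle = Entrez.esearch(
    db="nucleotide",
    term="(adh OR mdh) AND Drosophila[ORGN] AND 1000:5000[SLEN]",
    retmax=20,
)
result = Entrez.read(handle)
handle.close()

print(result["Count"])    # total number of matching entries
print(result["IdList"])   # identifiers of the first 20 matches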
The search fields for Entrez PubMed are slightly different from these and are shown in Table 4.4. The boolean operators
(“AND”, “OR”, etc.) are the same and all of these can be combined to yield a highly structured query. The documentation
for PubMed can be found at http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html.
The ENTREZ program is also released as a standalone program, free of charge, and you can download it to your computer
from the NCBI ftp site.
DATALIB gb
TITLES
MAXDOCS 30
BEGIN
BOVPRL
J02459 [ACC]
The DATALIB must be gb or genbank (GenBank), gbu or gbupdate (only updates since the last release), gbonly (official
release only), emb or embl (EMBL), emblu or emblupdate (only updates since the last release), emblonly (full release
only), sp or swiss or swissprot (Swiss-Prot), spu or swissprotupdate (updates only), pir (PIR database), omim (OMIM),
vector (vector sequences), gp or genpept (translated GenBank), gpu or gpuupdate (updates only), kabatnuc (immunological
nucleotide sequences), kabatpro (immunological protein sequences), and, though not stated in the official documentation,
MEDLINE also works.
TITLES will display only the title of the matching record. MAXDOCS/MAXLINES restrict the volume of returns. Only
DATALIB and BEGIN are mandatory. The above will search for records with LOCUS titles “BOVPRL” or accession
number J02459. (NOTE: to put an underscore in the search, enclose the locus name in double quotes).
This retrieval service permits boolean searches. A logical OR is the implied default - as above, BOVPRL or J02459. But
a logical AND and a logical NOT can be added to the query. Hence, “BOVPRL AND J02459” will retrieve records with
both BOVPRL and J02459 in the record. The queries can be constructed with parenthesis to group items and with asterisks
to match anything. For example, “(lysine OR glutamine) NOT vitelloge*”. The field restrictor [ACC] restricts J02459 to
be located in the accession number field. The field restrictors (the three letter codes) for the major databases are:
get nuc:pip03xx
get nuc:x03392
get prot:kap_yeast
This will get the sequences with accession numbers pip03xx and x03392 from the nucleotide databases and the protein
sequence with locus name kap_yeast from the protein database.
4.8.4 Others
There are, again, many other database access tools. For example, there is the DBGET system. This is run out of Japan
(the Supercomputer Laboratory (SCL) in Kyoto and the Human Genome Center (HGC) in Tokyo). Once again, this search
engine can find relevant data from several databases, including:
Each of the databases can have individual access tools that can provide more specialized access. For example, the PDB
database supports viewing of protein structures via VRML (virtual reality modelling language), Rasmol (a freely available
program for displaying molecules in three dimensions), FirstGlance and Protein Explorer (two other programs that require
a commercial product), and via still photographs. There is also a special browser for the SWISS 3DIMAGE database and
so on.
For each database, look for a specialized browser to access the data making use of the peculiarities of the data stored.
4.9 Reliability
The data within the databases may not always be what it pretends to be. This venture is a human one and humans make
mistakes. Indeed, the venture relies on the contributions of many people and they all have different standards of accuracy.
One of the most common errors in the early days was the presence of vector sequence in the midst of some other sequence.
Today this is not such a large problem since most entries are now automatically screened against known vectors and the
error can be caught before the sequence makes it into the databases.
Smaller errors in sequences are also common. The human APRT gene sequence was determined and entered into the
database by one laboratory. A few months later, another laboratory published a paper with a sequence that differed from
the previous work by 13 nucleotides and 60 insertions/deletions over 3 kb. It is impossible to tell how much of this may be
due to polymorphism and how much may be due to actual sequencing error. Because this kind of duplication is not done
for every sequence, it is impossible to say whether this is typical or atypical of the sequencing done. However, as a counter
example, a check of the yeast genome revealed only a couple of differences over many megabases (H. Bussey - personal
communication).
Unfortunately, many errors are not easily corrected. Current policy for most of the databases is that the people running
them are responsible for the database en masse while the actual data is the business of the researchers. Hence the databases
are meant to act as an archive and unless the original author corrects their data, it will remain archived in the database.
On a more positive note, you will also find many entries that were created long ago and yet were last modified very
recently to incorporate the latest information. Furthermore, many of the databases devoted to one organism will gather and
carefully curate the data.
Still another problem that has shown up is trivial data entries. The following entry was noticed by Reinhard Doelz.
Database Silliness
This is truly an amazing entry. It is fully six nucleotides long, it comes from an unknown source, it comes from an
unknown organism and from unclassified sequences. But it is patented!! What could possibly be the purpose of entering
this sequence in the database and, even more incredibly, why would one ever patent it? By random chance your DNA
must contain this sequence. Since it does and the sequence was patented, be forewarned that you should obtain written
permission from the patent holders before you replicate it again. More seriously, if you construct oligos for PCR
or sequencing, you are probably guilty of patent infringement. Reinhard calculated that this silly hexanucleotide occurs
28340 times within the database and in over 70000 sequences (circa 1993). This entry is perhaps extreme but there are
other, less extreme entries of equally doubtful quality.
The take home message from all of this is to look at the data with a critical eye. The actual quantity and type of errors
within the databases are not known - some researchers are very careful and can check their sequence data, for others a
double check of the sequence data may not be possible. When doing your own research, assume that the sequence may
contain some errors and take measures to prevent this from destroying the validity of your conclusions.
Caveat Emptor
Chapter 5

Sequence Formats
There are many formats that sequence data can be presented in. Each has advantages over the others (e.g. some are small
and compact; others contain lots of information) and different programs require different formats as their input. The major
databases permit sequences to be stored on your local computer in more than one format and there are programs that will
convert one format to another. The most popular of these is a program called readseq by D.G. Gilbert (available for
UNIX, DOS and APPLE machines).
The GenBank and EMBL formats have been discussed above. Both the GenBank and EMBL formats are highly styl-
ized and strictly controlled to conform to consistent standards. Other popular formats are ASN.1, DNAStrider,
Fitch, GCG, GDE, HENNIG86, IG/Stanford, MSF, NBRF, NEXUS, PIR/CODATA, Pearson/Fasta, Phylip
- Interleaved, Phylip - Sequential, and Plain/Raw. I will not present all of them here but rather just a smattering.
Most formats will ignore case and this can therefore often be used to add information about the sequences. While the
GenBank and EMBL formats can contain the character ‘-’, they generally do not contain these characters and these formats
were not intended to convey the kind of information that includes homologous sites between multiple sequences (the dashes
indicate conceptual gaps in the sequences that have been inserted so that homologous parts of the sequence from each
species are in the same location).
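Conversions of the kind readseq performs can also be scripted. A minimal sketch using Biopython (assumed installed; the
file names are hypothetical):

from Bio import SeqIO

# Convert a GenBank-format file into a fasta-format file.
count = SeqIO.convert("entry.gb", "genbank", "entry.fasta", "fasta")
print("converted", count, "record(s)")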
5.1 Genbank/EMBL
As a quick review, these two formats would be . . .
/gene="APRT"
CDS join(46..125,256..362,1509..1642,1847..1925,2044..2186)
/gene="APRT"
/EC_number="2.4.2.7"
/note="purine salvage enzyme"
/codon_start=1
/product="adenine phosphoribosyltransferase"
/db_xref="PID:g881574"
/translation="MSESELKLVARRIRSFPDFPIPGVLFRDISPLLKDPDSFRASIR
LLASHLKSTHSGKIDYIAGLDSRGFLFGPSLAQELGVGCVLIRKQGKLPGPTISASYA
LEYGKAELEIQKDALEPGQRVVIVDDLLATGGTMFAACDLLHQLRAEVVECVSLVELT
SLKGRERLGPIPFFSLLQYD"
BASE COUNT 87 a 228 c 145 g 147 t
ORIGIN
1 cctgcggata ctcacctcct ccttgtctcc tacaagcacg cggccatgtc cgagtctgag
61 ttgaaactgg tggcgcggcg catccgcagc ttccccgact tccccatccc gggcgtgctg
121 ttcaggtgcg gtcacgagcc ggcgaggcgt tggcgccgta ctctcatccc ccggcgcagg
181 cgcgtgggca gccttgggga tcttgcgggg cctctgcccg gccacacgcg gtcactctcc
241 tgtccttgtt cccagggata tctcgcccct cttgaaagat ccggactcct tccgagcttc
301 catccgcctc ctggccagtc acctgaagtc cacgcacagc ggcaagatcg actatatcgc
361 agggcaaggt ggccttgcta ggccgtactc atcccccacg gtcctatccc ctatcccctt
421 tcccctcgtg tcacccacag tctaccccac acccatccat tctttcttta acctctgact
481 cttcctcctt ggtttctcac tgccttggac gcttgttcac cccggatgaa ctccgtaggc
541 gtctcccttc cctgcttggt accctaaggt gccctcggtg cttgttcgta gagacgaact
601 ctgctct
//
and
FT LASHLKSTHSGKIDYIAGLDSRGFLFGPSLAQELGVGCVLIRKQGKLPGPTISASYALE
FT YGKAELEIQKDALEPGQRVVIVDDLLATGGTMFAACDLLHQLRAEVVECVSLVELTSLK
FT GRERLGPIPFFSLLQYD"
XX
SQ Sequence 607 BP; 87 A; 228 C; 145 G; 147 T; 0 other;
CCTGCGGATA CTCACCTCCT CCTTGTCTCC TACAAGCACG CGGCCATGTC CGAGTCTGAG 60
TTGAAACTGG TGGCGCGGCG CATCCGCAGC TTCCCCGACT TCCCCATCCC GGGCGTGCTG 120
TTCAGGTGCG GTCACGAGCC GGCGAGGCGT TGGCGCCGTA CTCTCATCCC CCGGCGCAGG 180
CGCGTGGGCA GCCTTGGGGA TCTTGCGGGG CCTCTGCCCG GCCACACGCG GTCACTCTCC 240
TGTCCTTGTT CCCAGGGATA TCTCGCCCCT CTTGAAAGAT CCGGACTCCT TCCGAGCTTC 300
CATCCGCCTC CTGGCCAGTC ACCTGAAGTC CACGCACAGC GGCAAGATCG ACTATATCGC 360
AGGGCAAGGT GGCCTTGCTA GGCCGTACTC ATCCCCCACG GTCCTATCCC CTATCCCCTT 420
TCCCCTCGTG TCACCCACAG TCTACCCCAC ACCCATCCAT TCTTTCTTTA ACCTCTGACT 480
CTTCCTCCTT GGTTTCTCAC TGCCTTGGAC GCTTGTTCAC CCCGGATGAA CTCCGTAGGC 540
GTCTCCCTTC CCTGCTTGGT ACCCTAAGGT GCCCTCGGTG CTTGTTCGTA GAGACGAACT 600
CTGCTCT 607
//
In the Genbank format, sequence information is set aside with key words. The entire entry begins with the keyword
LOCUS at the beginning of a line and ends with //. Different features are set off with different keywords; the sequence
information itself with the keyword ORIGIN.
The EMBL format is similar but with two-letter codes at the beginning of each line to designate different features of the
entry (much easier to program). The entire entry begins with the key ID at the beginning of a line and ends with //.
5.2 FASTA
By far the simplest format is termed the fasta (also known as the Pearson format). This sequence format contains
the minimal amount of information. A fasta file will contain just a ‘>’ sign (at the beginning of a line) to indicate the
beginning of a new sequence and a word (phrase) to serve as the sequence title. The sequence information itself follows
immediately. No other information is stored within a fasta file. As an example, I will use a portion of the Mus pahari,
Mus spicilegus and Gerbillus campestris APRT gene sequences. These sequences would appear as . . .
ATCCCGTGTTCCC---------TTTTTCGTGTCACCCACACCCACCCCTC
CTTTCTCTGACACTCCCAAGTTCCCT----GTTCCTCTCTGCCTTGGTCC
CATATTCACCCCGGATGA-CTGCGGAGTCTCCCACCCTCTGACCTCTGCT
CTCAAAGC----------CTGTCCCTAC---TAGAGAGGAACTCTGCTCT
Note that although it is a simple format, sequence alignment information (more on this later) can be indicated by the dashes.
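Because the format is this simple, a reader takes only a few lines. A minimal Python sketch, assuming the plain layout
just described and a hypothetical file aprt.fasta holding the aligned sequences:

def read_fasta(path):
    """Return a {title: sequence} dictionary from a fasta file."""
    sequences, title = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                title = line[1:]           # the title follows the '>'
                sequences[title] = []
            elif title is not None:
                sequences[title].append(line)
    return {name: "".join(parts) for name, parts in sequences.items()}

for name, seq in read_fasta("aprt.fasta").items():
    print(name, len(seq), "columns")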
5.3 FASTQ
This FASTQ format specification is modified from http://maq.sourceforge.net/fastq.shtml.
FASTQ format stores sequences and Phred qualities in a single file (Phred quality scores are named after a popular
software package and have become the standard method to quantify the reliability of a base call). It is concise and
compact. FASTQ was first widely used at the Sanger Institute. Although Solexa/Illumina files look like FASTQ files, they
scale the quality scores differently. In the quality string, if you can see a character with an ASCII code higher than 90,
your file is probably in the Solexa/Illumina format. Just to make things more confusing, Illumina created a third version,
the “Illumina 1.3+ FASTQ” format.
An example from work done at McMaster,
@HWI-EAS038:8:1:8:697#0/1
AGACTGGCTGGAGCATGTCTATGACGGACTATGATG
+HWI-EAS038:8:1:8:697#0/1
aaa`[[`a`^[^U^_YPU[[`ZU^VSTZVX_TBBBB
@HWI-EAS038:8:1:8:1326#0/1
AGACTACCGTGTCGTCACGACACGGTCGACGACCAC
+HWI-EAS038:8:1:8:1326#0/1
a^a^\aa`\ZUZVPV\`SP\]aSPQSRNXWBBBBBB
@HWI-EAS038:8:1:8:1305#0/1
AGACTCGAAACGCCTTTCTGGAACACGAAAGGTCTC
+HWI-EAS038:8:1:8:1305#0/1
aXa`_^`aaa_W[\\^^^^VT]a_`[T^``WSW^W[
Each FASTQ record consists of four lines. The first line begins with an '@' symbol and is followed by the
sequence name. The second line contains the base calls (in this case, 36 nucleotides per read). The third line begins with
a '+' symbol and may (or may not) repeat the sequence name. The fourth line contains, for each base call on the second
line, one symbol encoding the quality score of that call. There should be one symbol for each base call. Another
read then follows with another four lines.
The symbol on the fourth line uses an ASCII character (American Standard Code for Information Interchange) to encode
the quality score. Part of an ASCII table is reproduced here.
Given a character q, the corresponding Phred quality score can be calculated with:
Q = ord(q) − 33
where ord() gives the ASCII code of a character. This is known as the fastq-sanger format.
Solexa/Illumina Read Format:
The syntax of the Solexa/Illumina read format (also known as fastq-solexa) is almost identical to the FASTQ format, but
the qualities are scaled differently. Given a character sq, the following gives the Phred quality Q:

Q = 10 log10(1 + 10^((ord(sq) − 64)/10))
Hence for the example given above, the first nucleotide in the first read (sequence name 'HWI-EAS038:8:1:8:697#0/1') is
an 'A' (the first character on the second line) with quality symbol 'a' (the first character on the fourth line). The ASCII
code for 'a' is 97. Therefore we can infer that these are Solexa/Illumina reads, and that the corresponding quality is
97 − 64 = 33 on the Solexa scale, which converts to a Phred quality of approximately 33.
The third variant of the FASTQ format was implemented by Illumina after they bought out Solexa and simply uses an
offset of 64,
Q = ord(q) − 64
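The three conventions can be decoded side by side. The sketch below applies each formula to the first quality string of the
example above:

import math

def phred_sanger(q):
    # fastq-sanger: ASCII offset 33
    return ord(q) - 33

def phred_illumina13(q):
    # Illumina 1.3+: ASCII offset 64
    return ord(q) - 64

def phred_solexa(q):
    # fastq-solexa: offset 64, but on the Solexa (odds-based) scale
    return 10 * math.log10(1 + 10 ** ((ord(q) - 64) / 10.0))

quality = "aaa`[[`a`^[^U^_YPU[[`ZU^VSTZVX_TBBBB"
print([phred_illumina13(q) for q in quality[:5]])        # [33, 33, 33, 32, 27]
print([round(phred_solexa(q), 1) for q in quality[:5]])  # nearly identical at high quality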
5.4 SAM

The SAM (Sequence Alignment/Map) format stores the alignments of reads against reference sequences, beginning with
header lines and followed by one line per aligned read. In the example below, I have added a newline in the middle of each
entry (between the read sequence and the base quality scores) for readability in this page format; it is not present in the
actual file.
Header lines begin with an "@" symbol and display information about the chromosomes to which the mapping has been
done (SN; sequence name). In this case the mapping is to two chromosomes designated "NC_009456" and "NC_009457"
with corresponding lengths (LN; sequence length). Other headers could include read groups (RG), the program used (PG)
or simply a comment (CO).
Each alignment is on a different line and each line has 11 mandatory fields in a specified order. Even if the information is
unknown or missing, the tab-delimited field must be present (with either a '0' or a '*' to indicate missing data).
The eleven mandatory fields and optional fields are . . .
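As a sketch, an alignment line can be split into the eleven mandatory fields named by the SAM specification (QNAME,
FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL); the read and reference names below
are invented.

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# One invented alignment line (fields are tab-delimited).
line = ("read001\t0\tNC_009456\t100\t60\t36M\t*\t0\t0\t"
        "AGACTGGCTGGAGCATGTCTATGACGGACTATGATG\t"
        "aaa`[[`a`^[^U^_YPU[[`ZU^VSTZVX_TBBBB")

fields = dict(zip(SAM_FIELDS, line.split("\t")))
print(fields["RNAME"], fields["POS"], fields["CIGAR"])
# Any optional fields would follow as additional TAG:TYPE:VALUE columns.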
The BAM format is a binary version of the SAM format. It is designed to improve data handling performance, to speed the
analysis and to reduce file sizes.
5.5 Stockholm

A Stockholm-format alignment file must begin with a line that declares the format and the version being used. Currently
it should be

# STOCKHOLM 1.0
This is then followed by either mark-up annotations or sequence alignments. The sequence alignments follow the format

<seqname> <aligned sequence>

where <seqname> is the sequence name and a line containing “//” indicates the end of the alignment. Sequence letters
may include any characters except whitespace. Gaps may be indicated by “.” or “-”. Wrap-around alignments are allowed
in principle, mainly for historical reasons, but are not used in e.g. Pfam. Wrapped alignments are discouraged since they
are much harder to parse. Hence this format is best adapted to protein sequences.
There are four types of alignment mark-up, indicated in the following manner:

#=GF <feature> <free text>                 per-file annotation
#=GS <seqname> <feature> <free text>       per-sequence annotation
#=GC <feature> <aligned text>              per-column annotation
#=GR <seqname> <feature> <aligned text>    per-residue annotation (one column per residue of one sequence)

Mark-up lines may include any characters except whitespace; use underscore (“_”) instead of space. Many different
“features” are recognized, or simply free text can be used. Some of the more interesting per-residue (#=GR) annotations,
all of which appear in the example below, are SS (secondary structure), SA (surface accessibility), AS (active site) and
IN (intron positions).
An example is,
# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 67
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246 MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRVPVYERS
#=GR O83071/192-246 SA 999887756453524252..55152525....36463774777
O83071/259-312 MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVAIVLDEY
#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEE
O31698/18-71 MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAIPVLDPS
#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH
O31698/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHH..HEEEEEEE....EEEEEEEEEEH
#=GC SS_cons CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH
O31699/88-139 EVMLTDIPRLHINDPIMK..GFGMVINN......GFVCVENDE
#=GR O31699/88-139 AS ________________*__________________________
#=GR O31699/88-139 IN ____________1______________2__________0____
//
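A minimal sketch of a reader for a non-wrapped Stockholm file such as the example above (the file name cbs.sto is
hypothetical):

def read_stockholm(path):
    """Collect sequences and per-residue (#=GR) mark-up from one alignment."""
    sequences, markup = {}, {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if not line or line == "//" or line.startswith("# STOCKHOLM"):
                continue
            if line.startswith("#=GR"):
                _, seqname, feature, text = line.split(None, 3)
                markup[(seqname, feature)] = text
            elif not line.startswith("#"):   # ignore #=GF, #=GS, #=GC here
                seqname, text = line.split(None, 1)
                sequences[seqname] = text
    return sequences, markup

sequences, markup = read_stockholm("cbs.sto")
print(sequences["O31698/18-71"])
print(markup[("O31698/18-71", "SS")])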
5.6 GDE
The GDE format can also contain alignment information but note that it may have an ‘offset’ value. This (often annoying)
feature permits a compact storage of sequence information at the tails of the sequence. An ‘offset’ of 36 means to insert
36 ‘-’ in front of the sequence in order to properly line it up with the other sequences. This format can also contain all
the information that is present in a GenBank format but does so simply as a ‘comment’ enclosed in quotation marks and
any information may appear within the comment field. The example Mus pahari, Mus spicilegus and Gerbillus campestris
APRT gene sequences in a GDE format would appear as . . .
{
name "MPU28721"
type "DNA"
longname Mus pahari
sequence-ID "U28721"
descrip "Mus pahari adenine phosphoribosyltransferase (APRT) gene, complete cds"
creator "Fieldhouse,D. and Golding,G.B."
offset 36
creation-date 1/31/98 14:18:24
direction 1
strandedness 1
comments "
NID g881573
KEYWORDS .
SOURCE shrew mouse.
Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Eutheria; Rodentia; Sciurognathi; Myomorpha; Muridae;
Murinae; Mus.
REFERENCE 1 (bases 1 to 2283)
TITLE Rates of substitution in closely related rodent species
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 2283)
TITLE Direct Submission
JOURNAL Submitted (07-JUN-1995) Dan Fieldhouse, Biology, McMaster
University, 1280 Main Street West, Hamilton, ON, L8S 4K1, Canada
FEATURES Location/Qualifiers
source 1..2283
/organism=‘Mus pahari‘
/db_xref=‘taxon:10093‘
gene join(46..125,256..362,1509..1642,1847..1925,2044..2186)
/gene=‘APRT‘
CDS join(46..125,256..362,1509..1642,1847..1925,2044..2186)
/gene=‘APRT‘
/EC_number=‘2.4.2.7‘
/note=‘purine salvage enzyme‘
/codon_start=1
/product=‘adenine phosphoribosyltransferase‘
/db_xref=‘PID:g881574‘
/translation=‘MSESELKLVARRIRSFPDFPIPGVLFRDISPLLKDPDSFRASIR
LLASHLKSTHSGKIDYIAGLDSRGFLFGPSLAQELGVGCVLIRKQGKLPGPTISASYA
LEYGKAELEIQKDALEPGQRVVIVDDLLATGGTMFAACDLLHQLRAEVVECVSLVELT
SLKGRERLGPIPFFSLLQYD‘
BASE COUNT 485 a 696 c 590 g 512 t
"
sequence "CCTGCGGATACTCACCTCCTCCTT
GTCTCCTACAAGCACGCGGCCATGTCCGAGTCTGAGTTGAAACTGGTGGCGCGGCGCATC
CGCAGCTTCCCCGACTTCCCCATCCCGGGCGTGCTGTTCAGGTGCGGTCACGAGCCGGCG
AGGCGTTGGCGCCGTACTCTCATCCC-CCGGCGCAGGCGCGTGGGCAGCCTTGGGGATCT
TGCGGGGCCTCTGCCCGGCCACACGCGG-TCACTCTCCTGTCCTTGTTCCCAGGGATATC
TCGCCCCTCTTGAAAGATCCGGACTCCTTCCGAGCTTCCATCCGCCTCCTGGCCAGTCAC
CTGAAGTCCACGCACAGCGGCAAGATCGACTATATCGCAGGGCAAGGTGGCCTTGCTAGG
CCGTACTCATCCCCCACGGTCCTATCCCCTATCCCCTTTCCCC-TCGTGTCACCCACAGT
CTACCCCACACCCATCCATTCTTTCTTTAACCTCTGACTCTTCCTCCTTGGTTTCTCACT
GCCTTGGACGCTTGTTCACCCCGGATGAACTCCGTAGGCGTCTCCCTTCCCTGCTTGGTA
CCCTAAGG----TGCCCTCGGTGCTTGTTCGTAGAGACGAACTCTGCTCT"
}
{
name "MSU28720"
type "DNA"
longname Mus spicilegus
sequence-ID "U28720"
descrip "Mus spicilegus adenine phosphoribosyltransferase (APRT) gene,"
creator "Fieldhouse,D. and Golding,G.B."
offset 15
creation-date 1/31/98 14:18:24
direction 1
strandedness 1
comments "
NID g881575
KEYWORDS .
SOURCE Steppe mouse.
Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Eutheria; Rodentia; Sciurognathi; Myomorpha; Muridae;
Murinae; Mus.
REFERENCE 1 (bases 1 to 2117)
TITLE Rates of substitution in closely related rodent species
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 2117)
TITLE Direct Submission
JOURNAL Submitted (07-JUN-1995) Dan Fieldhouse, Biology, McMaster
University, 1280 Main Street West, Hamilton, ON, L8S 4K1, Canada
FEATURES Location/Qualifiers
source 1..2117
/organism=‘Mus spicilegus‘
/db_xref=‘taxon:10103‘
gene join(67..146,278..384,1355..1488,1675..1753,1860..2002)
/gene=‘APRT‘
CDS join(67..146,278..384,1355..1488,1675..1753,1860..2002)
/gene=‘APRT‘
/EC_number=‘2.4.2.7‘
/note=‘purine salvage enzyme‘
/codon_start=1
/product=‘adenine phosphoribosyltransferase‘
/db_xref=‘PID:g881576‘
/translation=‘MSEPELKLVARRIRSFPDFPIPGVLFRDISPLLKDPDSFRASIR
LLASHLKSTHSGKIDYIAGLDSRGFLFGPSLAQELGVGCVLIRKQGKLPGPTVSASYS
LEYGKAELEIQKDALEPGQRVVIVDDLLATGGTMFAACDLLHQLRAEVVECVSLVELT
SLKGRERLGPIPFFSLLQYD‘
BASE COUNT 413 a 652 c 564 g 488 t"
sequence "TCGGGATTGACGTGAATTTAGCGTGCTGATACCTACCTCCTCCTT
GCCTCCTACACGCACGCGGCCATGTCCGAACCTGAGTTGAAACTGGTGGCGCGGCGCATC
CGCAGCTTCCCCGACTTCCCAATCCCGGGCGTGCTGTTCAGGTGCGGTCACGAGCCGGCG
AGGCGTTGGCGCCGTACGCTCATCCC-CCGGCGCAGGCGCGTAGGCAGCCTCGGGGATCT
TGCGGGGCCTCTGCCCGGCCACACGCGGGTCACTCTCCTGTCCTTGTTCCCAGGGATATC
TCGCCCCTCTTGAAAGACCCGGACTCCTTCCGAGCTTCCATCCGCCTCTTGGCCAGTCAC
CTGAAGTCCACGCACAGCGGCAAGATCGACTACATCGCAGGCGA--GTGGCCTTGCTAGG
CCGTGCTCGTCCCCCACGGTCCTAGCCCCTATCCCCTTTCCCCCTCGTGTCACCCACAGT
CTGCCCCACACCCATCCATTCTTTCTTCAACCTCTGACACTTCCTCCTTGGTTCCTCACT
GCCTTGGACGCTTGTTCACCCCGGATGAACTATGTAGGAGTCTCCCTTCCCTGCTAGGTA
CCCTAAGGCATCTGCCCTCGGTGCTTGTTCCTAGAGACGAACTCTGCTCT"
}
{
name "GCU28961"
type "DNA"
longname Gerbillus campestris
sequence-ID "U28961"
descrip "Gerbillus campestris adenine phosphoribosyltransferase (APRT) gene,"
creator "Yazdani,F. and Golding,G.B."
creation-date 1/31/98 14:18:24
direction 1
strandedness 1
comments "
NID g899456
KEYWORDS .
SOURCE Gerbillus campestris.
Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Eutheria; Rodentia; Sciurognathi; Myomorpha; Muridae;
Gerbillinae; Gerbillus.
REFERENCE 1 (bases 1 to 2076)
TITLE Rates of substitution in closely related rodent species
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 2076)
TITLE Direct Submission
JOURNAL Submitted (12-JUN-1995) Fariborz Yazdani, Biology, McMaster
University, 1280 Main Street West, Hamilton, Ont L8S 4K1, Canada
FEATURES Location/Qualifiers
source 1..2076
/organism=‘Gerbillus campestris‘
/db_xref=‘taxon:41199‘
gene join(81..160,289..395,1313..1446,1649..1727,1828..1970)
/gene=‘APRT‘
exon >81..160
/gene=‘APRT‘
CDS join(81..160,289..395,1313..1446,1649..1727,1828..1970)
/gene=‘APRT‘
/EC_number=‘2.4.2.7‘
/note=‘purine salvage enzyme‘
/codon_start=1
/product=‘adenine phosphoribosyltransferase‘
/db_xref=‘PID:g899457‘
/translation=‘MAEPELQLVARRIRSFPDFPIPGVLFRDISPLLKDPDSFRASIR
LLANHLKSKHGGKIDYIAGLDSRGFLFGPSLAQELGLGCVLIRKRGKLPGPTVSASYA
LEYGKAELEIQKDALEPGQKVVIVDDLLATGGTMCAACQLLGQLRAEVVECVSLVELT
SLKGREKLGPVPFFSLLQYE‘
intron 161..288
/gene=‘APRT‘
exon 289..395
/gene=‘APRT‘
intron 396..1312
/gene=‘APRT‘
exon 1313..1446
/gene=‘APRT‘
intron 1447..1648
/gene=‘APRT‘
exon 1649..1727
/gene=‘APRT‘
intron 1728..1827
/gene=‘APRT‘
exon 1828..>1970
/gene=‘APRT‘
BASE COUNT 385 a 666 c 577 g 448 t"
sequence "
CCTCCGCCCTTGTTCCTGGGACAGGCTTGACCCTAGCCAGTTGACACCTCACCTCCGCCC
TTCCTCT-CACGCACGCGGCCATGGCGGAACCCGAGTTGCAGCTGGTGGCGCGGCGCATC
CGCAGCTTCCCCGACTTCCCCATCCCGGGCGTGCTGTTCAGGTGCGTCCACGAGCCGCCC
AGGCGTTGGCGCTGCGTCCTCAGCCCTCCGGCGCAGGCGCGTGAGCTGTCTCCGGGATCT
TGCGGGGCCTCCGCCCAGCCATACCCAAGTCACCATCCTG----TGTTCCCAGGGATATC
TCGCCCCTCCTGAAAGACCCGGACTCCTTCCGAGCTTCCATCCGTCTCCTGGCCAACCAT
CTGAAGTCCAAGCATGGCGGCAAAATCGACTACATCGCAGGCGA--GTGTTCTTGCTAGG
CCGTGCCCGTTCCC-ACTGTCAGGGCCGCCATCCCGTGTTCCC---------TTTTTCGT
GTCACCCACACCCACCCCTCCTTTCTCTGACACTCCCAAGTTCCCT----GTTCCTCTCT
GCCTTGGTCCCATATTCACCCCGGATGA-CTGCGGAGTCTCCCACCCTCTGACCTCTGCT
CTCAAAGC----------CTGTCCCTAC---TAGAGAGGAACTCTGCTCT"
}
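Any program reading a GDE file must apply these offsets before the sequences can be compared column by column. A minimal sketch of that padding step (the helper name is mine, and the offsets and sequence fragments are abbreviated from the example above):

# Pad each GDE sequence with leading '-' according to its offset, and
# pad the tails so that every row in the alignment has the same length.
def apply_gde_offsets(entries):
    padded = ["-" * offset + seq for offset, seq in entries]
    width = max(len(s) for s in padded)
    return [s.ljust(width, "-") for s in padded]

rows = apply_gde_offsets([(36, "CCTGCGGATACTCACC"),
                          (15, "TCGGGATTGACGTGAATTTAGCGTG"),
                          (0,  "CCTCCGCCCTTGTTCC")])
for r in rows:
    print(r)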
5.7 NEXUS
The popular PAUP, MacClade and MrBayes programs (and others) use a NEXUS format (Maddison, Swofford and Maddison 1997. Syst. Biol., 46, 590-621). The primary feature of this format is its modularity. Files identify themselves with the key phrase "#NEXUS" at the beginning of the file. Each block of information begins with "BEGIN ---;" and ends with "END;". Comments can be enclosed within square brackets. For these sequences a simple translation would be
translation would be
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=3;
TAXLABELS MPU28721 MSU28720 GCU28961;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=650;
FORMAT MISSING=? DATATYPE=DNA INTERLEAVE GAP=-;
MATRIX
MPU28721 -------------------- ----------------CCTG CGGATACTCACCTCCTCCTT GTCTCCTACAAGCACGCGGC CATGTCCGAGTCTGAGTTGA
MSU28720 ---------------TCGGG ATTGACGTGAATTTAGCGTG CTGATACCTACCTCCTCCTT GCCTCCTACACGCACGCGGC CATGTCCGAACCTGAGTTGA
GCU28961 CCTCCGCCCTTGTTCCTGGG ACAGGCTTGACCCTAGCCAG TTGACACCTCACCTCCGCCC TTCCTCT-CACGCACGCGGC CATGGCGGAACCCGAGTTGC
;
END;
BEGIN TREES;
TREE tree1 = (MPU28721, (MSU28720,GCU28961));
TREE tree2 = (MSU28720, (MPU28721,GCU28961));
END;
BEGIN NOTES;
PICTURE TAXON=3 FORMAT=GIF SOURCE=FILE
PICTURE=a_rodent.gif
END;
The major blocks of data that the file format permits are TAXA, CHARACTERS, UNALIGNED, DISTANCES,
SETS, ASSUMPTIONS, CODONS, TREES and NOTES. Only a few of these are shown above and each permits many
other options. Note that the file format permits things such as the phylogeny (or tree) of a group of species to be stored,
pictures of the organisms to be stored or referenced, along with many other capabilities.
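Given this modularity, a reader can first split a NEXUS file into its named blocks and then hand each block to a specialized parser. A minimal sketch of the splitting step (the regular-expression approach and the file name are my own, not part of any NEXUS library):

import re

# Split a NEXUS file into its named blocks; block bodies are returned
# as raw text (comments in square brackets are stripped first).
def nexus_blocks(text):
    assert text.lstrip().upper().startswith("#NEXUS")
    text = re.sub(r"\[.*?\]", "", text, flags=re.S)
    blocks = {}
    for m in re.finditer(r"BEGIN\s+(\w+)\s*;(.*?)END\s*;",
                         text, flags=re.S | re.I):
        blocks[m.group(1).upper()] = m.group(2).strip()
    return blocks

blocks = nexus_blocks(open("aprt.nex").read())   # hypothetical file
print(sorted(blocks))   # e.g. ['CHARACTERS', 'TAXA', 'TREES']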
5.8 PHYLIP
The PHYLIP programs are also very popular and other programs have incorporated the sequence format used by these
programs. There are two formats that can be used, an interleaved and a sequential format. The phylip-interleaved
format begins with two numbers on the first line. The first number gives the number of taxa or different sequences in
the file. The second number gives the overall length of the sequences. On the next line the sequence information begins
preceded by a sequence title of no more than 10 characters. The APRT sequences in this format (interleaved) would be
3 650
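A minimal sketch of writing such a file (my own helper, not part of the PHYLIP package itself): names are truncated or padded to the required 10 characters and the sequences are emitted in 60-column interleaved blocks.

# Write a dictionary of equal-length sequences in phylip-interleaved
# format: a header line with the number and length of the sequences,
# then blocks of 60 columns; names appear only in the first block.
def write_phylip_interleaved(seqs, filename, width=60):
    names = list(seqs)
    length = len(next(iter(seqs.values())))
    with open(filename, "w") as out:
        out.write(f"{len(seqs)} {length}\n")
        for start in range(0, length, width):
            for name in names:
                prefix = name[:10].ljust(10) if start == 0 else " " * 10
                out.write(prefix + seqs[name][start:start + width] + "\n")
            out.write("\n")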
5.9 ASN
The Abstract Syntax Notation (asn) format is intended to be read by computer rather than humans. It was developed at
NCBI. It is included here to demonstrate the broad variety of sequence formats in use. For just the Mus spicilegus sequence
(complete entry) it would be
db "taxon" ,
tag
id 10103 } } ,
orgname {
name
binomial {
genus "Mus" ,
species "spicilegus" } ,
lineage "Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Eutheria; Rodentia; Sciurognathi; Myomorpha; Muridae; Murinae;
Mus" ,
gcode 1 ,
mgcode 2 } } } ,
pub {
pub {
gen {
serial-number 1 } ,
gen {
cit "Unpublished" ,
authors {
names
std {
{
name
name {
last "Fieldhouse" ,
initials "D." } } ,
{
name
name {
last "Golding" ,
initials "G.B." } } } } ,
title "Rates of substitution in closely related rodent species" } } } ,
pub {
pub {
gen {
serial-number 2 } ,
sub {
authors {
names
std {
{
name
name {
last "Fieldhouse" ,
initials "D." } } } } ,
imp {
date
std {
year 1995 ,
month 6 ,
day 7 } ,
pub
str "Dan Fieldhouse, Biology, McMaster University, 1280 Main
Street West, Hamilton, ON, L8S 4K1, Canada" } ,
medium other } } } } ,
seq-set {
seq {
id {
genbank {
name "MSU28720" ,
accession "U28720" } ,
gi 881575 } ,
descr {
title "Mus spicilegus adenine phosphoribosyltransferase (APRT) gene,
complete cds." ,
genbank {
source "Steppe mouse." ,
div "ROD" } ,
create-date
std {
year 1995 ,
month 6 ,
day 28 } ,
molinfo {
biomol genomic } } ,
inst {
repr raw ,
mol dna ,
length 2117 ,
seq-data
ncbi2na ’DA8F86E0FC9B9E317175D7E5D711919A53B58178BE01EBA669935927D56
1F54356A6E7BD2B9AD189698A6FA65B19D355699299B2925DAA37E6A97795A5119AB4775ED7EF5
4A8CDD9577E0215A1D7D627D4D65DFA52D1782D46449A42361C4D9298BA5F9CA5B9DB5546B5C95
7355FD55DBB4544B7954454D4F7F7D05DE11F5D7EBD74797E867EF455A381CECA2DD5F579CAC57
0A4DE576B9FBD722181DE77B5FBB5205297577FCA91027A524D784929EA215E8175238684D7E7C
AAC977A8E07231C00F2B05FAFAA6E9B97A9217425EB27D2A9EFDD54A1C45AA4DFD5FBD5D1109FB
BC041E7B717A7539789F4804572A49E0ED4528BB522A2AEA45522048BA57AC2E74A85121FF971F
47D73EB157A539D480F2A4ECECD7D51849C8E793F80AE90894532BA5789EF48292B28D5429E23A
5152C94D06F7DC9EB2D242172EF5C90BBE17654C7E97F23D5395749D4D5105F575F15C12B721D4
AA7D7BFA57D5C9D289EA6EA7BB9D35A012A82796A551EED25D73DDE8B3A82B0989EEEC8A0A92AD
F3469C52EDCA2C0EEAE7488AF884FAB4AFC45152019D8A72A2BA51FBD93721DDDF11C7D7B7929E
27A035202397C815A9222EB4FBA385D7A5128AC0814150840487D02A5295ED7AB9E1C24089F831
77F77B57D554A053BF9A5EE379E45275A9E0BAE8BBB897AE89E176782A4A88A7285CC5BDF7775D
4B3878A27A723AD11579D5249D4A079FAE9D254A65C2E17FB89C52595FFB8BBC0’H } ,
annot {
{
data
ftable {
{
data
gene {
locus "APRT" } ,
location
mix {
int {
from 66 ,
to 145 ,
id
gi 881575 } ,
int {
from 277 ,
to 383 ,
id
gi 881575 } ,
int {
from 1354 ,
to 1487 ,
id
gi 881575 } ,
int {
from 1674 ,
to 1752 ,
id
gi 881575 } ,
int {
from 1859 ,
to 2001 ,
id
gi 881575 } } } } } } } ,
seq {
id {
gi 881576 } ,
descr {
title "adenine phosphoribosyltransferase" ,
molinfo {
tech concept-trans } } ,
inst {
repr raw ,
mol aa ,
length 180 ,
seq-data
iupacaa "MSEPELKLVARRIRSFPDFPIPGVLFRDISPLLKDPDSFRASIRLLASHLKSTHSGKID
YIAGLDSRGFLFGPSLAQELGVGCVLIRKQGKLPGPTVSASYSLEYGKAELEIQKDALEPGQRVVIVDDLLATGGTMF
AACDLLHQLRAEVVECVSLVELTSLKGRERLGPIPFFSLLQYD" } ,
annot {
{
data
ftable {
{
data
prot {
name {
"adenine phosphoribosyltransferase" } ,
ec {
"2.4.2.7" } } ,
location
whole
gi 881576 } } } } } } ,
annot {
{
data
ftable {
{
data
cdregion {
frame one ,
code {
id 1 } } ,
comment "purine salvage enzyme" ,
product
whole
gi 881576 ,
location
mix {
int {
from 66 ,
to 145 ,
id
gi 881575 } ,
int {
from 277 ,
to 383 ,
id
gi 881575 } ,
int {
from 1354 ,
to 1487 ,
id
gi 881575 } ,
int {
from 1674 ,
to 1752 ,
id
gi 881575 } ,
int {
from 1859 ,
to 2001 ,
id
gi 881575 } } ,
xref {
{
data
gene {
locus "APRT" } } } } } } } }
5.10 PDB
Three-dimensional structural data are stored by the Protein Data Bank in its own flat file format. The flat file format (with many rows deleted — indicated by the dots in the center of a row) looks as follows . . .
........................
REMARK 2
REMARK 2 RESOLUTION. 2.3 ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.84
REMARK 3 AUTHORS : BRUNGER
REMARK 3
REMARK 3 DATA USED IN REFINEMENT.
REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 2.3
REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 15.
REMARK 3 DATA CUTOFF (SIGMA(F)) : 2.0
REMARK 3 DATA CUTOFF HIGH (ABS(F)) : 100000.0
REMARK 3 DATA CUTOFF LOW (ABS(F)) : 0.1
REMARK 3 COMPLETENESS (WORKING+TEST) (%) : 88.88
REMARK 3 NUMBER OF REFLECTIONS : 278049
REMARK 3
REMARK 3 FIT TO DATA USED IN REFINEMENT.
REMARK 3 CROSS-VALIDATION METHOD : THROUGHOUT
REMARK 3 FREE R VALUE TEST SET SELECTION : RANDOM
REMARK 3 R VALUE (WORKING SET) : 0.209
........................
REMARK 4
REMARK 4 2OCC COMPLIES WITH FORMAT V. 2.2, 16-DEC-1996
REMARK 6
REMARK 6 THIS ENZYME IS A MULTI-COMPONENT PROTEIN COMPLEX AND IS A
REMARK 6 HOMODIMER. EACH MONOMER IS COMPOSED OF 13 DIFFERENT
REMARK 6 SUBUNITS AND SIX METAL CENTERS: HEME A, HEME A3, CUA, CUB,
REMARK 6 MG, NA, AND ZN. THE SIDE CHAINS OF H 240 AND Y244 OF
REMARK 6 MOLECULES A AND N ARE LINKED TOGETHER BY A COVALENT BOND.
REMARK 6 THE ELECTRON DENSITY OF REGION FROM D(Q) 1 TO D(Q) 3,
REMARK 6 E(R) 1 TO E(R) 4, H(U) 1 TO H(U) 6, J(W) 59, K(X) 1 TO
REMARK 6 K(X) 5, K(X) 53 TO K(X) 54 AND M(Z) 41 TO M(Z) 43 IS
REMARK 6 NOISY AND THE MODEL OF THIS REGION HAS AMBIGUITY.
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION
REMARK 200 DATE OF DATA COLLECTION : MAY-1996
REMARK 200 TEMPERATURE (KELVIN) : 283
REMARK 200 PH : 6.8
........................
........................
SEQRES 1 A 514 MET PHE ILE ASN ARG TRP LEU PHE SER THR ASN HIS LYS
SEQRES 2 A 514 ASP ILE GLY THR LEU TYR LEU LEU PHE GLY ALA TRP ALA
SEQRES 3 A 514 GLY MET VAL GLY THR ALA LEU SER LEU LEU ILE ARG ALA
SEQRES 4 A 514 GLU LEU GLY GLN PRO GLY THR LEU LEU GLY ASP ASP GLN
SEQRES 5 A 514 ILE TYR ASN VAL VAL VAL THR ALA HIS ALA PHE VAL MET
SEQRES 6 A 514 ILE PHE PHE MET VAL MET PRO ILE MET ILE GLY GLY PHE
SEQRES 7 A 514 GLY ASN TRP LEU VAL PRO LEU MET ILE GLY ALA PRO ASP
SEQRES 8 A 514 MET ALA PHE PRO ARG MET ASN ASN MET SER PHE TRP LEU
........................
SEQRES 1 B 227 MET ALA TYR PRO MET GLN LEU GLY PHE GLN ASP ALA THR
SEQRES 2 B 227 SER PRO ILE MET GLU GLU LEU LEU HIS PHE HIS ASP HIS
SEQRES 3 B 227 THR LEU MET ILE VAL PHE LEU ILE SER SER LEU VAL LEU
SEQRES 4 B 227 TYR ILE ILE SER LEU MET LEU THR THR LYS LEU THR HIS
SEQRES 5 B 227 THR SER THR MET ASP ALA GLN GLU VAL GLU THR ILE TRP
SEQRES 6 B 227 THR ILE LEU PRO ALA ILE ILE LEU ILE LEU ILE ALA LEU
SEQRES 7 B 227 PRO SER LEU ARG ILE LEU TYR MET MET ASP GLU ILE ASN
........................
........................
........................
........................
........................
........................
........................
CONECT 26447 26353 26446
CONECT 26552 26256 26551
MASTER 370 0 18 98 30 0 2 9 28870 26 308 292
END
The file begins with the keywords, HEADER, TITLE, COMPND, SOURCE, KEYWDS, EXPDTA, that describe the
nature of the molecule to which this file pertains. The keywords AUTHOR and REVDAT give the authors responsible
for this file and its revision history. The REMARK keyword indicates descriptive entries about the molecular structure
(they are numbered according to their category). These remarks provide enormous detail regarding the structure. DBREF
supplies cross references to entries of this molecule in other databases. SEQRES is the beginning of the actual sequence
information of the molecule. Note the molecule can consist of multiple chains; in this case labelled chain A – chain Z. The
HET, HETNAM and FORMUL fields contain information about atoms/molecules that are associated with the molecule
in question (for HET, the fields here are a letter code for each “HET” atom(s), the letter identifying the chain, insertion
code, number of records with a HET entry, and some descriptive text), their chemical name (in this case a HEME group,
copper ion, ...) and their chemical formula. The HELIX, SHEET, and TURN (not shown above) give information about
the secondary structure of the molecule. Information about connections in the molecule are shown by SSBOND, LINK,
HYDBND, SLTBRG, and CISPEP (the last three not shown in the above structure).
And much more information is provided by other fields too numerous to list here. The business end is in the ATOM field.
This contains a numbered list of atoms (in this case 28,895 of them), the atom name, the (amino acid) residue name, the
chain identifier number, the residue sequence number, then three numbers that describe the x, y, z coordinates of this atom
in Angstrom units, an occupancy number, a temperature factor and finally an element symbol.
The TER field indicates the end of a section. The CONECT section provides further information on chemical connectivity.
The MASTER and END fields are used to describe the number of records of different types and to signal the end of the
file.
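Because the PDB flat file is a fixed-column format, ATOM records are parsed by character position rather than by splitting on whitespace. A minimal sketch using the standard ATOM column positions (the file name is hypothetical; HETATM records are ignored here):

# Extract coordinates from the ATOM records of a PDB flat file.  PDB
# is column-oriented, so fields are sliced by character position.
def read_atoms(filename):
    atoms = []
    with open(filename) as fh:
        for line in fh:
            if line.startswith("ATOM"):
                atoms.append({
                    "serial":  int(line[6:11]),
                    "name":    line[12:16].strip(),   # atom name
                    "resname": line[17:20].strip(),   # residue name
                    "chain":   line[21],              # chain identifier
                    "resseq":  int(line[22:26]),      # residue number
                    "x": float(line[30:38]),          # coordinates in
                    "y": float(line[38:46]),          # Angstrom units
                    "z": float(line[46:54]),
                })
    return atoms

atoms = read_atoms("2occ.pdb")                        # hypothetical file
print(len(atoms), "atoms read")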
Chapter 6
Sequence Alignment
The comparison of sequences can be done in many different ways. The most direct method is to make this comparison
via a visual means and this is what “dot plots” attempt to do. Dot plots are a group of methods that visually compare two
sequences and look for regions of close similarity between them.
The sequences to be compared are arranged along the margins of a matrix. At every point in the matrix where the two
sequences are identical a dot is placed (i.e. at the intersection of every row and column that have the same letter in both
sequences). A diagonal stretch of dots will indicate regions where the two sequences are similar. Done in this fashion a
dot plot as shown in Figure 6.1 will be obtained. This is a dot plot of the globin intergenic region in chimpanzees plotted
against itself (bases 1 to 400 vs. 1 to 300). The solid line on the main diagonal reflects the fact that every base of the sequence is trivially identical to itself. As can be seen, this dot plot is not very useful unless applied to protein sequences (where the background is much less dense); however, some statistical methods can still be applied to the results (Gibbs and McIntyre 1970, Eur. J. Biochem. 16:1).
Maizel and Lenk (1981, PNAS 78:7665) popularized the dot plot and suggested the use of a filter to reduce the noise
demonstrated in Figure 6.1. This noise is caused by matches that have occurred by chance. Because only four different nucleotides are possible, nucleotides will match other nucleotides elsewhere in the sequence even when no homology is present; such matches are not a true reflection of the similarities between the sequences but rather reflect the limited number of bases permitted in DNA sequences. There is a wide variety of filters that can be used; indeed they are limited only by your
imagination. The one suggested by Maizel and Lenk was to place a dot only when a specified proportion of a small group
of successive bases match. In Figure 6.2 the same dot plot is reproduced with a filter such that a window of 10 bases is
highlighted only if 6 of these 10 bases match. In Figure 6.3 the same plot is again shown with a filter of 8 out of 10 matches.
Note that these plots highlight the complete window while other programs might highlight a single point centered within the
window. Another common way to filter the matches is to give them a weight according to their chemical similarity (Staden
1982, Nuc. Acids Res. 10:2951).
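As an illustration, a filtered dot plot can be computed directly from this definition. The sketch below (my own, not any particular dot plot program) places a dot for a window only when the number of matching bases reaches the threshold; window=10 and threshold=6 corresponds to Figure 6.2, and window=1 with threshold=1 gives the unfiltered plot of Figure 6.1.

# Dot plot with a Maizel-Lenk style filter: record a dot at (i, j)
# only if at least `threshold` of the `window` bases starting at
# position i of seq1 and position j of seq2 match.
def dotplot(seq1, seq2, window=10, threshold=6):
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq1[i:i + window], seq2[j:j + window]))
            if matches >= threshold:
                dots.append((i, j))
    return dots

print(dotplot("GATTACAGATTACA", "GATTACA", window=5, threshold=4))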
The computational work involved with the generation of these matrices can be quite time consuming. If you are comparing
a sequence of length N with another sequence of length M, then the total number of windows for which matches must be
calculated is N × M . Hence the amount of work increases with the square of the sequence length. This rapidly becomes a
large number. For example, with N = 700 and M = 400, N × M = 280,000.
There is another way in which dot plots can be generated very quickly. This involves a computer method commonly known
as “hashing” (list-sorting). As mentioned previously, these methods are incorporated into the FASTA algorithms. Basically,
the idea is that instead of taking the complete matrix and calculating points for every entry in that matrix, a great saving
can be made if the algorithm searches only for exact matches. Hence, this method looks only for blocks of perfect identity.
The computational complexity of this algorithm grows linearly with increasing N.
The algorithm simply sub-divides the sequence into all “words” of a user specified block size. The same is done for the
alternate sequence. In addition, for both sequences the location of each word is also recorded. These arrays of “words”
are then sorted alphabetically and the arrays of locations are sorted in parallel with the "words". Then a comparison of the sorted array from one sequence with that from the other immediately gives the locations of all identical "words".
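In a modern language the sort-and-merge of the word lists is most conveniently expressed with a hash table (dictionary) of word positions, which accomplishes the same end. A minimal sketch (the names are mine):

from collections import defaultdict

# Find all identical "words" (blocks of perfect identity) of length k
# shared by two sequences; the dictionary of word positions plays the
# role of the sorted word and location arrays described above.
def word_matches(seq1, seq2, k=5):
    where = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        where[seq1[i:i + k]].append(i)
    hits = []
    for j in range(len(seq2) - k + 1):
        for i in where.get(seq2[j:j + k], ()):
            hits.append((i, j))          # one dot in the plot
    return hits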
An algorithm of this kind was used to generate the dot plots shown in Figure 6.4 for identity blocks of length 5. The rapidity of this method compared to the exact method can be demonstrated by the dot plot shown in Figure 6.5 (with identity blocks of length 6). This figure extends the sequences compared in the chimpanzee globin intergenic region from (1-400 vs 1-300) up to (1-4000 vs 1-3000). The length of time required for a plot of the small region by the exact method is not significantly shorter than the length of time it takes to calculate short identities on a 100-fold larger matrix.
The beauty of this method is demonstrated in Figure 6.6. This is a plot of all identities of length 6 between the chimpanzee
and spider monkey sequences in the same region. The evolutionary homology between these sequences is easily discernible
by the solid lines along the main diagonal despite the approx. 60 million years that separate these two groups. Furthermore, this is intergenic DNA with no known function to selectively maintain this homology (modulo an even more ancient eta-globin pseudogene). The insertion of some DNA is easily observed within the chimpanzee sequence and then a corresponding deletion further down. These correspond to the insertion of an Alu element in the chimpanzee (and human and other ape) sequences (at approx. bp 1000) and then the presence of a truncated L1 element in the spider monkey (inserted at approx. bp 2600) that is not present in the great apes. These events are difficult to find by a simple inspection of the actual sequences but are readily found by visual inspection of the dot plot.
A more distant similarity can be seen in Figure 6.7. This is a plot of the identities of length 6 between the same region of
the chimpanzee haemoglobin intergenic region and another intergenic region from the spider monkey. Note the similarity
(the short diagonal line) in the circled region. This region of similarity corresponds to the location of another Alu element
in the chimpanzee sequence.
There are many programs freely available to make dot plots. One which is particularly fast and interactive is the dotter
Figure 6.5: Identities of length 6bp. Chimpanzee hemoglobin intergenic DNA against itself.
Figure 6.6: Identities of length 6bp. Chimpanzee hemoglobin intergenic DNA against spider monkey.
Figure 6.7: Identity dot plot. Chimpanzee hemoglobin intergenic region vs. Spider Monkey unrelated intergenic region.
Figure 6.8: Human calmodulin protein sequence dot plotted against itself. Since the dotplot is symmetrical only the lower
half is shown. Also note the margin around the edge where a complete window could not be calculated.
Figure 6.9: Human epidermal growth factor protein sequence dot plotted against itself. Since the dotplot is symmetrical
only the lower half is shown.
Figure 6.10: Human globin region (zeta, psizeta, psialpha1, alpha2, alpha1) dot plotted against itself. Since the dotplot is
symmetrical only the lower half is shown.
Figure 6.11: Human zeta globin intergenic region expanded from Figure 6.10. Since the dotplot is symmetrical only the
lower half is shown.
program. Some other interesting dot plots are comparisons of the calmodulin (Figure 6.8) protein against itself and the
human epidermal growth factor (Figure 6.9) against itself. Both show internal repetitive elements. The neatest dot plot that
I have yet seen is the human zeta globin (Figure 6.10) region and if you zero in on the intergenic region (Figure 6.11) the
plot becomes fantastic (try to interpret this dot plot).
6.2 Alignments
The dot plots provide a useful way to visualize the sequences being compared. They are not very useful however in
providing an actual alignment between the two sequences. To do this, other algorithms are required.
First, a word on terminology. People in the field shudder when the terms similarity and homology are used indiscriminately.
Similarity simply means that sequences are in some sense similar and has no evolutionary connotations. Homology refers
to evolutionary related sequences stemming from a common ancestor.
The ability to calculate the correct alignment is crucial to many types of studies. It may, for example, alter from which part of a gene a segment appears to have been duplicated; it may alter the inferred number of point mutations; it may alter the inferred locations of deletions/insertions; it may alter the inferred distance between species; and it may alter the inferred phylogeny of the sequences, along with whatever evolutionary hypotheses depend on these phylogenies.
An explicit and precise algorithm is also required. For example one paper in the prestigious journal NATURE stated that
the alignment
------CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT---------
was optimal in the sense that gaps were inserted to maximize the number of base matches (the base matches are highlighted). They obviously did this alignment by eye and did not use an explicit algorithm. An alternate alignment (due to
Fitch 1984, Nature 309:410) is
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT
This alignment not only increases the number of base matches by 133 per cent, but also decreases the number of gaps by 50
per cent and reduces the number of gapped residues by 80 per cent. Hence, if the number of base matches can be increased
by reducing the number of gaps, then clearly the original author’s insertion of gaps did not maximize that number. Fitch
recommends that the authors change their statement to the assertion that gaps were introduced to increase the number of
base matches (rather than to maximize them). More generally this example shows the importance of i) using a well defined
algorithm and ii) of using a computer based algorithm to perform these calculations. Even alignments that may appear
simple and straightforward, if given to the computer, might yield alternatives that you did not consider.
A B C N J R Q C L C R P M
A 1 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
J 0 0 0 0 1 0 0 0 0 0 0 0 0
N 0 0 0 1 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 1 0 0 0 0 1 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
K 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
R 0 0 0 0 0 1 0 0 0 0 1 0 0
B 0 1 0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 1 0
A second pass can then be carried out, this time running left to right, top to bottom, to find the alignment that gives the maximum score.
The way to trace a score for all possible paths is shown in Table 6.2. For each element in the matrix you perform the following operation:
$$M_{i,j} \leftarrow M_{i,j} + \max\Big(\max_{l>j} M_{i+1,\,l},\; \max_{k>i} M_{k,\,j+1}\Big)$$
where k is any integer larger than i and l is any integer larger than j. In words, alter the matrix by adding to each element
the largest element from the row just below and to the right of that element and from the column just to the right and below
the element of interest. This row and column for one element are shown in Table 6.2 by boxes. The number contained in
each cell of the matrix, after this operation is completed, is the largest number of identical pairs that can be found if that
element is the origin for a pathway which proceeds to the upper left.
We wish to have an alignment which covers the entire sequence. Hence, we can find on the upper row or on the left column
the element of the matrix with maximum value. An alignment must begin at this point and can then proceed to the lower
right. This is the second pass through the matrix. At each step of this pass, starting from the maximum, one moves one
row and column to the lower right and finds the maximum in this row or column. The alignment must proceed through this
point.
Continuing in this fashion one eventually hits either the bottom row or the rightmost column and the alignment is finished.
This tracing pattern is shown in Table 6.3. Note that in this case the optimal alignment is not unique. There are two
alignments and both give the optimal score of 8 matches.
These two alignments can be written in more familiar form as either
ABCNJ-RQCLCR-PM
* * * * * ** *
AJC-JNR-CKCRBP-
or as
A B C N J R Q C L C R P M
A 1 0 0 0 0 0 0 0 0 0 0 0 0
J 0 0 0 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
J 0 0 0 0 1 0 0 0 0 0 0 0 0
N 0 0 0 1 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 1 4 3 3 2 2 0 0
C 3 3 4 3 3 3 3 4 3 3 1 0 0
K 3 3 3 3 3 3 3 3 3 2 1 0 0
C 2 2 3 2 2 2 2 3 2 3 1 0 0
R 2 1 1 1 1 2 1 1 1 1 2 0 0
B 1 2 1 1 1 1 1 1 1 1 1 0 0
P 0 0 0 0 0 0 0 0 0 0 0 1 0
ABC-NJRQCLCR-PM
* * * * * ** *
AJCJN-R-CKCRBP-
both with 8 asterisks to denote the 8 matches. Note that in this particular case, gaps are given the same penalty as a
mismatch. They simply do not add to the score.
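To make the two-pass procedure concrete, here is a minimal sketch of the scoring passes (my own illustration; the traceback that writes out the actual alignment is omitted). It uses the scoring of the example above, 1 per match with gaps and mismatches contributing nothing, and should reproduce the optimal score of 8.

# Needleman-Wunsch (1970) style scoring: build the identity matrix,
# then add to each cell the best cell reachable in the row just below
# and the column just to the right, working up and to the left.
def nw_score(s, t):
    n, m = len(s), len(t)
    M = [[int(s[i] == t[j]) for j in range(m)] for i in range(n)]
    for i in range(n - 2, -1, -1):
        for j in range(m - 2, -1, -1):
            M[i][j] += max(max(M[i + 1][l] for l in range(j + 1, m)),
                           max(M[k][j + 1] for k in range(i + 1, n)))
    # a full-length alignment starts from the largest value found on
    # the top row or the left column of the transformed matrix
    return max(max(M[0]), max(row[0] for row in M))

print(nw_score("ABCNJRQCLCRPM", "AJCJNRCKCRBP"))   # expect 8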
The Needleman-Wunsch algorithm creates a global alignment. That is, it tries to take all of one sequence and align it
with all of a second sequence. Short and highly similar subsequences may be missed in the alignment because they are
outweighed by the rest of the sequence. Hence, one would like to create a locally optimal alignment. The Smith and
Waterman (1981, J. Mol. Biol. 147:195-197) algorithm finds an alignment that determines the longest/best subsequence
pair that gives the maximum degree of similarity between the two original sequences. This means that not all of each sequence need end up in the alignment.
The Smith-Waterman algorithm is very similar to the Needleman-Wunsch algorithm and again finds the best subsequence
matches and builds upon these. Three conceptual changes are required,
• Mismatches must be given negative scores (or, when gap penalties are in use, scores that let the total decline), so that the score drops as mismatches accumulate.
• No element of the matrix is allowed to fall below zero, and the first row and column are initialized to zero, so that each pathway begins fresh.
• The beginning and end of an optimal path may be found anywhere in the matrix - not just the last row or column.
The first point is required to cause the score to drop as more and more mismatches are added. Hence, the score will rise in
a region of high similarity and then fall outside of this region. If there are two segments of high similarity then these must
be close enough to allow a path between them to be linked by a gap or they will be left as independent segments of local
similarity. In general the Smith-Waterman algorithm includes gap penalties (to be discussed in section 6.4) and if this is
the case, then the mismatch penalties are not required to be negative (to retain simplicity here, I have assumed a negative
mismatch penalty). Either way, the essence of a local alignment is that the score must decline.
The second point is required so that each pathway begins fresh at its beginning. Thus each short segment of similarity
should begin with a score of zero (and hence the matrix is initialized with the first row and column equal to zero). The third
point indicates that the entire matrix must be searched for regions with high local similarity.
For each element in the matrix you perform the following operation:
$$H_{i,j} = \max\Big(0,\; H_{i-1,j-1} + s_{i,j},\; \max_{k\ge 1}\big(H_{i-k,j} - W_k\big),\; \max_{k\ge 1}\big(H_{i,j-k} - W_k\big)\Big)$$
By tradition, the matrix is filled out left to right, top to bottom (the opposite direction of Needleman and Wunsch). Here, $s_{i,j}$ is the score for characters i and j (they either match or mismatch). The terms $W_k$ are the gap penalties for a gap of length k ending at either i or j. Similarly, gap penalties can be added to the Needleman-Wunsch method.
As an example the previous alignment can be reproduced with a score of 1.0 for a match, -0.5 for a mismatch and a gap penalty of $W_k = 0.5k$ for a gap of length k (each gapped residue contributing -0.5 to the score). The matrix will then be as given in Table 6.4. In this case largely the same alignment is found. However, the Smith-Waterman algorithm ends at the maximum score and begins at the first non-zero score. In this case it includes the same ambiguity in the alignment but it ends with the alignment of the two P's. The M is not part of this alignment. More generally, large chunks of each sequence may be missing from a local alignment (as in the alignments presented by BLAST) as opposed to a global alignment.
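A minimal sketch of this recurrence (my own; linear gap costs only, with the traceback omitted) using the values above - match 1.0, mismatch -0.5, and 0.5 per gapped residue:

# Smith-Waterman local alignment score with linear gap costs
# (W_k = gap * k).  H has an extra zero row and column so that every
# local path starts fresh from a score of zero.
def sw_score(s, t, match=1.0, mismatch=-0.5, gap=0.5):
    n, m = len(s), len(t)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1]
                                      else mismatch)
            H[i][j] = max(0.0, diag,
                          H[i - 1][j] - gap, H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best

print(sw_score("ABCNJRQCLCRPM", "AJCJNRCKCRBP"))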
It is seldom the case that the Smith-Waterman and the Needleman-Wunsch algorithms give the same answer. For example, a global and a local alignment of TTGACACCCTCCCAATTGTA versus ACCCCAGGCTTTACACAT give
TTGACACCCTCC-CAATTGTA
:: :: :: :
ACCCCAGGCTTTACACAT---
---------TTGACACCCTCCCAATTGTA TTGACAC
:: :::: or :: ::::
ACCCCAGGCTTTACACAT----------- TTTACAC
respectively. The global alignment has considered a penalty for the end gaps but the local alignment has simply searched
for the best substrings that can be put together.
If the sequences are not known to be homologous throughout their entire length, a local alignment should be the method
of choice. Sometimes the two methods will give similar answers but if the homology is distant, a local alignment will be
more likely to find the remaining patches of homology.
THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
:: :.. . .. ...: : ::::.. :: . : ...
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------
Depending on your intuition you may or may not think this to be a pretty good alignment particularly at the amino-terminus.
There are a total of 12 exact matches and 14 conservative substitutions. But there is obviously no homology between these
two “sentence” sequences. How do we test whether or not an alignment is significant?
As a more biological example, consider the alignment of human alpha haemoglobin and human myoglobin. If you remember your basic biology, you should remember that these two proteins perform the similar functions of transporting oxygen in the blood and muscle respectively. But are they evolutionarily related? An alignment of the two looks like ...
Human alpha haemoglobin (141 aa) vs. Human myoglobin (153 aa)
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQ
:: .. : ..::::.:. ..:.:.: :.: . :.: . : .: .:. ..:..
GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
VKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
.: :: .: .::.. . . .. .....:.. :: : .. ....:.:.. .:... :
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHP
AEFTPAVHASLDKFLASVSTVLTSKYR------
..:.........: :. .. ..:.:.
GDFGADAQGAMNKALELFRKDMASNYKELGFQG
Again, this looks like a reasonably good alignment. Or how about chicken lysozyme and bovine ribonuclease? An alignment of these gives
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS
: . :: ..:. .:. . . .. :.....:. :.. . ... .. .. ....
KETA----AAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQA
RWWCNDGRTP--GSRNLCNIPCSALLSSDITASVNCAKKIVSDGDGMNAWVAWRNRCKGT
:.. ... .... : ..:.. .: .. ...: .. .. .: :.:.
V--CSQKNVACKNGQTNCYQSYSTMSITDCRET-GSSKYPNCAYKTTQANKHIIVACEGN
DVQAWIRGCRL
. . ..
PYVPVHFDASV
Again a reasonable alignment or so it seems. How do you know which sequences (if any) are homologous?
A common and simple test to determine if the alignment of two sequences is statistically significant is to carry out a simple
permutation test. This consists of randomly permuting (shuffling) the residues of one of the sequences, realigning the shuffled sequence with the other, and recording the optimal alignment score. Doing this, say, 10,000 times gives a distribution of alignment scores that could be expected for random sequences with a
similar amino acid content. If the actual alignment has a score much higher than that of the permuted sequences, then you
know that they must be homologous to some extent.
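A sketch of the whole test, assuming some pairwise scoring function such as the sw_score sketch given earlier (the names here are mine):

import random

# Permutation test: shuffle one sequence (preserving its composition),
# realign, and record the score; repeat `trials` times and report the
# fraction of permuted scores at least as large as the observed one.
def permutation_test(seq1, seq2, score_fn, trials=10000):
    observed = score_fn(seq1, seq2)
    letters = list(seq2)
    null = []
    for _ in range(trials):
        random.shuffle(letters)
        null.append(score_fn(seq1, "".join(letters)))
    p = sum(s >= observed for s in null) / trials
    return observed, p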
A plot of 10,000 alignment scores for the human myoglobin and human alpha haemoglobin sequences is shown in Figure 6.12. The permuted scores range from 14 to 75 but most are less than 50. Also note the skewness of the distribution - statistics based on a normal distribution would be strongly biased. The skew is expected since in each case the alignment algorithm is trying to maximize the score. The score for the alignment of the two actual sequences is 179 (indicated by the
arrow). Obviously, myoglobin and haemoglobin are evolutionarily related and still retain many features of their homology.
This alignment has a probability of less than 0.0001 of occurring by chance alone.
A plot of 10,000 alignment scores for the chicken lysozyme and bovine ribonuclease sequences is shown in Figure 6.13. Again note the skew, and note that this "random" distribution is somewhat different from the haemoglobin "random" distribution. This is due to the differential effects of amino acid composition in these proteins. The permuted scores range from 14 to 72. The actual score for the proteins is 30 (indicated by the arrow). Obviously whatever homology once existed between these proteins has been completely destroyed by time.
These two examples are clear cut. There is a large grey area where the tests may be uncertain of the degree of homology
between sequences. For protein sequences Doolittle’s rule of thumb is that greater than 25% identity will suggest homology,
less than 15% is doubtful, and for those cases between 15-25% identity a strong statistical argument is required. Personally, I would prefer the statistical test in all cases, since such tests are easy to do and things such as internal repeats and unusual amino acid compositions can sometimes confuse the picture.
Gaps are usually penalized with a weight of the form
$$W_k = a + bk$$
where k is the length of the gap, a is the cost of opening the gap and b the cost of each residue it spans. Hence you can control whether many short gaps occur or whether long gaps occur but more infrequently (for example, with a = 10 and b = 1 a single gap of length three costs 13, while three separate gaps of length one cost 33). Deletions do occur, but when they do it is seldom as many small, short deletions but rather as fewer, longer deletions. These types of penalties are termed affine gap penalties.
How do you choose a gap penalty? Unfortunately, there is little knowledge to help here. Most of the tests done so far
depend on an empirical basis designed to achieve some end. For example, Smith and Fitch have derived (by exhaustive
search) gap penalties that will best align distantly related haemoglobin genes. But there is no guarantee that these values
would work well for the protein or (worse) the nucleotide sequence that you are interested in. Typical values are
but there is nothing special about these values other than the fact that they seem to work well for some of the common
comparisons. Note that in general a > b. This corresponds with biological knowledge of how gaps are generated - it is
easier to generate one gap of two residues rather than two gaps of one residue since the former can be created by a single
mutational event.
More recently, Reese and Pearson (2002, Bioinformatics 18:1500-1507) examined how these parameters might change as a function of the distance between aligned sequences. Their criterion was the "correct" identification of distant homologues. They found that b did not change but that a did change with distance. Again, through empirical tests they showed that optimal penalties were $a = 25 - 0.1 \times (\mathrm{PAM\ distance})$ (PAM is a method of measuring distance that will be explained in section 7.2.1) and $b = 5$, where these penalties are in 1/3 bit units (see section 10.5.1).
1. A gap and its length are distinct quantities. Different weights should be applied to each.
2. Weights for different mismatches should be permitted. A transition is more likely than a transversion; an Ile-Val change is more likely than an Ile-Arg change.
3. If the two sequences have no obvious relationship at their right and left ends, then end gaps should not be penalized.
4. Unless two sequences are known to be homologous over their entire length, a local alignment is preferable to a global alignment.
5. An optimal alignment is by no means necessarily statistically significant. One must make some estimate of the probability that a given alignment is due to chance.
6. An alignment demonstrates similarity, not necessarily homology. Homology is an evolutionary inference based on examination of the similarity and its biological meaning. Sequence similarity may result from homology but it may also result from chance, convergence or analogy.
Thorne, Kishino and Felsenstein (1991, J. Mol. Evol. 33:114-124) developed a maximum likelihood method to examine the possible paths of descent from a common ancestor for two sequences. The creation of gaps is modeled as a birth-death process with separate parameters for the birth rate and the death rate.
The model then finds the likelihood of particular paths through the matrix given the transition parameters. It then examines
alternative parameters and chooses that path and parameter set with the highest likelihood. The big difficulty with this
method is the enormous computer time required to carry out the calculations.
A related question is the assignment of weights to individual differences in nucleotide or protein sequences (more on this later). There have also been advances in methods to try to find statistically bounded sets of alignments - that is, the set of alignments that are within 95% confidence limits of some best answer. Again, this is another fertile area where many significant improvements are being made.
It is important to realize that an optimal alignment is optimal only for the particular values chosen for the mismatch and gap weights. When any of these are altered, the optimal alignment will also change. Also be aware of the fact that nature is seldom mathematically optimized. Fitch and Smith (1983, PNAS 80:1382) have derived a set of "rules of thumb", a subset of which are given in Table 6.5. Even with the very best programs it still requires some degree of experience to draw the right conclusions from the results produced, and a good grasp of the biology of the problem is essential.
: . : . : . : : :: . :
HUMAN SSDDIKETG-YTYILPKNVLKKFICISDLRAQIAGYLYGVSPPDNPQVKEIRCIVMVPQWGTHQTVHLPGQLP---QHEYLKEMEPLGWIHTQPNESPQLSPQDVTTHA 215
Mouse SSDDIKETG-YTYILPKNVLKKFICISDLRAQIAGYLYGVSPPDNPQVKEIRCIVMVPQWGTHQTVHLPSQLP---QHEYLKEMEPLGWIHTQPNESPQLSPQDVTTHA 215
Drosophila SSDDIKETG-YTYILPKNILKKFVTISDLRAQIAGYLYGVSPPDNPQVKEIRCIVMPPQWGTHQTINLPNTLP---THQYLKDMEPLGWIHTQPNELPQLSPQDITTHA 215
Anopheles SSDDIKETG-YTYILPKNVLKKFVTISDLRAQIAGYLYGVSPPDNPQVKEIRCIVMPPQWGTHQQINLPSSLP---AHQYLKDMEPLGWIHTQPNELPQLSPQDITTHA 215
C.elegans NSDDVKDTG-YTYILPKNILKKFITISDLRTQIAGFMYGVSPPDNPQVKEIRCIVLVPQTGSHQQVNLPTQLP---DHELLRDFEPLGWMHTQPNELPQLSPQDVTTHA 215
Dictyostel NSDNAKETGGFTYVFPKNILKKFITIADLRTQIMGYCYGISPPDNPSVKEIRCIVMPPQWGTPVHVTVPNQLP---EHEYLKDLEPLGWIHTQPTELPQLSPQDVITHS 217
Rice NSDDIKETG-YTYIMPKNILKKFICIADLRTQIAGFLYGLSPQDNPQVKEIRCIAIPPQHGTHQMVTLPANLP---EHEFLNDLEPLGWMHTQPNEAPQLSPQDLTSHA 215
SCHIZOSACC NSDNISETFPYTYILPQNLLRKFVTISDLRTQVAGYMYGKSPSDNPQIKEIRCIALVPQLGSIRNVQLPSKLPHDLQPSILEDLEPLGWIHTQSSELPYLSSVDVTTHA 219
PARAMECIUM NSDDIKQTG-FTYVLPKNILKKFISIADLKTQIAAYLYGISPPDNLQVKEIRAIVMIPQIGSRDNVTMPHQMP---DSEYLRNLEPLGWLHTQSTETMHLSTYDITLHA 215
TRICOMONAS PPPVVKPKL--ELIIPENIYRRFVEISDPYMQICGFLFGVKMNDTLQVISI---VIPPQNGDRDEIDFKQILP---NHDFLDGASPIGFIHTRVGENSSLEPRDAKVLA 210
LEISHMANIA DAAGATGAS-DQLIFSEDAIQKLLACCDVKVQCCAYMLGHALPDSPNIKEVLCVMIPPQFGTAVEARTPPRIPFDAAALQEANLSFLGLMRIGESE-AQLTSHDLALQ- 215
CRITHIDIA DAAGATGAS-DQLIFAEDTLQKLLACCDVKVQCCAYMFGHALPDSPNIKEVLCVMIPPQFGTAVESRTPPRIPFDAPALQEANLSFLGILRIGESE-AQLTSHDLALQ- 215
TRYPANOSOM DHSGVTGSS-DQLIFPQELLKILFPCFDVQAQFCAYLFGQTLPDSPNVKEVLCIMVPPQKSSAVEYTTPSCIPHDHPILTENHLSLLGVLRCSGGE-PSIHSRDVAIHG 216
Guilardia YTKFLIKQN--EIEYYCYILGKFCKHNTKCKLFIILKFQLGKTATNPLIYNNMNLFIKRFSFLGFFTEKTLFESNMKNILNNNQGIFCIFKKKYISWVTITK------- 215
ruler ..120.......130.......140.......150.......160.......170.......180.......190.......200.......210.......220....
Figure 6.14: An example of the output from ClustalX, a popular multiple sequence alignment program. The shaded
diagram at the base provides a measure of the similarity within each column.
How to string these local regions of similarity together and turn out an alignment is somewhat more complicated.
Bains (1986, Nuc. Acids Res. 14:159) suggested an iterative method which involves successive applications of the standard algorithms. It begins with a trial consensus alignment (say the alignment between sequences 1 and 2). Then the third sequence is aligned against the consensus sequence and a new consensus emerges. This continues until the consensus alignment converges to a global consensus. This type of method will be very dependent on the order in which the sequences are introduced. Thus a different alignment could arise using the same technique and the same sequences but in a different order.
One of the most popular multiple alignment programs begins with all pairwise alignments and is called Clustal. It was
written by Higgins and Sharp (1988, Gene 73:237; 1989, CABIOS 5:151). The alignments are done in four steps. In the
first step, all pairwise similarity scores are calculated. This is done using rapid alignment methods. The second step is to
create a similarity matrix and then to cluster the sequences based on this similarity using a cluster algorithm (see the section
9.2). The third step is to create an alignment of clusters via a consensus method. The final step is to create a progressive
multiple alignment. This is performed by sequentially aligning groups of sequences, according to their branching order
in the clustering. Three variants are currently in use: ClustalW, a companion program called ClustalX, and the older
ClustalV. An example of the output from ClustalX is shown in Figure 6.14.
Other methods make use of a multiple dimensional dot plot and then look for dots that are common to each group (Vingron
& Argos 1991 J.Mol.Biol. 218:33-43). Still others rely heavily on user input such as the popular windows program
MACAW (Schuler, Altschul & Lipman, 1991 Proteins Struct. Func. Genet. 9: 180-190). Others such as MSA (Gupta,
Kececioglu & Schaffer, 1995 J. Comput. Biol. 2:459-472) attempt to provide a near-optimal sum-of-pairs global solution
to the multiple alignment. Most of these programs attempt to find a solution such that some measure of the multiple
alignment is minimized (or maximized). Most however, can only provide a guess to the best solution. Kececioglu has
developed a new branch and bound algorithm that is guaranteed to converge to the true optimal solution (just no guarantees
on how long that will take). This whole area is ripe for major theoretical advances and for the creation of better interface
programs.
One obvious extension of these algorithms is to construct an alignment and a phylogeny for sequences all at the same time.
This is because the alignment will affect distances between sequences and this will affect the inferred phylogeny. Similarly
a different phylogeny will imply a different alignment of the sequences. Now you are talking about a chicken and egg
problem! Never-the-less, some progress has been made in this area. Jotun Hein has come up with a program TREEALIGN
which will do exactly this. It is available in the list from EMBL software given at ftp.ebi.ac.uk/pub/software/unix but again
it is a very slow program.
How well do they actually work? McClure, Vasi and Fitch (1994, Mol. Biol. Evol. 11:571-592) tested how well the different algorithms could detect and correctly align ordered functional motifs in several proteins. They used haemoglobin (5 motifs), kinase (9 motifs), aspartic acid protease (3 motifs), and ribonuclease H (4 motifs) proteins. They calculated the number of times (out of 100) that different algorithms correctly aligned these motifs for each protein. The results obviously depend on the divergence of the proteins, the number of sequences, the length of the motifs and the indel penalties, but were often disappointing. As an example, the results for just ClustalV with 6 sequences were (100, 92, 100, 100, 100), (100, 83, 67, 100, 100, 100, 100, 100, 100), (100, 0, 67) and (100, 67, 50, 50), respectively. Note that these motifs should be highly conserved and retain the most information enabling a correct alignment. ClustalV was one of the better algorithms but would still often miss these motifs.
When all is said and done, people will still find that the alignments produced by the programs can be improved by a
judicious and critical examination by eye. Spending time to slowly and carefully examine your alignments by hand is
recommended. Occasionally you might see an alignment that contains
A -
A -
- -
- -
- A
- A
when an obviously better alignment would be
A
A
-
-
A
A
But why should this be necessary?
Many algorithms make some use of a tree or phylogeny in the construction of the alignment. It is how this information is
used that can create some of these problems. If the nodes, S, containing the above deletion are central to the phylogeny,
e.g.
then insertions of the block ‘A’ must be made independently within each evolutionary branch. This will incur the same
penalty in the local alignment whether it is placed to the left or to the right.
Similarly you might see
A B −
A B −
− − A
− − A
− B −
− B −
instead of
A B
A B
A −
A −
− B
− B
when the phylogeny shown on the right of the diagram (in red) is used. For many algorithms these are situations where the "apparent" score changes little or not at all, and hence the algorithm will not recognize it as a possibility for improvement (pers. comm. John Kececioglu).
In addition to these problems, all algorithms that I know of treat the penalty applied to gaps (and mismatches) as a constant throughout the length of the sequence. Yet all biologists recognize that this is not the case, and understand that indels are more likely at the ends of a sequence and more likely in loop regions than in catalytic centers. We also have little idea of what appropriate quantitative levels for the gap penalties are. As a result, the alignments can always be improved by a careful examination. The algorithms can help with this task. We routinely do several automated alignments for every comparison (minimally one with the default penalties, one with more severe and one with less severe penalties) and then compare these by hand.
Chapter 7
Distance Measures
One of the most common measures used in computer algorithms for sequence analysis is some measure of the distance
between two sequences. For many methods it is absolutely critical to get an accurate measure of distance. Past studies
have shown that most algorithms that make use of a distance are not robust to small deviations in the distance matrices.
This problem is also related to weighting differences between sequences.
Why bother about corrections for distances? Consider the analogy of the difference between a Toyota Corolla and a Honda Civic. This is not the same difference as that between a Civic and a Mercedes. They are each different cars but there is a greater qualitative difference between them (minimally, a big difference exists in the price between a Civic and a Mercedes but less so between a Civic and a Corolla).

The same thing applies for sequences. Two sequences that differ by an A and a G do not have the same quality of difference as do two sequences that differ by an A and a T. The former substitution is a transition and can happen readily while the latter is a transversion and occurs far less frequently. Hence it would be desirable to weight or to treat these substitutions in a different fashion. There is no reason why we should have used 1 for a residue match and 0 for a mismatch in the section on alignments. You can use any value for these that you wish (and indeed 1 and 0 are particularly poor choices).

An example of just one interesting study (from among thousands) that uses genetic distance measures is shown in the Box. Not only is this a typical (and, in my opinion, fascinating) application, but distance measures are a basic groundwork for much that follows.

Box: The malaria parasite jumped hosts!
The figure (omitted here) is from Rich et al. 2009 and shows π, a measure of genetic distance due to polymorphisms within a species. The bars show π for three genes from (right to left) (i) Plasmodium falciparum alone, (ii) P. reichenowi alone, (iii) P. reichenowi + P. falciparum, and (iv) P. reichenowi + P. falciparum + P. vivax. Note that the genetic distance within P. reichenowi, a parasite that infects chimpanzees, is not increased by the addition of sequences from P. falciparum, the species that infects humans and causes malaria.
The authors reasoned that if the two malarial parasites cospeciated with humans and chimps they should be 5-7 million years old (this estimate is yet another application of distance measures). They went on to measure genetic distances for each gene and show that P. falciparum and P. reichenowi are too similar, and suggest they originated from 10,000 to 1 million years ago.
For each gene, cytB, clpC, and 18S rRNA, they estimated the best distance model as F81 + Γ, GTR + Γ, and HKY + I + Γ, respectively. It is these types of models we discuss in this section.
The simplest measure of the distance between two species' sequences is obtained by counting the number of nucleotide differences between the species. Let's consider how this difference, this measure
of distance changes over time. Figure 7.1 shows the difference expected between two sequences that have diverged at
increasing times into the past. The proportion of differences are calculated simply by counting the number of nucleotide
differences divided by the total length of the sequence. Hence,
D = k/n,
where n is the length of the sequence and k is the number of nucleotides that differ. In Figure 7.1, µ is rate of substitutions
for the sequences and t is the length of time since the last common ancestor of these sequences. The rate of change is
initially going up with a slope equal to twice what one might expect from the product of the mutation rate and time because
both sequences are diverging from a common ancestor. Figure 7.1 shows that as the time of divergence increases the
percent difference or the distance increases. Initially this occurs linearly however as time proceeds the measure of distance
begins to slow its increase and finally reaches an asymptote of 0.75 and ceases to increase at all.
This is quite reasonable when you think about it. There are only four types of nucleotides. A random collection based on
these four possibilities will have one quarter of them identical by chance alone. But this has lots of implications for the
distances that are calculated between species. A pair at time t = 20 are expected to have D = 7.6% and a pair at t = 40 are expected to have D = 14.4%. These can be easily distinguished. But a pair at t = 500 and one at t = 1000 have D's of 69.8% and 74.6%. These will be hard to tell apart. And yet, in both cases there is a doubling of the divergence between the species pairs.
The Jukes-Cantor correction for this saturation is
$$D_{JC} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$$
with variance
$$\mathrm{Var}(D_{JC}) = \frac{D(1-D)}{n\left(1 - \frac{4}{3}D\right)^{2}}.$$
Tajima suggested the alternative estimator
$$D^{*}_{JC} = \sum_{i=1}^{k} \frac{\left(\frac{4}{3}\right)^{i-1} k^{(i)}}{i\, n^{(i)}}$$
where $x^{(i)} = x(x-1)\cdots(x-i+1)$, with variance
$$\mathrm{Var}(D^{*}_{JC}) = D^{*}_{JC}\left(1 - D^{*}_{JC}\right)\exp\!\left(8D^{*}_{JC}/3\right)/(n-1).$$
Here k is the count of differences between the two sequences and n is the length of these sequences. This is actually just a different formulation of the same quantity, using a Taylor series expansion to avoid the logarithm. This estimator of distance is defined for all parameter values and actually has less bias than Jukes and Cantor's original correction for small levels of divergence. Tajima provides similar adjustments to all of the corrections noted below.
Note that this still does not correct for differences in the rates of transition and transversion. To do this you can use what is called the Kimura 2-parameter correction, a method established by Kimura (1980; J. Mol. Evol. 16:111-120) in which the rate of transitions is assumed to be α and the rate of transversions β. Then if the observed proportion of transitional differences is P and the observed proportion of transversional differences is Q, the estimate of distance is

$$D_{K2P} = -\tfrac{1}{2}\ln(1 - 2P - Q) - \tfrac{1}{4}\ln(1 - 2Q)$$
with variance

$$\mathrm{Var}(D_{K2P}) = \frac{c_1^2 P + c_3^2 Q - (c_1 P + c_3 Q)^2}{n},$$

where $c_1 = 1/(1-2P-Q)$, $c_2 = 1/(1-2Q)$ and $c_3 = \tfrac{1}{2}(c_1 + c_2)$. Again divergence follows a logarithmic function.
In this case you can also determine the rates of substitution via transitions and transversions separately. The rate of transition substitutions per site is

$$s = -\tfrac{1}{2}\ln(1 - 2P - Q) + \tfrac{1}{4}\ln(1 - 2Q)$$

and the rate of transversion substitutions per site is

$$v = -\tfrac{1}{2}\ln(1 - 2Q).$$
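A minimal Python sketch of these formulas (the helper names are mine; P and Q are the observed transition and transversion proportions, n the sequence length):

import math

def kimura_2p(P, Q):
    """Kimura 2-parameter distance plus separate transition (s) and
    transversion (v) rates per site."""
    w1 = 1.0 - 2.0 * P - Q
    w2 = 1.0 - 2.0 * Q
    d = -0.5 * math.log(w1) - 0.25 * math.log(w2)
    s = -0.5 * math.log(w1) + 0.25 * math.log(w2)
    v = -0.5 * math.log(w2)
    return d, s, v

def kimura_2p_var(P, Q, n):
    """Variance of the K2P distance using c1, c2, c3 as defined above."""
    c1 = 1.0 / (1.0 - 2.0 * P - Q)
    c2 = 1.0 / (1.0 - 2.0 * Q)
    c3 = 0.5 * (c1 + c2)
    return (c1 ** 2 * P + c3 ** 2 * Q - (c1 * P + c3 * Q) ** 2) / n

d, s, v = kimura_2p(0.10, 0.05)
print(d, s, v, d - (s + v))   # s + v equals d by construction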
Hasegawa, Kishino and Yano (1985, J. Mol. Evol. 22:160-174) suggested a model that Tamura and Nei (1993, Mol. Biol. Evol. 10:512-526) have extended. They suggest a model with a single rate of transversion, β, and separate rates of transition, α1 between purines and α2 between pyrimidines. They also consider mutation rates that yield the observed frequencies of A, T, C and G (g_A, g_T, g_C, g_G). In this case, it can be shown that the distance is

$$D_{TN} = -\frac{2g_Ag_G}{g_R}\ln\!\left(1-\frac{g_R}{2g_Ag_G}P_1-\frac{Q}{2g_R}\right) - \frac{2g_Tg_C}{g_Y}\ln\!\left(1-\frac{g_Y}{2g_Tg_C}P_2-\frac{Q}{2g_Y}\right) - 2\!\left(g_Rg_Y-\frac{g_Ag_Gg_Y}{g_R}-\frac{g_Tg_Cg_R}{g_Y}\right)\ln\!\left(1-\frac{Q}{2g_Rg_Y}\right),$$

where $g_R = g_A + g_G$, $g_Y = g_T + g_C$, and P_1, P_2 and Q are the proportions of transitions between A and G, transitions between T and C, and transversions, respectively. The variance has also been derived but is very complicated.
Other more complicated corrections are possible. For example, Felsenstein and Hasegawa have developed likelihood methods that find a maximum likelihood estimate of the distance between two sequences, with mutation rates estimated from the actual sequences. It has also been demonstrated that such maximum likelihood estimates of distance are much more accurate than log-transform estimates (Hoyle and Higgs 2003, Mol. Biol. Evol. 20:1-9).
Figure 7.3: Gamma densities f(x) for α = 0.1, 0.5, 1.0, 2.0, 5.0 and 10.0.

The gamma distribution with shape parameter α and rate parameter β has density

$$f(x) = \frac{\beta^{\alpha} x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)},$$

where

$$\Gamma(\alpha) = \int_{0}^{\infty} x^{\alpha-1} e^{-x}\,dx.$$
A plot of gamma distributions is given in Figure 7.3. In distance measures, the gamma is generally used with α set to µ²/Var(µ) and β set equal to α, where µ is the overall rate of substitution. This provides an interpretation such that α is the inverse of the squared coefficient of variation of the substitution rate among sites. Therefore, the smaller the parameter α, the greater the variation in substitution rate among sites. The distribution is completely determined by this mean rate of substitution and its coefficient of variation, so only one extra parameter is needed to generate a variety of distributional shapes. Since the mean of the gamma distribution is α/β, the mean will always be one in this case. Thus, for each of this variety of distributions, the mean relative rate per site is constant; but unless α is very large, some sites in the sequence will have rates well above the mean and some well below it.
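To see the roles of α and β concretely, here is a small Python sketch that draws per-site relative rates from a gamma with shape α and mean one (rate β = α, i.e. scale 1/α). The coefficient of variation comes out near 1/√α, so small α means highly variable rates:

import random, statistics

def site_rates(alpha, n_sites, seed=1):
    """Per-site relative rates from a mean-one gamma (shape alpha, scale 1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

for alpha in (0.1, 1.0, 10.0):
    rates = site_rates(alpha, 100000)
    mean = statistics.fmean(rates)
    cv = statistics.stdev(rates) / mean   # approximately 1 / sqrt(alpha)
    print(alpha, round(mean, 3), round(cv, 3))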
All of the distance measures discussed in previous sections can be corrected to include a gamma distribution of rates. For example, the Jukes-Cantor correction becomes

$$D^{\Gamma}_{JC} = \frac{3}{4}\alpha\left[\left(1-\frac{4}{3}D\right)^{-1/\alpha} - 1\right],$$

$$\mathrm{Var}(D^{\Gamma}_{JC}) = D(1-D)\left(1-\frac{4}{3}D\right)^{-2(1/\alpha+1)}/n.$$

In general it would be desirable to estimate the value of the gamma parameter α, and this can be done easily (given the above interpretation); it is done by some algorithms that you might run across. However, it is also very common for algorithms that include this correction to simply request a value for α from the user. Studies of many amino acid sequences have suggested that often α < 2, and one program package uses a default value of α = 1. A typical example of extreme variation would be the α = 0.47 that has been noted for some immunoglobulin genes (here much of the variation is probably due to differential selection). Values typical for your own applications will have to be calculated if you are using a program that requests supplied values.

The codes used!
As you read papers in the scientific literature about genetic distances you will see codes stating that the authors used, say, the "HKY + Γ + I" model. As explained in this chapter there are several models that can be used to estimate distances, ranging from simple to very complex. Since in general it is not good to over-parameterize a model, a simpler model is preferred if it adequately fits the data. To test this fit a series of hierarchical tests can be applied; these have been implemented by Posada & Crandall (1998). The models that they test include

JC     Jukes and Cantor (1969)
K80    Kimura (1980) (=K2P)
HKY    Hasegawa, Kishino, Yano (1985)
TN     Tamura and Nei (1993)
TNef   Tamura-Nei, equal frequencies
K81    Two transversion-parameter model 1 (=K3P) (Kimura, 1981)
K81uf  K81 with unequal frequencies
TIMef  Transitional model, equal frequencies
TIM    Transitional model
TVMef  Transversional model, equal frequencies
TVM    Transversional model
SYM    Symmetrical model (Zharkikh, 1994)
GTR    General time reversible (=REV) (Tavare, 1986)

In addition to these, the rates can be gamma distributed, Γ, and the model might include some sites that are considered invariant, I. Hence you arrive at codes such as "HKY + Γ + I": an HKY model with gamma distributed rates and invariant sites.
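A direct transcription of the gamma-corrected Jukes-Cantor formulas given above into Python (a sketch; not from any published package):

def jukes_cantor_gamma(D, alpha):
    """Gamma-corrected Jukes-Cantor distance."""
    return 0.75 * alpha * ((1.0 - (4.0 / 3.0) * D) ** (-1.0 / alpha) - 1.0)

def jukes_cantor_gamma_var(D, alpha, n):
    """Variance of the gamma-corrected distance."""
    return D * (1.0 - D) * (1.0 - (4.0 / 3.0) * D) ** (-2.0 * (1.0 / alpha + 1.0)) / n

# As alpha grows the correction approaches the ordinary Jukes-Cantor value.
print(jukes_cantor_gamma(0.3, 0.5), jukes_cantor_gamma(0.3, 1000.0))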
7.1.6 Synonymous - nonsynonymous substitutions
Substitutions that result in amino acid replacements are said to be nonsynonymous, while substitutions that do not cause an amino acid replacement (such as a change from a GGG codon to a GGC codon; both still encode glycine) are said to be synonymous. Because of the difference in their effects on the physiology of the organism, synonymous and nonsynonymous substitutions can have quite different dynamics. For example, synonymous substitutions usually occur at a much faster rate than do nonsynonymous substitutions. Hence, for coding sequences it is often desirable to separate the two.
The most common method to estimate these rates separately is via an algorithm set out by Li, Wu & Luo (1985; Mol. Biol. Evol. 2:150-174). It is somewhat complicated and I refer you to their paper for a complete description. Basically it counts the number of sites that are potentially 4-way, 2-way or 0-way degenerate (the third position of a glycine codon being 4-way degenerate, any second codon position being 0-way degenerate). It then counts the number of differences at each site of each category, keeping track of transversions and transitions. From these counts it calculates KS and KA, the rates of synonymous and nonsynonymous substitution, respectively. It has been found that KA can have large variation and great changes between and within specific organisms. On the other hand, KS is generally less variable (though it still shows more variation than would otherwise be predicted) and shows fewer changes between and within organisms.
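To make the notion of degeneracy concrete, here is a small Python sketch (my own construction, not the Li, Wu & Luo algorithm itself) that uses the standard genetic code to count how many of the three possible point changes at a codon position are synonymous:

from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")   # standard code, TCAG order
CODE = {a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_changes(codon, pos):
    """How many of the 3 possible substitutions at `pos` leave the amino
    acid unchanged (3 = fourfold degenerate, 0 = every change nonsynonymous)."""
    aa = CODE[codon]
    return sum(CODE[codon[:pos] + b + codon[pos + 1:]] == aa
               for b in BASES if b != codon[pos])

print(synonymous_changes("GGG", 2))   # glycine third position: 3 (4-way degenerate)
print(synonymous_changes("GGG", 1))   # second position: 0 (0-way degenerate)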
Distances can also be calculated for many other kinds of data, and different types of measures can be combined into one. Hence, distances can be used with restriction site data, with allozyme data, with data on quantitative characters, with DNA fingerprints or even with real fingerprints. Methods to correct this type of data are not well developed because these characteristics are not as well defined.
Even with amino acids, the corrections cannot be done easily and/or without some large bias. A Jukes-Cantor-style correction is possible. It is simply

$$D_{JC} = -\frac{19}{20}\ln\left(1-\frac{20}{19}D\right).$$
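In code this one-line correction reads (a sketch; D is the observed proportion of amino acid differences):

import math

def jukes_cantor_protein(D):
    """Jukes-Cantor-style correction for a 20-letter amino acid alphabet."""
    return -(19.0 / 20.0) * math.log(1.0 - (20.0 / 19.0) * D)

print(jukes_cantor_protein(0.5))   # about 0.71 substitutions per site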
But this assumes (as does the nucleotide Jukes-Cantor correction) that for all characters the rates of substitution from one amino acid to any other are equal and independent of the residue. This is not true of DNA and is even less true of proteins. Amino acids like cysteine and proline are very important for the structure and function of proteins. Amino acids such as tryptophan have bulky side groups and cannot be inserted easily into any site in a peptide. Because of this, most amino acid distances use empirical weighting schemes. The most popular of these empirical measures is the PAM family of matrices.
Table 7.1: The log odds matrix for PAM250 (multiplied by 10). The numbers in the lower left give the log odds. For the
diagram in the upper right, green/red circles are proportional to the odds of an interchange more/less likely than chance
alone.
C S T P A G N D E Q H R K M I L V F Y W
C 12 C
S 0 2 S
T −2 1 3 T
P −3 1 0 6 P
A −2 1 1 1 2 A
G −3 1 0 −1 1 5 G
N −4 1 0 −1 0 0 2 N
D −5 0 0 −1 0 1 2 4 D
E −5 0 0 −1 0 0 1 3 4 E
Q −5 −1 −1 0 0 −1 1 2 2 4 Q
H −3 −1 −1 0 −1 −2 2 1 1 3 6 H
R −4 0 −1 0 −2 −3 0 −1 −1 1 2 6 R
K −5 0 0 −1 −1 −2 1 0 0 1 0 3 5 K
M −5 −2 −1 −2 −1 −3 −2 −3 −2 −1 −2 0 0 6 M
I −2 −1 0 −2 −1 −3 −2 −2 −2 −2 −2 −2 −2 2 5 I
L −6 −3 −2 −3 −2 −4 −3 −4 −3 −2 −2 −3 −3 4 2 6 L
V −2 −1 0 −1 0 −1 −2 −2 −2 −2 −2 −2 −2 2 4 2 4 V
F −4 −3 −3 −5 −4 −5 −3 −6 −5 −5 −2 −4 −5 0 1 2 −1 9 F
Y 0 −3 −3 −5 −3 −5 −2 −4 −4 −4 0 −4 −4 −2 −1 −1 −2 7 10 Y
W −8 −2 −5 −6 −6 −7 −4 −7 −7 −5 −3 2 −3 −4 −5 −2 −6 0 0 17 W
C S T P A G N D E Q H R K M I L V F Y W
each term is divided by the frequency of the replacement residue. Hence, each term now gives the probability of replacement, j to i, per occurrence of residue j.

By tradition the log10 of this matrix is used as the weights (this is because calculating the odds for the whole matrix requires taking the product of the changes over all sites of the protein; before calculators it was easier to find the sum of the logs than the product). This log odds PAM250 matrix is shown in Table 7.1 (note also that the amino acids have been sorted according to their similarity in this matrix).
Residue pairs with scores above 0 replace each other more often as alternatives in related sequences than in random
sequences. This can be an indication that both residues can carry out similar functions. A score exactly equal to zero
indicates amino acid pairs that are found as alternatives at exactly the frequency predicted by chance. Residue pairs with
scores less than 0 replace each other less often than in random sequences and might be an indication that these residues are
not functionally equivalent.
Some of the properties that are visible from this matrix, and that go into its makeup, are size, shape, local concentration of electric charge, conformation of the van der Waals surface, and the ability to form salt bonds, hydrophobic bonds, and hydrogen bonds. Interestingly, these patterns are imposed principally by natural selection and only secondarily by the constraints of the genetic code. This tends to indicate that coming up with your own matrix of weights based on some logical features may not be very successful, because your logical features may have been overridden by other, more important biological considerations.
There are some problems with this measure of distance. One is that it assumes all sites are equally mutable, which is clearly false. Another is that, because the matrix was built by examining proteins with few differences, the highly mutable amino acids are over-represented. Lastly, due to the collection of proteins known at the time, the matrix is biased because it is based mainly on small globular proteins.
The expected frequency of each amino acid pair by chance is

$$e_{ij} = p_i^2 \;\text{ if } i = j, \qquad e_{ij} = 2p_ip_j \;\text{ if } i \neq j.$$

The odds matrix is $q_{ij}/e_{ij}$. Generally logs are taken of this matrix to give a log(odds) or lod matrix such that

$$s_{ij} = \log(q_{ij}/e_{ij}).$$
Table 7.2: The log odds matrix for BLOSUM62. The numbers in the lower left give the log odds, while in the diagram to the upper right, green/red circles are proportional to the odds of an interchange more/less likely than chance alone.
C S T P A G N D E Q H R K M I L V F Y W
C 9 C
S −1 4 S
T −1 1 5 T
P −3 −1 −1 7 P
A 0 1 0 −1 4 A
G −3 0 −2 −2 0 6 G
N −3 1 0 −2 −2 0 6 N
D −3 0 −1 −1 −2 −1 1 6 D
E −4 0 −1 −1 −1 −2 0 2 5 E
Q −3 0 −1 −1 −1 −2 0 0 2 5 Q
H −3 −1 −2 −2 −2 −2 1 −1 0 0 8 H
R −3 −1 −1 −2 −1 −2 0 −2 0 1 0 5 R
K −3 0 −1 −1 −1 −2 0 −1 1 1 −1 2 5 K
M −1 −1 −1 −2 −1 −3 −2 −3 −2 0 −2 −1 −1 5 M
I −1 −2 −1 −3 −1 −4 −3 −3 −3 −3 −3 −3 −3 1 4 I
L −1 −2 −1 −3 −1 −4 −3 −4 −3 −2 −3 −2 −2 2 2 4 L
V −1 −2 0 −2 0 −3 −3 −3 −2 −2 −3 −3 −2 1 3 1 4 V
F −2 −2 −2 −4 −2 −3 −3 −3 −3 −3 −1 −3 −3 0 0 0 −1 6 F
Y −2 −2 −2 −3 −2 −3 −2 −3 −2 −1 2 −2 −2 −1 −1 −1 −1 3 7 Y
W −2 −3 −2 −4 −3 −2 −4 −4 −3 −2 −2 −3 −3 −1 −3 −2 −3 1 2 11 W
C S T P A G N D E Q H R K M I L V F Y W
Hence if the observed number of differences between a pair of amino acids is equal to the expected number then sij = 0.
If the observed is less than expected then sij < 0 and if the observed is greater than expected sij > 0.
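As a toy illustration of this construction in Python (the two-letter alphabet and all frequencies below are invented purely for illustration; real BLOSUM matrices use rounded, scaled log2 odds over the 20 amino acids):

import math

def lod(q, p, i, j):
    """Log-odds s_ij = log2(q_ij / e_ij), with e_ij = p_i^2 if i == j
    and 2*p_i*p_j otherwise."""
    e = p[i] ** 2 if i == j else 2.0 * p[i] * p[j]
    return math.log2(q[(i, j)] / e)

p = {"A": 0.6, "B": 0.4}                                     # background frequencies
q = {("A", "A"): 0.45, ("A", "B"): 0.35, ("B", "B"): 0.20}   # observed pair frequencies
print(lod(q, p, "A", "A"))   # observed 0.45 > expected 0.36 -> positive score
print(lod(q, p, "A", "B"))   # observed 0.35 < expected 0.48 -> negative score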
All of this gives the BLOSUM matrix. Different levels of the BLOSUM matrix can be created by differentially weighting the degree of similarity between sequences. Sequences that belong to the same family (within a block) up to a critical level of similarity are clustered so that they are treated as a single entry. For example, the BLOSUM62 matrix is calculated from protein blocks such that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to sum to one. In this way the contributions of multiple entries of closely related sequences are reduced.

The BLOSUM62 matrix is given in Table 7.2. If the BLOSUM62 matrix is compared to PAM160 (its closest equivalent), it is found that the BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.
One of the significant disadvantages of the BLOSUM matrices is that they are not Markov chain matrices. Therefore Veerassamy et al. (2003, J. Comput. Biol. 10:997-1010) developed a probability transition matrix, based on the BLOSUM matrices, that can be used in a Markov chain model. This is implemented as the PBM model in the PHYLIP package of programs (see below).
Table 7.3: The log odds GONNET matrix. The numbers in the lower left give the log odds, while in the diagram to the
upper right, green/red circles are proportional to the odds of an interchange more/less likely than chance alone.
C S T P A G N D E Q H R K M I L V F Y W
C 11.5 C
S 0.1 2.2 S
T −0.5 1.5 2.5 T
P −3.1 0.4 0.1 7.6 P
A 0.5 1.1 0.6 0.3 2.4 A
G −2.0 0.4 −1.1 −1.6 0.5 6.6 G
N −1.8 0.9 0.5 −0.9 −0.3 0.4 3.8 N
D −3.2 0.5 0.0 −0.7 −0.3 0.1 2.2 4.7 D
E −3.0 0.2 −0.1 −0.5 0.0 −0.8 0.9 2.7 3.6 E
Q −2.4 0.2 0.0 −0.2 −0.2 −1.0 0.7 0.9 1.7 2.7 Q
H −1.3 −0.2 −0.3 −1.1 −0.8 −1.4 1.2 0.4 0.4 1.2 6.0 H
R −2.2 −0.2 −0.2 −0.9 −0.6 −1.0 0.3 −0.3 0.4 1.5 0.6 4.7 R
K −2.8 0.1 0.1 −0.6 −0.4 −1.1 0.8 0.5 1.2 1.5 0.6 2.7 3.2 K
M −0.9 −1.4 −0.6 −2.4 −0.7 −3.5 −2.2 −3.0 −2.0 −1.0 −1.3 −1.7 −1.4 4.3 M
I −1.1 −1.8 −0.6 −2.6 −0.8 −4.5 −2.8 −3.8 −2.7 −1.9 −2.2 −2.4 −2.1 2.5 4.0 I
L −1.5 −2.1 −1.3 −2.3 −1.2 −4.4 −3.0 −4.0 −2.8 −1.6 −1.9 −2.2 −2.1 2.8 2.8 4.0 L
V 0.0 −1.0 0.0 −1.8 0.1 −3.3 −2.2 −2.9 −1.9 −1.5 −2.0 −2.0 −1.7 1.6 3.1 1.8 3.4 V
F −0.8 −2.8 −2.2 −3.8 −2.3 −5.2 −3.1 −4.5 −3.9 −2.6 −0.1 −3.2 −3.3 1.6 1.0 2.0 0.1 7.0 F
Y −0.5 −1.9 −1.9 −3.1 −2.2 −4.0 −1.4 −2.8 −2.7 −1.7 2.2 −1.8 −2.1 −0.2 −0.7 0.0 −1.1 5.1 7.8 Y
W −1.0 −3.3 −3.5 −5.0 −3.6 −4.0 −3.6 −5.2 −4.3 −2.7 −0.8 −1.6 −3.5 −1.0 −1.8 −0.7 −2.6 3.6 4.1 14.2 W
C S T P A G N D E Q H R K M I L V F Y W
Their matrix is given in Table 7.3 and has been normalized to a PAM distance of 250. The matrix elements are ten times
the logarithm of the probability that the amino acids are aligned, divided by the probability that these amino acids would
be aligned by chance.
In addition they used these alignments to make an estimate of appropriate gap penalties. From this empirical data they suggest a relation for P, the probability of a gap of length k, in terms of k and the PAM distance of the sequences (the exact formulas are given in their paper). This relation would give the most accurate answer, but if the PAM distance is not available they suggest a fixed alternative.

How gaps are counted also matters when measuring distances. Consider the following three aligned sequences:
--GCAAACT
--GCAAGCC
ATGCTAGCC
Which pair of sequences has the smallest distance? If gaps are ignored then the second and third sequences are closest
with one difference. But if gaps are considered (and if each gapped position is counted as one) then the first and second
sequences are closest. If gaps are weighted differently then the answer might depend on the particular weighting.
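The point is easy to verify. A short Python sketch computing both counts for the three sequences above (each singly-gapped column counted as one difference; columns where both sequences have a gap are treated as identical):

def differences(a, b, count_gaps):
    k = 0
    for x, y in zip(a, b):
        if x == y:                       # identical residue, or shared gap
            continue
        if x == "-" or y == "-":         # gap in exactly one sequence
            k += count_gaps
        else:
            k += 1
    return k

seqs = ["--GCAAACT", "--GCAAGCC", "ATGCTAGCC"]
for i in range(3):
    for j in range(i + 1, 3):
        print(i + 1, j + 1,
              differences(seqs[i], seqs[j], False),   # gaps ignored
              differences(seqs[i], seqs[j], True))    # gaps counted

Ignoring gaps, sequences 2 and 3 are closest (one difference); counting each gapped column as one, sequences 1 and 2 are closest (two differences).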
Chapter 8
Database Searching
8.1.1 FASTA
To search through the whole genetic sequence database can take a great deal of time due to its enormous size. If some operation must be performed on each sequence in turn, this can take even longer. One such operation is to look throughout the whole database for homologous or similar sequences. To do this, special programs have been developed to speed the search. The first of these was FASTA, written by W.R. Pearson and D.J. Lipman (1988, PNAS 85:2444-2448).
It is possible to run this program on remote machines. The obvious choice for such a remote machine would be one that has
access to the latest sequence information. Both EMBL and DDBJ have permitted this type of access and have implemented
FASTA type searches through their machines (NCBI prefers to use BLAST - see below).
There are several flavours to FASTA: fasta scans a protein or DNA sequence library for sequences similar to a query
sequence. tfasta compares a protein query sequence to a translated DNA sequence library. lfasta compares two
query sequences for local similarity between them and shows the local sequence alignments. plfasta compares two
sequences for local similarity and plots the local sequence alignments. Two recent flavours fastx and fasty (Pearson et
al. 1997 Genomics 46:24-36) permit comparison of a DNA sequence translated in all six frames to the protein databases.
The ‘x’ form takes a DNA query sequence and translates it in all frames and then permits gaps between the resulting amino
acids. The ‘y’ form more generally permits gaps within and between codons. The related tfastx and tfasty forms
compare a protein query sequence to a DNA database by translating the DNA database in all six frames.
I will illustrate what a FASTA type of search is and what the results look like with an example. Basically the idea is to
search through the complete database for any possible similar sequence.
8.1.1.1 Instructions
To carry out this type of search on the EMBL server the following must be done. Either point your web browser to FASTA3
and fill out the appropriate forms or set up a file containing the following
LIB UNIPROT
WORD 1
LIST 50
TITLE HALHA
HISTOGRAM yes
SEQ
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
DYLQNRVI
The first line names the data library to be searched (in this case all known protein entries). For protein searches this field may be one of several library codes; the full list is given in the server documentation.
The second line gives the word size or k-tuple value (more on this below). The third line says to LIST on the output the
top 50 scores. The TITLE line is used for the subject of the mail message. Finally SEQ implies that everything below this
line to the end of the message is part of the sequence. In this case the sequence is the protein sequence of the ferredoxin
gene of Halobacterium species NRC-1.
The remaining options are: LIST n, list the n top scores in the output [50]; ALIGN n, align the top n to the query sequence [10]; ONE, compare only the given strand to the database (the default is to use the complementary strand as well); PROT, force your query sequence to be interpreted as a protein (short protein sequences may otherwise be misinterpreted as DNA); and PATH string, mail the results back to string rather than to the originator of the message.
After creating this file, mail it to [email protected] and the results will be sent back to you by electronic mail. Alternatively, simply point your web browser to FASTA3 and fill in the forms (they have the same options). As a courtesy to others using the system, please send only one job at a time; many other people from all over the world are using these servers, and the FASTA program is quite computer intensive despite its speed.
An example of the output is shown below. The input file specifies the Halobacterium species NRC-1 ferredoxin amino acid sequence as a query against the SWISS-PROT database.
FASTA searches a protein or DNA sequence data bank
version 3.4t23 March 18, 2004
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
1>>>HALHA - 128 aa
vs UniProt library
opt E()
< 20 1429 0:=
22 30 0:= one = represents 2608 library sequences
24 50 1:*
26 246 33:*
28 917 356:*
30 3937 2160:*=
32 13990 8352:===*==
34 33145 22650:========*====
36 63533 46517:=================*=======
38 94343 76875:=============================*=======
40 124411 107234:=========================================*======
42 156468 131081:==================================================*=========
44 150019 144594:=======================================================*==
46 141925 147273:======================================================= *
48 131289 140997:=================================================== *
50 124662 128660:================================================ *
52 106819 113114:========================================= *
54 83834 96619:================================= *
56 69228 80707:=========================== *
58 58053 66259:======================= *
60 46296 53673:================== *
62 36284 43030:============== *
64 29298 34222:============ *
66 22040 27048:========= *
68 17474 21275:======= *
70 13385 16672:======*
72 10307 13028:====*
74 7966 10157:===*
76 6188 7906:===*
78 4631 6145:==*
80 3596 4772:=*
82 2939 3650:=*
84 2320 2891:=*
86 1609 2237:*
88 1387 1731:* inset = represents 13 library sequences
90 920 1339:*
92 642 1036:* :=======================================*
94 524 802:* :=======================================*
96 393 620:* :=============================== *
98 284 480:* :====================== *
100 257 371:* :==================== *
102 209 287:* :================= *
104 124 222:* :========== *
106 97 172:* :======== *
108 91 133:* :======= *
110 83 103:* :=======*
112 46 80:* :==== *
114 39 62:* :=== *
116 38 48:* :===*
118 25 37:* :==*
>120 604 29:* :==*=====================================
501934690 residues in 1568424 sequences
statistics extrapolated from 60000 to 1567839 sequences
Expectation_n fit: rho(ln(x))= 5.0598+/-0.000193; mu= 9.7296+/- 0.011
mean_var=58.3610+/-12.418, 0’s: 151 Z-trim: 257 B-trim: 877 in 1/64
Lambda= 0.167885
Kolmogorov-Smirnov statistic: 0.0646 (N=29) at 44
FASTA (3.47 Mar 2004) function [optimized, BL50 matrix (15:-5)] ktup: 1
join: 42, opt: 30, open/ext: -10/-2, width: 32
Scan time: 272.383
The best scores are: opt bits E(1568424)
10 20 30 40 50 60
HALHA PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
UNIPRO PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
10 20 30 40 50 60
HALHA DYLQNRVI
::::::::
UNIPRO DYLQNRVI
10 20 30 40 50
HALHA PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDW
:::::::::..::.:::: :::.: .:.: :::::::..:: ::::::::::::::::
UNIPRO MPTVEYLNYEVVDDNGWDMYDDDVFAEASDMDLDGEDYGSLEVNEGEYILEAAEAQGYDW
10 20 30 40 50 60
60 70 80 90 100 110
HALHA PFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKH
:::::::::::::.:: ::.:::::::::::::::.:.::::::::: ::::::::::::
UNIPRO PFSCRAGACANCAAIVLEGDIDMDMQQILSDEEVEDKNVRLTCIGSPDADEVKIVYNAKH
70 80 90 100 110 120
120
HALHA LDYLQNRVI
:::::::::
UNIPRO LDYLQNRVI
10 20 30 40 50 60
HALHA PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
:::::::::..::.:::: :::.: .:.: :: ::::..:: :::::::::::::::::
UNIPRO PTVEYLNYEVVDDNGWDMYDDDVFGEASDMDLDDEDYGSLEVNEGEYILEAAEAQGYDWP
10 20 30 40 50 60
HALHA DYLQNRVI
::::::::
UNIPRO DYLQNRVI
10 20 30 40 50
HALHA TVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYG---TMEVAEGEYILEAAEAQGYD
: .:..: :.:: . ::::..:: .: :
UNIPRO ASYKVTLINEEMGLNETIEVPDDEYILDVAEEEGID
10 20 30
60 70 80 90 100 110
HALHA WPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAK
:.:::::::..::. .::::::.. :..:.:...: : :::.. ::.: . :... .
UNIPRO LPYSCRAGACSTCAGKIKEGEIDQSDQSFLDDDQIEAGYV-LTCVAYPASDCTIITHQEE
40 50 60 70 80 90
120
HALHA HLDYLQNRVI
.:
UNIPRO ELY
......................................................................
....................... Material Deleted ...........................
......................................................................
10 20 30 40 50 60
HALHA ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
:.. . :::.::: : : :.:::::::
UNIPRO ATYKVTLISEAEGINETIDCDDDTYILDAAEEAGLDLPYSCRAGAC
10 20 30 40
initn: 212 init1: 176 opt: 224 Z-score: 297.1 bits: 60.8 E(): 1.5e-08
Smith-Waterman score: 224; 39.394% identity (41.053% ungapped) in 99 aa overlap (23-121:59-153)
10 20 30 40 50
HALHA PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAA
...:. .: :: .. .:. . ::::::
UNIPRO NTLSFAGHARQAARASGPRLSSRFVASAAAVLHKVKLVGPDGTEH-EFEAPDDTYILEAA
30 40 50 60 70 80
60 70 80 90 100 110
HALHA EAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVK
:. : . :::::::.:..::. .. ::.:.. ..:.: .. : . ::::. : :: :
UNIPRO ETAGVELPFSCRAGSCSTCAGRMSAGEVDQSEGSFLDDGQMAEGYL-LTCISYPKADCV-
90 100 110 120 130 140
120
HALHA IVYNAKHLDYLQNRVI
... :. :
UNIPRO -IHTHKEEDLY
150
10 20 30 40 50 60
HALHA TLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACA
.:: . :::. :: .: : :.:::::.:.
UNIPRO ATYKVKLVTPEGEVELEVPDDVYILDQAEEEGIDLPYSCRAGSCS
10 20 30 40
The textual output shown above is only one of the possible outputs available. In addition to the textual output, you can request an MVIEW (a multiple alignment view) as in Figure 8.1 or a visual fasta view (a graphical version of the significance) as in Figure 8.2.
The textual output from the FASTA search begins with some informational messages. This includes the reference that you
should cite, the version number of the program and the libraries that were searched. In this case, an optional histogram
(lying on its side) has been requested of the number of sequences found with various scores. Each equal symbol in this
histogram is an indicator of 2608 sequences and the asterisk indicates the expected number. The tail of the distribution is
expanded in the inset. Here each equal symbol represents 13 sequences. This histogram gives you an indication of how
similar the query sequence is to some of the database sequences. For a query sequence that has found a significant match,
it should be well out of the tail of the distribution. In this example there are many sequences with scores larger than 120
and they are more frequent than expected by chance. These are related ferredoxin sequences from other species.
Next comes some information about the size of the database searched (note the size of the numbers) and some statistics
about the search. Next comes a section that lists the sequences (along with their locus names) that have the best scores.
Finally there is a section that lists the alignments that have been found by the program.
To carry out a database search in this manner, the algorithm first establishes a table containing words from the database sequences (e.g. ATCGGA, ACCCTG, GTCACA, . . . for nucleotides or MK, RS, CP, . . . for proteins). This type of preprocessing of the entire database is necessary to speed the subsequent search. The table is then sorted in alphabetical order, which allows matching words from the query sequence to be found rapidly.

Figure 8.1: The MVIEW output from http://www.ebi.ac.uk/fasta for the ferredoxin data

Figure 8.2: The VISUALFASTA output from http://www.ebi.ac.uk/fasta for the ferredoxin data

The length of these words is set by the WORD or k-tuple parameter value. By default it is 6 for nucleic acids and 2 for amino acid searches. A lower k-tuple will give a more sensitive search but will take much longer. Although a range of 3 to 6 is permitted for nucleic acids, a lower value is generally unnecessary. All places where a k-tuple from the query sequence matches one in a database sequence perfectly are determined, and then those regions with the highest density of these identities are found.
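The idea behind the lookup table is just an index from each word to the positions where it occurs. A simplified Python sketch (FASTA's actual implementation differs; this shows only the word-matching step):

from collections import defaultdict

def word_table(seq, k):
    """Index every k-tuple in a sequence by its starting positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table

def word_hits(query, library_seq, k):
    """All (query_pos, library_pos) pairs with a perfect k-tuple match."""
    table = word_table(library_seq, k)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in table.get(query[i:i + k], [])]

print(word_hits("GTCACA", "ATGTCACAGGTCAC", k=4))

Regions where many of these hits fall along the same diagonal are the high-identity regions that the program then rescores.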
In comparing a query sequence to the database three scores are calculated for each and every entry in the database. These
scores are init1, initn and opt. An init1 score is assigned to each of these regions of high similarity after the regions are
extended at the ends to include regions shorter than the length of a k-tuple and after using a BLOSUM50 matrix (alternative
distance matrices are available – more on these later) to score mismatches.
The algorithm then attempts to join groups of these larger regions together, and an initn score is generated from the result. This is done by setting initn equal to the sum of the init1 scores of the joined regions (the final init1 score of a sequence is the maximum init1 score from all interior regions), with a constant of 20 subtracted as a joining penalty. If the resulting initn score is less than one of the init1 scores, it is discarded: the regions are not joined and the initn score is set equal to the maximum init1 score (hence initn is greater than or equal to init1).
Sequences that have an initn score larger than a cutoff value (usually 50 but this can be altered with a “LIST n” command in
the query file) are then used for a Smith-Waterman alignment (see the section on alignments) and an opt score is generated
from these alignments. Only the region considered significant by the program is displayed. In these alignments, the name
of the sequence will be presented, the scores, and the percent similarity over the region aligned. In general the length of
the region aligned is a better indicator of homology than is the percent similarity. This is because large percentages can be
found in short regions just by chance. A ‘:’ is used to indicate a complete match, a ‘.’ to indicate a conservative amino acid
replacement, and a ‘-’ to indicate a deletion/insertion.
Note that the opt score can be lower than the initn score. This will happen when one sequence has two (or more) regions of
high similarity separated by regions that have little/no homology. The two regions are joined with high init1 scores and the
initn score is high because the gap penalty/join penalty is not sufficiently large. In contrast sequences with a large number
of poorly similar regions will have low init1 scores but high initn scores and then low opt scores. In general, unless a very
short sequence is used, the init1 score should be much improved by the opt score for truly significant sequences. Lastly
a z-score based on estimates of the statistical significance of the opt scores is presented. This estimates the probability of
obtaining opt scores as good or better by chance between unrelated sequences (see below).
Remember to remove repetitive sequences from your query otherwise you will get a lot of false hits. The FASTA program
itself can be obtained via anonymous ftp if desired.
Since version 2.0 of the FASTA program distribution, FASTA, TFASTA, and SSEARCH will provide estimates of statistical
significance for library searches. Work by Altschul, Arratia, Karlin, Mott, Waterman, and others (see Altschul et al. 1994
Nature Genetics 6:119-129 for an excellent review) shows that local sequence similarity scores follow an extreme value
distribution. The probability of a database match score larger than x arising by chance alone is therefore
$$P(s \geq x) = 1 - e^{-e^{-\lambda(x-u)}},$$
which shows that the probability of observing larger scores for unrelated library sequences increases logarithmically with
the length of the library sequence (Pearson - FASTA documentation).
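In code, the tail probability is one line; the λ and u below are illustrative placeholder values only, since in a real search they are fitted to the score distribution:

import math

def p_score_at_least(x, lam, u):
    """Extreme-value tail: P(s >= x) = 1 - exp(-exp(-lam * (x - u)))."""
    return 1.0 - math.exp(-math.exp(-lam * (x - u)))

lam, u = 0.168, 40.0        # illustrative values only
for x in (40, 60, 80):
    print(x, p_score_at_least(x, lam, u))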
FASTA and SSEARCH produce gapped alignments and hence use a simple linear regression against the log of the library
sequence length to calculate a normalized “z-score” with mean 50, regardless of library sequence length, and variance 10.
These z-scores can then be used with the extreme value distribution and the Poisson distribution (to account for the fact that
each library sequence comparison is an independent test) to calculate the expected number of library sequences required
to obtain a score greater than or equal to the score obtained in the search (Pearson - FASTA documentation).
The expected number of sequences is plotted in the histogram using an ‘∗’. Since the parameters for the extreme value
distribution are not calculated directly from the distribution of similarity scores, the pattern of ‘∗’s in the histogram gives
a qualitative view of how well the statistical theory fits the similarity scores calculated by FASTA and SSEARCH. For
FASTA, optimized scores are calculated for each sequence in the database and the agreement between the actual distribution
of “z-scores” and the expected distribution based on the length dependence of the score and the extreme value distribution is
usually very good. Likewise, the distribution of SSEARCH Smith-Waterman scores typically agrees closely with the actual
distribution of “z-scores.” The agreement with unoptimized scores, ktup = 2, is often not very good, with too many high
scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length
and similarity score. In those cases, the expectation values may be overestimates (Pearson - FASTA documentation).
The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not the case,
then the expectation values are meaningless. Likewise, if there are fewer than 20 sequences in the library, the statistical
calculations are not done (Pearson - FASTA documentation).
The online FASTA - nucleotide / FASTA - protein help at EBI can be consulted for further information.
8.1.2 BLAST
While FASTA is a sensitive and rapid algorithm to search for similar sequences in the database it is not without problems.
Because its initial step looks for perfect matches it might be less sensitive to more distantly related sequences that have
functional homology but no longer retain complete identity. If an amino acid sequence has had many conserved replacements but no longer has identities then the FASTA algorithm might not identify these as well as it should. Fortunately,
alignments where there are extensive regions of low but not exact similarity are rare enough that a small WORD or k-tuple
size will pick up most regions.
A different algorithm which improves upon FASTA in speed is termed BLAST (Basic Local Alignment Search Tool). This
began with a statistical paper by Karlin and Altschul (PNAS 87:2264-2268, 1990) who developed a rigorous method to
obtain the probabilities of matches with a query sequence given that no gaps are permitted. This permits the use of larger
WORD or k-tuple sizes with the concomitant increase in speed but permitting inexact matches between WORDs. The
statistical developments permit this to be done without loss of sensitivity and allow rigorous statistical statements to be
made about the matches found.
As a result of these developments Altschul, Gish, Miller, Myers and Lipman (J.Mol.Biol. 215:403-415, 1990) created
the BLAST group of programs. These algorithms find ungapped, locally optimal sequence alignments. There are several
versions of the BLAST programs. Some are BLASTN (a nucleotide query against a nucleotide database), BLASTP (a protein query against a protein database), BLASTX (a translated nucleotide query against a protein database), TBLASTN (a protein query against a translated nucleotide database), TBLASTX (a translated nucleotide query against a translated nucleotide database), MEGA-BLAST and discontiguous MEGA-BLAST. The last two use a different algorithm than does BLASTN. The program MEGA-BLAST uses a "greedy algorithm" for nucleotide sequence alignment search and is designed to find sequences that differ slightly from the query sequence. Hence it is best at identifying something "similar" in the database without concern about distant homologies. It is much faster than BLASTN and by default uses a much larger k-tuple. The program discontiguous MEGA-BLAST increases
sensitivity to diverged sequences by using a discontiguous word as the initial match from which extensions are performed
(see below).
To carry out this type of search go to the NCBI BLAST web server, select the desired program and fill out the forms.
Most of the options will take standard default values. The database for example, has a default of “nr”. This means that it
will search the non-redundant database (it includes sequences from PDB, GenBank, GenBank updates, EMBL and EMBL
updates or sequences from PDB, SWISS-PROT, PIR, GenPept and GenPept updates) but there are many others that can be
chosen instead. In addition you can choose to search only specific groups of organisms, or to search sequences that originated from only one organism. Filters will mask parts of your query so that things like repetitive elements are ignored (filter seq will exclude regions of low compositional complexity; filter dust is a modernized filter version that, at the time of this writing, had not yet been described in the literature; other filters will exclude regions with repetitive elements). It is also possible to select DESCRIPTIONS n, the number of described matching sequences [100]; ALIGNMENTS n, the number of high scoring pairs [50]; EXPECT n, the score such that n sequences should be found by chance alone [10] (a fractional value of one or less will give only output which is statistically unusual; larger values give more output); and the WORD size used for initial matches. Other options are available.
More information about the programs and their output can be obtained from NCBI’s BLAST site including a
• BLAST overview
• BLAST FAQs
• BLAST course
• BLAST handbook
The BLAST programs themselves can be obtained if desired by anonymous ftp to NCBI (with more options possible
(and permissible)) and if desired, a network client that works directly through TCP/IP connections (hence, no web browser
required) can be obtained as BLASTcl3 from the ftp site.
Typical BLAST output appears as in Figure 8.3 (this search was done on Jan 19 2002 with an APRT gene from Mus pahari
as the query).
Each of the blue-highlighted pieces of text is a link that leads directly to the entry in the database that matches the query.
There is a diagram at the top of the entry that graphically demonstrates the hits and how they align with the query sequence.
It is colour coded according to the statistical level of the match. In this diagram regions of low match are in gray hatch-
marks. Note that even though the query sequence is in the database, there are these hatch-marks in the first matching
sequence. This is because these sequence regions contain low complexity DNA (e.g. J.C. Wootton, 1994 Comput Chem
18:269-285) that would disrupt the statistical measures of similarity and hence they have been excluded by default from
the match (this behaviour can be altered ... see above).
After the listing of hits comes a section that lists the match between the query sequence and the database match.
Alignments
......................................................................
....................... Material Deleted ...........................
......................................................................
Query: 1 aagcttgtgctaaacaactgctgtataccaggctccatgcttgagcttcagaaacaccct 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 862 aagcttgtgctaaacaactgctgtataccaggctccatgcttgagcttcagaaacaccct 921
......................................................................
....................... Material Deleted ...........................
......................................................................
......................................................................
....................... Material Deleted ...........................
......................................................................
Note that this alignment might be in pieces as demonstrated above even for the database entry which is a perfect match.
Further down the listing will be generally shorter matches such as . . .
>gi|17221275|emb|AL645588.7|AL645588 Mouse DNA sequence from clone RP23-452K19 on chromosome 11, complete
sequence [Mus musculus]
Length = 5004
>gi|12849531|dbj|AK012648.1|AK012648 Mus musculus 10, 11 days embryo whole body cDNA, RIKEN full-length
enriched library, clone:2810002N01:related to Y39B6B.P
PROTEIN, full insert sequence
Length = 1026
and finally at the bottom of the entry will be some statistics about the search . . .
Lambda K H
1.37 0.711 1.31
Gapped
Lambda K H
1.37 0.711 1.31
The program output consists of four parts. The first part is a graphical diagram of the top matches to the query sequence. The second is a listing of the best matches (along with links to their database entries), their scores and their E values. The E-value is an estimate of how many matches as good or better would occur by chance alone in a database of this size. The third part is an alignment of the matches with the query sequence. The fourth part is a listing of the parameters used and some statistics of the search. Some of these parameters can be changed (see the documentation for more information) but others cannot. NCBI is aware of the tradeoffs in speed versus sensitivity and attempts to offer a service with the most sensitive parameter settings that its machines can handle.
Remember that BLAST will find matches of ungapped strings. There may be more than one “ungapped” region that give
an unusually large score. These multiple regions are not ignored but rather attempts are made to put them together to yield
a lower overall probability. The statistics for the ungapped strings are well worked out, but the statistics for gapped matches
are still not well understood.
The BLAST algorithms are capable of searching through the entire database within just a few seconds; their speed is impressive. BLAST requires time proportional to the product of the query sequence length and the length of the database. The databases are growing far more quickly than are improvements in the speed of computers or in the design of the algorithms.
The particular example shown above is a search of the database for homologues to the Mus pahari APRT sequence. You
will note that the algorithm has done a good job at finding these homologues. The next match is the APRT gene of Mus
spicilegus (a closely related species, with a correspondingly closely related APRT sequence) and not surprisingly it has an expect value of 1 × 10−119. That is, in a database of this size you expect to see 1 × 10−119 matches as good or better than this one just by chance (effectively zero).
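Since the number of chance hits at a given score is modelled as Poisson with mean E, the probability of seeing at least one such hit follows directly; a one-line Python illustration:

import math

def p_at_least_one(expect):
    """P(>= 1 chance hit) = 1 - exp(-E); expm1 keeps tiny E accurate."""
    return -math.expm1(-expect)

for E in (10.0, 1.0, 0.05, 1e-119):
    print(E, p_at_least_one(E))

For very small E the probability is essentially E itself, which is why tiny expect values can be read directly as probabilities.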
An older search of the same sequence found the same matches and if you continued down the list you would see . . .
                                                                   Smallest
                                                                     Sum
                                                            High  Probability
Sequences producing High-scoring Segment Pairs:            Score  P(N)      N
......................................................................
....................... Material Deleted ...........................
......................................................................
So as you go down the list you find more APRT genes but also, later on, some oncogene – c-abl. So now you get all
excited — we have discovered a new class of genes involved in cancer! Major advance . . . international acclaim, . . . Nobel
Prize!! But wait, we must be cautious here, what do the statistics say. Well for this c-abl gene the match has a probability
of 3.6 × 10−27 of occurring by chance alone. So we are home free; that is significant in anyone’s statistics book. But no,
life is seldom so exciting. As you continue to scan the list, you find cytochrome C, membrane proteins, growth factors, and
all sorts of other genes all with apparent significant homology to the query sequence. What is going on?
Remember that BLAST (and any of the other algorithms) search for similarity not of the entire sequence but rather for any
piece of the query sequence. Examining the regions of significant match between the database sequences and the query
sequence indicates that these are consistently from approximately nucleotides 302 to 431 but not generally outside of this
region. This region encodes a very common SINE element in rodents. Hence there is no similarity of the query gene to all
these other genes but there is a significant similarity of the B2 SINE element that is inserted into the APRT gene and the
B2 SINE elements that have been inserted into the other gene sequences. Be careful of the interpretation of your results —
no Nobel prize this time.
Occasionally, other features such as a coiled-coil region or transmembrane regions will cause false positive matches to be predicted. In addition, although not a false match, the results of exon shuffling can copy a motif from one protein to another and might lead one to consider that the entire lengths of these two proteins are homologous (and derived evolutionarily from each other) when it is really only the motif that is similar. Sometimes, functional requirements will cause selection to produce a similar pattern of amino acids, again without homology.
Another common misuse of BLAST is to search for the most similar sequence to some query sequence. But the algorithm
is designed to find similar ungapped subsequences, and to then piece these together. The order in which these sequences
are ranked by score may not correspond to the order of overall similarity of the complete sequences, and certainly may
not correspond to the phylogenetic history of these sequences (Koski & Golding 2001, J.Mol.Evol. 52:540-542). Thus a
sequence with a higher score may not be more ‘similar’ to the query sequence than another sequence with a lower score
(more later on what is meant by similarity). It is quite possible for the overall similarity to be greater for a sequence with
a lower BLAST score. A sequence may also be more closely related in terms of history to the query than some other
sequence with a lower score.
8.1.3 MPsrch
Discontinued
The MPsrch server at EBI runs on an HP/COMPAQ computer cluster. It uses the Smith-Waterman local similarity algorithm (see section 6.2.2 for a description of this alignment algorithm) to compare the query sequence against the Swiss-Prot database. The advantage of this algorithm is that it "is recognised as the most sensitive sequence comparison method available, whereas BLAST and FASTA utilise a heuristic one. As a consequence, MPsrch is capable of identifying hits in cases where Blast and Fasta fail and also reports fewer false-positive hits." It will only run searches for proteins and not for nucleotides, due to the time involved but also due to the discreteness of proteins. The speed achieved by MPsrch comes mainly from running on a "massively" parallel computer; because of this, it was claimed that "MPsrch is the fastest implementation of the SW algorithm currently available on any machine". Many molecular biology problems lend themselves to parallel architecture computers: for many problems, intermediate steps can be effectively calculated without the need to know results from previous steps. Each of these independent steps can be given to a different processor and solved on its own. Special software has been developed for parallel computers to manage communication among individual processors and to delegate jobs to each one.
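A toy illustration of this style of parallelism in Python (the scoring function is a stand-in; a real server would run a full Smith-Waterman comparison in its place):

from concurrent.futures import ProcessPoolExecutor

def score(pair):
    """Stand-in for one independent query-vs-library comparison."""
    query, target = pair
    return sum(a == b for a, b in zip(query, target))   # toy identity score

def parallel_search(query, library, workers=4):
    # Each comparison is independent, so it can be farmed out to a
    # separate process with no communication between steps.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, [(query, t) for t in library]))

if __name__ == "__main__":
    library = ["PTVEYLNYET", "PTVEYLNAET", "AAAAAAAAAA"]
    print(parallel_search("PTVEYLNYET", library))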
The input sequence
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
DYLQNRVI
was given to the website of MPsrch at http://www.ebi.ac.uk/MPsrch/index.html. The input webpage for MPsrch is shown
in Figure 8.4. It provides several options that you should explore. Note in particular the database search options. In the example used below I selected the database UNIPROT, but for initial explorations you should try the UNIREF## databases. These eliminate proteins that are within ## percent similarity of one another (where ## is 100, 90 or 50), which will speed an already rapid search.
***************************************************************************
***************************************************************************
Aneda Limited.
Release 4.2.80 Copyright (c)
Run on: Tue Aug 9 16:25:37 2005; Search Time 4.80 secs
1690.21 Million Cell updates/sec
Title: >Sequence
Description: No description found
Perfect Score: 1125
Sequence: 1 ptveylnyetlddqgwdmdddd.......adevkivynakhldylqnrvi 128
Scoring table: PAM 100
Gap 14
Database: uniprot
1 uniprot_sprot
2 uniprot_trembl1
3 uniprot_trembl2
SUMMARIES
%
Result Query
No. Score Match Length DB ID Description Pred. No.
---------------------------------------------------------------------------
1 1125 100.0 128 1 FER_HALSA Ferredoxin. 3.48e-265
2 978 86.9 129 2 Q9YGB6_HALJ Ferredoxin. 4.74e-225
3 967 86.0 128 1 FER1_HALMA Ferredoxin 1. 4.67e-222
4 546 48.5 138 1 FER2_HALMA Ferredoxin 2. 4.52e-109
5 319 28.4 98 1 FER_SYNP4 Ferredoxin. 8.96e-51
6 305 27.1 98 1 FER1_ANAVA Ferredoxin I. 2.59e-47
7 302 26.8 98 1 FER1_ANASP Ferredoxin I. 1.42e-46
8 302 26.8 98 1 FER1_ANASO Ferredoxin I. 1.42e-46
9 301 26.8 97 1 FER_SYNEL Ferredoxin I. 2.50e-46
10 301 26.8 97 1 FER_SYNEN Ferredoxin I. 2.50e-46
11 301 26.8 97 1 FER_SYNVU Ferredoxin I. 2.50e-46
12 298 26.5 96 1 FER_SYNLI Ferredoxin. 1.37e-45
13 297 26.4 98 1 FER_NOSMU Ferredoxin. 2.41e-45
14 294 26.1 98 1 FER2_NOSMU Ferredoxin II. 1.31e-44
15 290 25.8 99 1 FER1_PLEBO Ferredoxin I (FdI). 1.25e-43
16 290 25.8 99 2 Q7DK20_PLEB Ferredoxin. 1.25e-43
17 288 25.6 99 3 Q7U8S7_SYNP Ferredoxin. 3.84e-43
18 286 25.4 99 3 Q7VAM6_PROM Ferredoxin. 1.18e-42
19 284 25.2 99 3 Q7V0B6_PROM Ferredoxin. 3.64e-42
20 282 25.1 98 2 Q7M191_SYNS Ferredoxin. 1.12e-41
ALIGNMENTS
(Maximum 20 Alignments, Predicted No. Cut-off is OFF )
RESULT 1
ID FER_HALSA STANDARD; PRT; 128 AA.
DE Ferredoxin.
************************************************************
Db 1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60
Qy 1 ptveylnyetlddqgwdmddddlfekaadagldgedygtmevaegeyileaaeaqgydwp 60
************************************************************
Db 61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120
Qy 61 fscragacancasivkegeidmdmqqilsdeeveekdvrltcigspaadevkivynakhl 120
********
Db 121 DYLQNRVI 128
Qy 121 dylqnrvi 128
RESULT 2
ID Q9YGB6_HALJP PRELIMINARY; PRT; 129 AA.
DE Ferredoxin.
********
Db 122 DYLQNRVI 129
Qy 121 dylqnrvi 128
RESULT 3
ID FER1_HALMA STANDARD; PRT; 128 AA.
DE Ferredoxin 1.
********
Db 121 DYLQNRVI 128
Qy 121 dylqnrvi 128
RESULT 4
ID FER2_HALMA STANDARD; PRT; 138 AA.
DE Ferredoxin 2.
RESULT 5
ID FER_SYNP4 STANDARD; PRT; 98 AA.
DE Ferredoxin.
***.. **.*
Db 77 -LTCVAYPASD 86
Qy 99 rltcigspaad 109
RESULT 6
ID FER1_ANAVA STANDARD; PRT; 98 AA.
DE Ferredoxin I.
***.. *..* * *
Db 77 -LTCVAYPTSD-VTI 89
Qy 99 rltcigspaadevki 113
......................................................................
....................... Material Deleted ...........................
......................................................................
RESULT 20
ID Q7M191_SYNSP PRELIMINARY; PRT; 98 AA.
DE Ferredoxin.
***.. *..*
Db 77 -LTCVAYPTSD 86
Qy 99 rltcigspaad 109
This particular search took only 5 seconds of CPU time and a total of 101 seconds including input/output. This speed is a great improvement over that achieved by the FASTA algorithm. The web page output is shown in Figure 8.5. This algorithm is as fast as BLASTP and, in addition, it should give a more sensitive search for distant homologies.
The mean and variance of the distribution of scores from the entire database are calculated. These are used to construct
empirical statistics of the predicted number of random matches in the database equal to or better than that found. The
algorithm then lists the best scores (50 of them here, the default for NAMES) and then lists more detailed reports for a
subclass of these (30 here, the default for ALIGN). For each it calculates the raw score, the percent matches, the predicted
number expected, the number of matches, the number of mismatches, the number of partial matches (residue pairs with
a positive score in the PAM matrix), the number of indels and the number of gaps. This program considers these two
differently in that a single gap can be composed of any number of adjacent indels.
In this case all hits have very small “pred. no.” numbers indicating that they each have statistically significant homology
to the ferredoxin query sequence (not too surprising since they are all different ferredoxins). Also note that the Smith-
Waterman alignment algorithm does a best local alignment (more on this later) so the entire query sequence may not be
presented in the output. Rather the part of the sequence that has a good alignment with the database entry is shown. The
sequence is not aligned for regions where the significance of the alignment begins to decline. Hence in the example above,
for the alignment to result #20, only amino acids 17 through 86 from the database sequence and amino acids 39 to 109 from
the query sequence are shown in the alignment, even though the query protein is 128 amino acids in length. The sequence
prior to amino acid 17/39 and after amino acid 86/109 are not considered to be part of the significant local alignment.
8.2 BLOCKS
The FASTA and BLAST servers are often searched for homologues in order to identify the query sequence. The BLOCKS server at http://blocks.fhcrc.org is designed instead to identify chunks of a protein that may encode some function. The BLOCKS server is thus somewhat related to the other servers mentioned above (and hence included here) but is designed to answer a different question. Instead of looking for similar sequences in the databases, it scans a database of protein motif signatures constructed from the INTERPRO database (a collection of protein families, domains and functional sites found in known proteins that can be applied to explore unknown protein sequences). In this way, BLOCKS will search a query sequence (which must be protein; optionally, it will translate your nucleotide sequence to a protein) for similar protein motifs in known proteins. Blocks are defined as short, ungapped (but potentially variable length) segments of highly conserved regions of proteins. As of August 2003 the BLOCKS database website reported that it consists of 8656 block patterns (version 13.0, Aug 2001). This search is particularly useful for analysing distantly related proteins.
The web form to search the BLOCKS database is located at http://blocks.fhcrc.org/blocks/blocks_search.html (references should cite S. Henikoff & J. Henikoff, 1991 Nucl. Acids Res. 19:6565-6572). Again, simply supply the web page with your query sequence.
Since this search only makes sense for proteins, a nucleotide sequence supplied to the server will be translated in all frames. But a nucleotide sequence containing IUPAC ambiguity codes will be interpreted as a protein and will remain untranslated.
> Ferredoxin
GIDPNYRTHKPVVGDSSGHKIYGPVESPKVLGVHGTIVGVDFDLCIADGSCITACPVNVF
QWYETPGHPASEKKADPVNQQACIFCMACVNVCPVAAIDVKPP
The BLOCKS output begins with a lengthy informational message that I have deleted and then continues with the guts of
the message.
Hits
Query=Ferredoxin n
Size=103 Amino Acids
Blocks Searched=11182
Alignments Done= 1439343
Cutoff combined expected value for hits= 1
Cutoff block expected value for repeats/other= 1
==============================================================================
Combined
Family Strand Blocks E-value
PR00353 4Fe-4S ferredoxin signature 1 2 of 2 1.2e-06
PR00354 7Fe ferredoxin signature 1 1 of 3 0.00025
IPB000985 Legume lectin alpha domain 1 1 of 7 0.58
==============================================================================
>PR00353 2/2 blocks Combined E-value= 1.2e-06: 4Fe-4S ferredoxin signature
Block Frame Location (aa) Block E-value
PR00353A 0 76-87 4.5
PR00353B 0 88-99 0.00014
Other reported alignments:
------------------------------------------------------------------------------
>PR00354 1/3 blocks Combined E-value= 0.00025: 7Fe ferredoxin signature
Block Frame Location (aa) Block E-value
PR00354C 0 78-95 0.00027
Other reported alignments:
PR00354C 0 40-57 0.0018
------------------------------------------------------------------------------
>IPB000985 1/7 blocks Combined E-value= 0.58: Legume lectin alpha domain
Block Frame Location (aa) Block E-value
IPB000985D 0 36-45 0.67
Other reported alignments:
------------------------------------------------------------------------------
In this case, for ferredoxin, the program returns three possible hits. These are a 4Fe-4S ferredoxin, a 7Fe ferredoxin and a
legume lectin alpha domain. The first signature consists of two parts (two blocks), the signature for the second hit consists
of three parts (but only one was found in the query sequence) and the signature for the third hit consists of seven parts (but
again only one is present in the query sequence). Each of these blocks is labelled A, B, C, etc. The E-values are calculated
(as per the BLAST searches) to represent the expected number of hits with as good a similarity or better in a database of
this size. Hence the last hit to a legume lectin alpha domain is probably just noise.
After this initial presentation, the program returns a diagram of hits. So in the first hit, the first block (A) can typically
begin anywhere from the 1st to the 571st amino acid (in bona fide proteins with this signature). In our query it begins at
position 76. The second block (B) can occur anywhere from 0 to 338 amino acids distant from the first block. In our query
sequence it is 0 amino acids away. Alignments of each of these blocks to a best match are shown.
For the second hit, the query contains two possible locations for the “C” block but none of the other blocks. For the third
hit, only the “D” block is present.
Block PR00353A
ID 4FE4SFRDOXIN; BLOCK
AC PR00353A; distance from previous block=(1,571)
DE 4Fe-4S ferredoxin signature
BL adapted; width=12; seqs=171; 99.5%=733; strength=1118
P81293 ( 275) YVIDEDLCIGCR 17
FER_CLOSP|P00197 ( 30) RVIDADKCIDCG 21
O27769 ( 62) VVILEDRCIGCG 41
O28894 ( 233) TYVDWDKCIGCG 30
FER_CLOAC|P00198 ( 30) YVIDADTCIDCG 15
FER_BACSC|Q45560 ( 32) YYIDPDVCIDCG 26
Q59575 ( 147) IEIDKDTCIYCG 18
FER2_DESDN|P00211 ( 5) VIVDSDKCIGCG 21
O30081 ( 6) IAIDEEKCIGCG 18
O74028 ( 147) IEIDKDTCIYCG 18
FDXH_HAEIN|P44450 ( 132) VDFQSDKCIGCG 55
O26505 ( 164) AVVDESICIGCG 26
......................................................................
........................ Material Deleted ...........................
......................................................................
O29066 ( 9) FVHDRRKCIGCY 81
Q03195 ( 48) AFISEILCIGCG 65
FER1_RHOCA|P16021 ( 2) MKIDPELCTSCG 48
O28624 ( 73) LIVDESLCVGCG 20
P73811 ( 77) IVIDDQSCVDCG 41
Q46606 ( 145) VVRDMGKCIRCL 78
Y719_METJA|Q58129 ( 55) PVISEVLCSGCG 63
O28573 ( 62) AVVNYNYCKGCG 28
O27592 ( 556) YMIDPEKCDGCM 92
P74022 ( 141) FGIDHNRCILCT 59
//
Block PR00353B
ID 4FE4SFRDOXIN; BLOCK
AC PR00353B; distance from previous block=(0,338)
DE 4Fe-4S ferredoxin signature
BL adapted; width=12; seqs=171; 99.5%=728; strength=1179
P81293 ( 318) ACARECPVGAIK 11
FER_CLOSP|P00197 ( 42) ACANTCPVDAIV 11
O27769 ( 74) LCRDACPVGAIT 17
O28894 ( 312) PCEKACPTGAIN 13
FER_CLOAC|P00198 ( 42) ACAGVCPVDAPV 15
FER_BACSC|Q45560 ( 44) ACEAVCPVSAIY 17
Q59575 ( 313) ACERSCPVNAIE 11
FER2_DESDN|P00211 ( 47) SCIEVCPQNAIV 20
O30081 ( 18) RCVNSCPTGALV 16
O74028 ( 313) ACERSCPVTAIT 21
FDXH_HAEIN|P44450 ( 180) ACVKTCPTGAIR 12
O26505 ( 213) VCEENCPTGAIR 17
FER_CLOTM|P07508 ( 42) ACANVCPVDAPQ 14
......................................................................
........................ Material Deleted ...........................
......................................................................
This provides a short description of the parts of each block and then representative sequences that contain these blocks (with
links to each sequence, the position of the first residue in the block, the block itself and a weighting score). This information
can be seen in graphical format as shown in Figure 8.6.
In addition you can get more data about the blocks through the PROSITE database link for this entry
PROSITE: PS00198
ID 4FE4S_FERREDOXIN; PATTERN.
AC PS00198;
DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE 4Fe-4S ferredoxins, iron-sulfur binding region signature.
PA C-x(2)-C-x(2)-C-x(3)-C-[PEG].
NR /RELEASE=41.21,133312;
NR /TOTAL=523(348); /POSITIVE=482(318); /UNKNOWN=2(2); /FALSE_POS=39(28);
NR /FALSE_NEG=16; /PARTIAL=5;
CC /TAXO-RANGE=A?EP?; /MAX-REPEAT=6;
CC /SITE=1,iron_sulfur; /SITE=3,iron_sulfur; /SITE=5,iron_sulfur;
CC /SITE=7,iron_sulfur;
DR P37127, AEGA_ECOLI, T; P26474, ASRA_SALTY, T; P26476, ASRC_SALTY, T;
DR P31894, COOF_RHORU, T; Q49161, DCA1_METMA, T; Q49163, DCA2_METMA, T;
DR Q57617, DCMA_METJA, T; P26692, DCMA_METSO, T; O27743, DCMA_METTH, T;
DR P08066, DHSB_BACSU, T; Q09545, DHSB_CAEEL, T; P48932, DHSB_CHOCR, T;
DR P51053, DHSB_COXBU, T; P48933, DHSB_CYACA, T; P21914, DHSB_DROME, T;
DR P07014, DHSB_ECOLI, T; P21912, DHSB_HUMAN, T; O42772, DHSB_MYCGR, T;
......................................................................
........................ Material Deleted ...........................
......................................................................
A number of proteins have been found [3] that include one or more 4Fe-4S
binding domains similar to those of bacterial-type ferredoxins. These proteins
are listed below (references are only provided for recently determined
sequences).
- The chloroplast frxB protein which is predicted to carry two 4Fe-4S centers.
- A ferredoxin from a primitive eukaryote, the enteric amoeba Entamoeba
histolytica.
- Escherichia coli hypothetical protein yjjW, a protein with an N-terminal
region belonging to the radical activating enzymes family (see <PDOC00834>)
and two potential 4Fe-4S centers.
This is probably more information about ferredoxin than you would ever want. But should you desire more, there are links
to the INTERPRO entry for this domain.
There is even a link to give a graphical interpretation of the block’s taxonomic diversity and graphical demonstrations of
the block’s location within proteins as shown in Figure 8.7.
A really great resource.
8.3 SSearch
At the extreme slow end of the database searchers is SSEARCH. This does a full sequence comparison using the Smith-
Waterman algorithm (T.F. Smith and M.S. Waterman, J.Mol.Biol. 147:195-197, 1981). That is, it performs a completely
rigorous comparison of each sequence with the query sequence. This program uses code developed by X. Huang, R.C.
Hardison, W. Miller (1990 CABIOS 6:373-381) for calculating the local similarity score and code from the ALIGN program
(see below) for calculating the local alignment. SSEARCH is about 100 times slower than FASTA with ktup=2 (for proteins).
The program itself is available for download as part of the FASTA package of programs.
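To make the idea concrete, here is a minimal sketch of the Smith-Waterman local alignment recursion in Python. It is not
the SSEARCH implementation: it scores with a simple match/mismatch scheme and a linear gap penalty rather than a
substitution matrix (such as BLOSUM50) with affine gap penalties, and it reports only the best local score.

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        """Best local alignment score between sequences a and b."""
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                # a local alignment is never forced below zero; it restarts instead
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))

Because every cell of the matrix is filled in, the cost grows as the product of the two sequence lengths, which is exactly why
SSEARCH is so much slower than the heuristic FASTA and BLAST searches.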
A study by Pearson (1995 Protein Science 4:1145-1160) compared the different methods of searching the protein databases.
He found that the complete Smith-Waterman algorithm performed best at finding distantly related homologues, followed by
FASTA and then blastp when using suitable scoring matrices (BLOSUM55 – more on these later) and optimal gap
penalties.
Sir - We have discovered a startling similarity between a dinosaur DNA sequence reported in the novel Jurassic Park1 and
a partial human brain cDNA sequence from the Venter laboratory described in Nature2 (see figure).
The dinosaur sequence (DINO1) consists of a duplication, with 117 base pairs from the first member of the repeat aligning
with the human sequence, HUMXT01431, at the 95 per cent level of identity with only two gaps. The extraordinary degree
of nucleotide sequence conservation between organisms as distantly related as dinosaur and human suggests strongly con-
served function. Expression of HUMXT01431 in human brain raises the possibility that the dinosaurs were smarter than has
been supposed, arguing against the hypothesis that their extinction resulted from lack of intelligence.
Our discovery also seems to raise the interesting legal question as to whether the copyright on Jurassic Park takes precedence
over the pending patent on the human sequence. However, it appears that neither group is entitled to legal protection for its
sequence, because both sequences also align with cloning vector pBR322, raising the possibility that both groups
inadvertently sequenced vector DNA.
Alan C. Christensen, Dept of Biochemistry and Molecular Biology, Thomas Jefferson University, Philadelphia, Pennsylvania,
19107 USA.
Steven Henikoff, Howard Hughes Medical Institute and Basic Sciences Division, Fred Hutchinson Cancer Research Center,
Seattle, Washington 98104 USA.
1 Crichton, M. Jurassic Park, 102 (Ballantine, New York 1990).
2 Adams, M.D. et al., Nature 355, 632-634 (1992).
With such good jokers in the world as these gentlemen are, you don’t want to get caught by them.
Chapter 9
Reconstructing Phylogenies
9.1 Introduction
9.1.1 Purpose
The purpose of phylogenetic reconstruction is to attempt to estimate the phylogeny for some data. For any collection of
data there will be some ancestral relationship between the sampled sequences. The data itself contains information that can
be used to reconstruct or to infer these ancestral relationships. This involves reconstructing a branching structure, termed a
phylogeny or tree, that illustrates the relationships between the sequences.
The following discussion is based mainly on Molecular Evolutionary Genetics by M. Nei, Genetic Data Analysis by B. Weir,
Of URFs and ORFs by R. Doolittle, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit by G. von
Heijne, Molecular Systematics by Hillis & Moritz and J. Felsenstein (1982, Quart.Rev.Biol. 57:379). Refer to these for
more detailed information.
As stated, phylogenetic reconstruction attempts to estimate the phylogeny of some observed data. However, usually people
are more interested in using the data to try to infer the species phylogeny and not just the phylogeny of the data. In general,
these two are not always the same and estimating the species tree may not be possible. Instead what is estimated has been
called a “gene tree” by M.Nei. This is because your data (the sequence of some gene or some other form of data) may not
have had the same phylogenetic history as the species within which they are contained.
Consider the species shown in Figure 9.1 (from Nei, 1987). The boxes represent the actual species and the dots represent
the genes themselves. In the first example, a reconstructed phylogeny based on these genes would yield something similar
to the true species tree. In the second example, the reconstructed phylogeny will provide the same topological tree as
the species tree but the branch lengths will all be quite incorrect. In the third example, the reconstructed phylogeny will
positively give an incorrect phylogeny. It would suggest that species Y and Z are more closely related when in fact X and
Y are more closely related.
All of this stems from the fact that polymorphism can exist within species and the estimated age of many polymorphisms
can be quite old. The problem of estimating the wrong topology will be greater when the true distance between speciation
events A and B is small.
Even if the first situation applies there may still be errors introduced because the number of changes from one species to
Figure 9.1: Three possible relationships between species (X, Y, Z) and the genes they contain (indicated by dots) when
polymorphism is possible at times of speciation (from Nei, 1987). (The three cases drawn here occur with probabilities
$1 - e^{-T/2N}$, $\frac{1}{3}e^{-T/2N}$ and $\frac{2}{3}e^{-T/2N}$ respectively, where T is the time between the
speciation events A and B and N is the population size.)
the next is often a stochastic event and subject to sampling error. Hence unless a large number of sites are examined there
can be large errors introduced. In addition, if the gene is part of a multigene family it may be difficult to determine the
homologous comparable gene in another species. Horizontal gene transfer and gene conversions from unrelated genes are
also assumed not to have occurred.
Finally, ignoring all of these caveats, people still usually consider a phylogenetic reconstruction program to act like a “black
box”. It takes input, churns around for a while and then spits out the actual phylogenetic answer. This is also incorrect.
First, the actual phylogenetic answer can not be obtained by any known method. All methods can only provide estimates
and educated guesses of what a phylogenetic tree might look like for the current set of data. These estimates are only as
good as the data itself and only as good as the algorithm. Some algorithms in common use are actually quite poor methods.
Finally, since these algorithms provide just an estimate, most good methods should also provide an indication of how much
variation there is in these estimates.
The problem of tree reconstruction is quite difficult. This is particularly true if all potential tree topologies must be scored
or otherwise searched. For three species there are only three rooted trees possible: ((B,C),A), ((A,C),B) and ((A,B),C).
While with four species there are a total of fifteen different topologies possible.
(All fifteen rooted topologies for the four taxa A, B, C and D are drawn here.)
For 5 species there are 105 different topologies. More generally, for any strictly bifurcating phylogeny with n species there
are
$$\frac{(2n-3)!}{2^{n-2}(n-2)!} = (2n-3)!!$$
different topologies. This number gets large very quickly$^1$. With n = 15 species there are 213,458,046,676,875
different trees. Obviously if an algorithm must examine all possible trees, then only a handful of species would be
permitted. Even given these, there would be an infinite number of branch length combinations that would have to be searched.
Indeed this problem belongs to a class of problems that are called NP-hard by computer scientists.
These numbers apply to phylogenies that are rooted. That is there is a point of origin for this phylogeny and it appears in
the standard classical fashion that you are probably most familiar with. A phylogeny may also be presented in an unrooted
fashion in which case it is called an unrooted tree or a network. For n species there are only
$$\frac{(2n-5)!}{2^{n-3}(n-3)!} = (2n-5)!!$$
different unrooted trees possible (one step behind the number of rooted trees, since the unrooted count for n species equals
the rooted count for n - 1 species). Any method that purports to provide variable
rates of evolution (or substitution) along each branch should generate its output in the form of an unrooted tree. This is
because when rates of evolution are free to vary there is no way to determine the location of a root for a tree.
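As a quick check of these formulas, the double factorials can be computed directly; the short Python sketch below counts
rooted and unrooted topologies for a few values of n.

    def n_rooted(n):
        """(2n-3)!! rooted bifurcating topologies for n taxa (n >= 3)."""
        count = 1
        for k in range(3, 2 * n - 2, 2):   # 3, 5, ..., 2n-3
            count *= k
        return count

    def n_unrooted(n):
        """(2n-5)!! unrooted topologies; equal to the rooted count for n-1 taxa."""
        return n_rooted(n - 1)

    for n in (3, 4, 5, 15):
        print(n, n_rooted(n), n_unrooted(n))
    # n = 15 gives 213458046676875 rooted and 7905853580625 unrooted topologies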
9.1.3 Terminology
There are whole dictionaries that have been created for this field of science (so that people can talk very precisely about
what they mean - though this still has not helped to avoid many confusions and useless fights). All of the background
$^1$ A tree calculator
and terminology necessary can not be provided here. For the present, I simply want to provide you with a subset of the
terminology that will enable you to understand some of the problems, some of the methods and to be able to read some of
the literature.
The topology of a tree is simply the branching order of the species independent of the branch lengths. Different phylogenies
can have the same topology and yet look quite different due to variable branch lengths. If however, branch lengths are all
drawn to the same scale then two phylogenies might appear similar whether or not they are identical. Note for example,
that
(Several six-taxon trees are drawn here with their tips presented in different orders, e.g. A E C D B F versus F B D C E A;
some pairs are identical in topology and some are not.)
Hence only synapomorphic characters are useful for determining a phylogeny. As an example, suppose that there is a
genus of plants in which one species develops red petals (with the ancestral form being white petals). Suppose it underwent
speciation such that there are now two red-petalled species and that there still exist five white-petalled species. Then white
petals is the plesiomorphic character, red petals is an apomorphic character, the white petals among the five species is a
symplesiomorphic character, the red petals among the two species is a synapomorphic character (and points to these two
species as being phylogenetically related). If another species arose with purple petals, this would be an autapomorphic
character. Note that this depends on being able to identify the primitive or ancestral state. It is generally not possible to
unambiguously determine the direction of change for nucleotide characters (hence some strict adherents would claim that
you can not produce cladograms from this data).
In addition, multistate characters can be either ordered or unordered. Nucleotide sequences are considered to be unordered
since a “C” for example is not necessarily intermediate between “A” and “G”. Again, it is more normal to encounter
ordered characters in the analysis of morphological characters. There is also the concept of character polarity which is
the assessment of the direction of character change. This most generally involves identifying one character as an ancestral
state.
9.1.4 Controversy
Within the field of phylogenetic reconstruction and taxonomy there have, in the past, been two different ways and two
different philosophies to the process of reconstructing a phylogeny. The discussions between these groups about the best
ways to proceed have often been acrimonious and counter-productive.
One approach is the phenetic approach. In this approach, a tree is constructed by considering the phenotypic similarities
of the species without trying to understand the evolutionary pathways of the species. Since a tree constructed by this
method does not necessarily reflect evolutionary relationships but rather is designed to represent phenotypic similarity,
trees constructed via this method are called phenograms.
The second approach is called the cladistic approach. Via these methods, a tree is reconstructed by considering the various
possible pathways of evolution and choosing from amongst these the best possible tree. Trees reconstructed via these
methods are called cladograms.
The phenetic philosophy as a way to do taxonomy is definitely incorrect. However, this does not mean that phenetic
methods are necessarily poor estimates of the cladogram. For character data where ancestral forms are known, and for
constructing a taxonomic classification, the cladistic approach is almost certainly superior. However, the cladistic methods
are often difficult to implement, with assumptions that are not always satisfied by molecular data. The phenetic approaches
are generally faster algorithms and often have nicer statistical properties for molecular data. Hence, there appears to be a
place for both types of methods in the analysis of molecular sequence data.
This method (UPGMA, the unweighted pair group method with arithmetic averages) begins with the construction of a
distance matrix ($d_{ij}$). The two taxa that have the smallest distance are clustered together (assume that this is between
the i-th and j-th taxa) and form a new OTU. The branch lengths for the i-th and j-th taxa are taken to be half of the
distance between them (hence the depth of the branch between i and j is $d_{ij}/2$). A new distance matrix is constructed
that replaces all distances involving the i-th and j-th taxa with the average distance to these two. Thus, for the k-th taxon
its distance to the new (i, j) cluster is defined as $(d_{ik} + d_{jk})/2$. The branch length is taken to be the average
distance between the OTUs. Then again, the two taxa or OTUs with the smallest distance are clustered together. If the
smallest distance were between the k-th taxon and the new (i, j) cluster, the new distance to the l-th taxon is defined as
$(d_{il} + d_{jl} + d_{kl})/3$. In general if OTU i and OTU j are to be clustered then the new distance is
$d_{k(i,j)} = (T_i d_{ki} + T_j d_{kj})/(T_i + T_j)$ (where $T_i$ is the number of taxa in OTU i). This process continues
until all OTUs have been clustered together.
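A minimal UPGMA sketch in Python follows, using a hypothetical four-taxon distance matrix; it implements exactly the
size-weighted averaging $d_{k(i,j)} = (T_i d_{ki} + T_j d_{kj})/(T_i + T_j)$ described above and returns the clustering
as a nested tuple.

    def upgma(names, dist):
        """names: OTU labels; dist: symmetric matrix. Returns a nested-tuple tree."""
        clusters = [(n, 1) for n in names]              # (label, taxa in the OTU)
        d = [row[:] for row in dist]
        while len(clusters) > 1:
            # find the pair of OTUs with the smallest distance
            _, i, j = min((d[i][j], i, j) for i in range(len(clusters))
                          for j in range(i + 1, len(clusters)))
            (li, ti), (lj, tj) = clusters[i], clusters[j]
            keep = [k for k in range(len(clusters)) if k not in (i, j)]
            # distance from the new OTU to each remaining OTU, weighted by size
            newrow = [(ti * d[i][k] + tj * d[j][k]) / (ti + tj) for k in keep]
            clusters = [clusters[k] for k in keep] + [((li, lj), ti + tj)]
            d = [[d[a][b] for b in keep] for a in keep]
            for row, v in zip(d, newrow):
                row.append(v)
            d.append(newrow + [0.0])
        return clusters[0][0]

    dist = [[0, 2, 6, 10], [2, 0, 6, 10], [6, 6, 0, 10], [10, 10, 10, 0]]
    print(upgma(["A", "B", "C", "D"], dist))   # ('D', ('C', ('A', 'B')))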
Some data come naturally in the form of a distance between species. For example measures of DNA homology through
DNA hybridization / melting curves and measures from immunological data. For other forms of data, the distances are
calculated from sequences of characters. The reduction of this data from sequences to a single number obviously leads to
a loss of information. But you can gain a great deal from the speed and simplicity of these distance methods.
With distance methods it is generally assumed (whether intended or not) that the sum of the branch lengths in such trees
correlates directly with the expected phenotypic distance between taxa and furthermore that this corresponds to some
proportional measure of time. This is generally not a valid assumption. Hence corrections for distances and accurate
measures of the distance become very important.
This method obviously assumes that the taxa are all extant and that all rates of change are equal. This is an explicit
assumption of the method and yet we know of many examples where rates of evolution vary between taxa. Violation of
this assumption will cause the UPGMA algorithm to perform very poorly.
Another very popular distance method is the Neighbour Joining Method (Saitou and Nei 1987, Mol. Biol. Evol. 4:406).
This method attempts to correct the UPGMA method for its strong assumption that the same rate of evolution applies to
each branch. Hence this method yields an unrooted tree. A modified distance matrix is constructed to adjust for differences
in the rate of evolution of each taxon. Similar to the UPGMA method, the least distant pairs of nodes are linked and their
common ancestral node is added to the tree, their terminal nodes are pruned from the tree. This continues until only two
nodes remain.
The method begins by finding the modified matrix. To do this calculate the net difference of species i from all other taxa
as
$$r_i = \sum_k d_{ik}$$
and then compute the rate-corrected distance between each pair of taxa as
$$M_{ij} = d_{ij} - \frac{r_i + r_j}{n-2}$$
where n is the number of taxa. Saitou & Nei showed that this equation for $M_{ij}$ (modulo the addition of a constant) is the
sum of the least-squares estimates of branch lengths. The next step is to join the two nodes/taxa with the smallest $M_{ij}$ and
define the new branch lengths to this node, say u, as
$$l_{iu} = \frac{d_{ij}}{2} + \frac{r_i - r_j}{2(n-2)}, \qquad l_{ju} = d_{ij} - l_{iu},$$
with the distance from u to each remaining taxon k taken as $d_{ku} = (d_{ik} + d_{jk} - d_{ij})/2$.
Remove nodes i and j, decrease n by one and recalculate $r_i$, etc. This continues until only two nodes remain and these two
are linked with a branch length of $l_{ij} = d_{ij}$.
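The following Python fragment sketches a single round of these calculations (the net divergences $r_i$, the corrected
matrix $M_{ij}$, the choice of pair and the two new branch lengths) on a hypothetical five-taxon matrix; a full
implementation would then shrink the matrix using the $d_{ku}$ update and repeat.

    def nj_step(names, d):
        """One neighbour joining round: pick the pair minimizing M_ij and
        return that pair together with its two new branch lengths."""
        n = len(names)
        r = [sum(row) for row in d]                      # net divergences r_i
        _, i, j = min((d[i][j] - (r[i] + r[j]) / (n - 2), i, j)
                      for i in range(n) for j in range(i + 1, n))
        li = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2)) # branch from i to node u
        return names[i], names[j], li, d[i][j] - li

    d = [[0, 5, 9, 9, 8],
         [5, 0, 10, 10, 9],
         [9, 10, 0, 8, 7],
         [9, 10, 8, 0, 3],
         [8, 9, 7, 3, 0]]
    print(nj_step(["A", "B", "C", "D", "E"], d))         # ('A', 'B', 2.0, 3.0)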
Another common pairwise clustering algorithm is that due to Fitch and Margoliash (1967, Science 155: 279). This method
yields an unrooted tree and unlike the two previous methods it does not proceed by adding taxa one at a time to a growing
tree. Rather it has an optimum criterion that must be met. This method attempts to find that tree which minimizes the
following sum
$$\sum \frac{(d - d')^2}{d^2}$$
where d is the observed distance and d′ is the expected distance given some phylogeny and assuming additivity between all
the branch lengths. The details are not given here but are provided in the original paper or in the code and the documentation
provided by Dr. Felsenstein.
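The criterion itself is trivial to compute once the tree-implied distances are in hand; the sketch below assumes both the
observed and the tree-derived (expected) pairwise distances are already supplied as flat lists in the same pair order.

    def fm_criterion(observed, expected):
        """Fitch-Margoliash weighted least-squares score: sum of (d - d')^2 / d^2."""
        return sum((d - e) ** 2 / d ** 2 for d, e in zip(observed, expected))

    print(fm_criterion([2.0, 6.0, 6.0, 10.0], [2.1, 5.8, 6.2, 9.9]))

The hard part of the method is not this sum but the search over topologies and branch lengths that minimizes it.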
There are many other methods to reconstruct trees via distance measures. Distance methods are often preferred by people
who work in immunology, with frequency data or with data that has some imprecision in its definition. In addition,
almost all of these methods are very rapid and easily permit statistical tests such as bootstraps. These methods lose their
accuracy as the number of substitutions goes up, since the correction for multiple substitutions at a single site loses
precision. In this case, the distance methods will increasingly begin to generate less accurate trees. For this reason, with
very large trees (where the distance between the most diverged taxa is great) distance methods will do poorly in comparison
with methods that are more influenced by local topologies (Rice and Warnow, unpublished).
(Two trees for four taxa with the characters ATCG, ATCG, ACCG and ACCG (taxa 1 through 4) are drawn here: on the left
the tree ((1,2),(3,4)) with a single change marked by an asterisk; on the right the tree ((1,3),(2,4)).)
The tree on the left is the most parsimonious tree. It requires only a single evolutionary change (designated by the asterisk)
in the second site (a C to T transition). The tree on the right is not as parsimonious. It requires two evolutionary changes.
Hence the second tree would be rejected in favour of the first tree.
The principle of the maximum parsimony method is to infer the number of evolutionary events implied by a particular
topology and to choose a tree that requires the minimum number of these evolutionary events. In general this means
examining a large number of different topologies to search for those that have the minimum changes. For any particular
site there are several ways to determine the minimum number of evolutionary events. The Fitch (1971; Syst. Zool. 20:406-
416) parsimony criterion is a particularly easy way to count them for nucleotide or amino acid changes. For a particular
topology traverse toward the root of the tree. At each node, place the intersection set of the descendant nodes. If this set is
empty then place the union set at this node. Continue this for all sites and all nodes. The number of union sets equals the
number of events required.
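A small Python sketch of this counting pass follows; it represents a rooted bifurcating tree for one site as nested tuples
whose leaves are nucleotide states, and returns the number of union (that is, change) events.

    def fitch(tree):
        """Return (state set, changes) for a (left, right) nested-tuple tree
        whose leaves are single-character states (Fitch, 1971)."""
        if isinstance(tree, str):              # a leaf: its own state, no change
            return {tree}, 0
        (s1, c1), (s2, c2) = fitch(tree[0]), fitch(tree[1])
        if s1 & s2:
            return s1 & s2, c1 + c2            # non-empty intersection set
        return s1 | s2, c1 + c2 + 1            # empty: union set, one event

    # site 2 of the example above: ((1,2),(3,4)) needs one change,
    # ((1,3),(2,4)) needs two
    print(fitch((("T", "T"), ("C", "C")))[1])  # 1
    print(fitch((("T", "C"), ("T", "C")))[1])  # 2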
As another example consider the five primate species shown in Table 9.1. These are all old world monkeys. You can collect
the DNA sequences for their cytochrome oxidase subunit I genes (note that these sequences are too short to provide good
phylogenetic information; more sequence data for these species is available but for pedagogical purposes we use only this
Table 9.1: Results for all fifteen possible trees for five primate taxa: Cercopithecus aethiops, Nasalis larvatus, Presbytis
melalophos, Pygathrix nemaeus and Trachypithecus obscurus. (The table of the fifteen trees and their inferred substitution
counts is shown here.)
one gene here). Since there are only five species considered here, there are a total of 15 unrooted trees possible. Parsimony
ideally should consider each tree in turn and infer the minimum number of substitutions required to explain the sequence
data given the tree. All 15 trees are shown in Table 9.1. The one tree out of all 15 trees with the minimum number is the
most parsimonious tree. In this case it is tree #5. Note that in this case, the tree inferred by the neighbor joining method
(and the maximum likelihood method; this method as well as bootstrapping will be further discussed below) suggests a
different tree is correct. The neighbor joining algorithm suggests that tree #4 is the preferred tree and, unlike parsimony,
the algorithm does not score all possible trees but simply returns a single topology.
There are several problems with parsimony methods. First note that the most parsimonious tree may not be unique.
Consider the tree
(A rooted tree is drawn here with the taxa in the order 4, 3, 2, 1 and the same characters ACCG, ACCG, ATCG and ATCG.)
This tree is just as parsimonious as that given above and yet is quite different (so long as only rooted trees are considered).
A more serious problem deals with the statistics of these estimators. Suppose that you reconstructed a phylogeny based on
a large amount of sequence data and found the most parsimonious tree is one that requires 486 changes. The tree that you
prefer (for whatever reason) requires 484 changes. Why and/or when is one phylogeny better than another? This is quite
a thorny problem and one of active current research. The best sorts of methods (including methods with algorithms other
than parsimony) are those methods that will present you with a whole range of trees that are acceptable via some broad
criterion.
Different sites are said to be phylogenetically informative for the parsimony criterion if they provide information that
distinguishes between different topologies. Not all sites do this and these sites are, in effect, ignored by the method.
Consider characters that are not ordered and can arise via mutation from any other character (such as DNA nucleotides).
Then any character that exists uniquely (or locally uniquely) in one OTU is not phylogenetically informative. This is
because such a character can always be assumed to have arisen by a single substitution in the immediate branch leading to
the OTU in which the character exists. This change is therefore compatible with any topology. A site is phylogenetically
informative only when there are at least two different kinds of characters, each represented at least two times. (Remember
however, that ALL SITES provide information about the branch lengths - the restriction applies only to the topology.)
Note that there are several different kinds of parsimony and the Fitch criterion is only one. As another example, Dollo
parsimony is also commonly used. It assumes that derived states are irreversible. That is, a derived character state cannot
be lost and then regained. This criterion is most useful when discussing character data other than sequence data. For
example if states are complex phenotypes then it is reasonable to assume that these states can evolve only once. Hence, the
state can evolve only once while the state can be lost many times throughout evolution; it cannot be inferred to have evolved
twice. An example of such a state in sequence analysis would be restriction sites - these are easier to mutationally lose than
to mutationally create. Other parsimony criteria are relaxed Dollo, Wagner, Camin-Sokal, transversion, and generalized.
Many algorithms do not have a series of explicitly stated assumptions required in the derivation of the model and required
for its applicability. This is particularly the case with parsimony methods which are often said to be assumption “free”.
However, the lack of stated assumptions does not mean that no assumptions are necessary for the method to be valid. The
assumptions are implicit rather than explicit.
There is a strong bias in parsimony methods when some lineages have experienced rapid rates of change. While this is
true of many methods, parsimony methods are particularly sensitive. In general these long branches tend to “attract” each
other. Nor do parsimony methods necessarily lead to “correct” trees (nor, for that matter, does any other method). Prof.
Felsenstein provides an example of a comparison between four species. The “true” phylogeny is [(A,B),(C,D)], with A and
B most closely related and C and D most closely related. If B and D have a more rapid rate of evolution then parsimony
will usually generate a tree with [(A,C),(B,D)], with A and C most closely related and B and D related. Indeed after a
certain threshold of differential rates is passed, as more and more data are collected (more and more sequences added to
the database), parsimony becomes more and more certain that the “correct” tree is [(A,C),(B,D)]. Hence these methods
may not be consistent estimators of the phylogeny. Consistency is a term used in statistics that implies convergence of
an estimator to the true answer with increasing amounts of data. A maximum parsimony answer will however, converge
to a maximum likelihood answer when the rates of evolution along each branch are small (unfortunately this is not true
for most data sets). But then, maximum likelihood methods (and Bayesian methods) need not be consistent either. The
arguments as to which method is “best” continue (e.g. Kolaczkowski and Thornton, 2004 and compare with Gadagkar and
Kumar, 2005).
Parsimony does not require exact constancy of rates of change between branches if the number of substitutions per site is
small. If the number of changes per site is large then parsimony methods will make serious errors unless rates are constant
between branches. Furthermore, if the total sequence length examined is small and there are a large number of backward
and parallel substitutions (as in immunoglobulins) then parsimony has a high probability of producing an erroneous tree
even when substitution rates are constant between branches. Also, when the number of substitutions per site is small, a
large proportion of the substitutions are autapomorphic and uninformative for constructing a parsimonious tree. In this
case, a distance method may perform better since it uses all sites to compute distances.
The model of substitution gives the probability that nucleotide i will have been replaced by nucleotide j after time t as
$$P_{ij}^{(t)} = e^{-ut}\,\delta_{ij} + (1 - e^{-ut})\,g_j$$
where u is the substitution rate, $\delta_{ij} = 1$ if i = j and $\delta_{ij} = 0$ otherwise, and $g_j$ is the equilibrium
frequency of nucleotide j (Felsenstein, 1981).
The likelihood that some site is in state i at the k-th node of an evolutionary tree can be designated by $L_i^{(k)}$. This
likelihood can be calculated in a recursive fashion. As an example, consider a simple bifurcating tree with two branches #1
and #2, and with one root node, #3. The time between node #3 and node #1 is t and the time separating node #3 and #2
is t′. With these definitions the likelihood of having the i-th nucleotide at node #3 in an evolutionary tree can be found as
$$L_i^{(3)} = \left(\sum_{j=1}^{4} P_{ij}^{t} L_j^{(1)}\right)\left(\sum_{k=1}^{4} P_{ik}^{t'} L_k^{(2)}\right)$$
(Felsenstein, 1981). The terms $L_j^{(1)}$ and $L_k^{(2)}$ designate the likelihoods of states j and k in the nodes or taxa #1 and #2. If nodes
#1 and #2 designate extant species then these likelihoods are known explicitly. The likelihoods are either 1 or 0 depending
on whether the extant species does or does not have that nucleotide at that particular site. In words, you calculate the
probability that the descendant would end up having nucleotide j given that t generations in the past it had nucleotide i
and multiply this by the likelihood that the descendant had nucleotide j. Sum this for all possible nucleotides and do the
same for the other branch on the tree. Take the product of the two to give you the likelihood of the tree up to this point.
This information determines $L_i^{(3)}$. In more complicated phylogenies with more than two species the likelihoods of interior
nodes can be calculated in a similar fashion, recursively. In this case identify node #3 with a more ancient bifurcation and
nodes #1 and #2 with bifurcations that in turn give rise to more species. Begin at the tips of the phylogeny and move down
the tree one node at a time. Each successive step uses the likelihoods just calculated (such as the value determined for $L_i^{(3)}$)
to find the likelihood of the next node. The likelihood of every state is calculated for every node using those likelihoods
calculated for the previous nodes. This continues until the root of the tree is reached and then the overall likelihood is found
by summing the products of the root likelihoods with the prior probabilities of each state. Without any further information,
the prior probabilities of each state are usually taken to be their equilibrium frequencies.
Since each site evolves independently, the likelihood of a phylogeny can be calculated separately for each site. The product
of the likelihoods for each site provides the overall likelihood of the observed data. To maximize the likelihood different
values of ut are analyzed until a set of branch lengths/substitution rates are found which provide the highest likelihood of
observing the actual sequences. Finally different tree topologies are searched to find the best one.
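The recursion is compact enough to sketch directly. The Python fragment below computes the likelihood of a single site
under the substitution model given earlier (equal equilibrium frequencies are assumed for simplicity); branch lengths are the
products ut, and the small example tree is hypothetical.

    from math import exp

    BASES = "ACGT"
    g = {b: 0.25 for b in BASES}        # assumed equilibrium frequencies

    def p(i, j, ut):
        """P_ij(t) = exp(-ut) delta_ij + (1 - exp(-ut)) g_j."""
        return exp(-ut) * (i == j) + (1 - exp(-ut)) * g[j]

    def conditional(tree):
        """{i: L_i} for a subtree; a tree is either a leaf base or a pair
        ((left, t_left), (right, t_right)) of child subtrees and branch lengths."""
        if isinstance(tree, str):       # an extant species: likelihood 1 or 0
            return {i: float(i == tree) for i in BASES}
        (left, tl), (right, tr) = tree
        Ll, Lr = conditional(left), conditional(right)
        return {i: sum(p(i, j, tl) * Ll[j] for j in BASES)
                 * sum(p(i, k, tr) * Lr[k] for k in BASES) for i in BASES}

    def site_likelihood(tree):
        """Sum the root conditionals weighted by the prior (equilibrium) frequencies."""
        L = conditional(tree)
        return sum(g[i] * L[i] for i in BASES)

    inner = (("A", 0.1), ("C", 0.1))            # two tips joined at node #3
    print(site_likelihood(((inner, 0.2), ("A", 0.3))))

Because sites are independent, the log-likelihood of a whole alignment is just the sum of the logs of such per-site values.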
Note that a likelihood is not quite the same thing as the probability of observing the given sequences, nor is a likelihood
the same thing as a probability. For example, a set of maximum likelihoods need not sum to one. In general, you would
normally have the probability of some observed data as a function of some parameter (here the parameters are the branch
lengths/substitution rates). The likelihood function turns this relationship around. Instead of considering this to be a set of
probabilities for alternative observations given some parameter, it considers the data as fixed and the likelihood as a function
of the parameters. For more information on the powerful abilities of likelihood methods consult a text on probability.
Whatever the correct topology is, one of these equations should be different from zero and the other two should be
equal to zero (or close to it due to random events). The significance of all of the scores can be tested via a Chi-square or
via an exact binomial test.
For the topology ((A, B), (C, D)), each invariant is a sum and difference of the observed frequencies of four nucleotide
patterns across the four taxa (with X, Y, Z and W standing for the distinct nucleotide classes); the full set of these pattern
combinations, each of which should equal zero for the correct topology, is tabulated here.
While this method was suggested for only four species there have been several extensions suggested to develop it further
and apply it to more than four species. Similarly there are extensions to consider not only transversions but to consider
in complete generality all four nucleotides. Steel and Fu (1995 J Comput Biol 2:39), Fu (1995 J. Theo. Biol. 173: 339),
Cavender (1991 Math Biosci 103:69), Felsenstein (1991 J. Theo. Biol. 152:357), and Sankoff (1990 Mol. Biol. Evol.
7:255), have developed quadratic and higher order invariants (or in Sankoff’s words “made to order invariants”). These
extensions promise that invariants will be a very useful tool in the future since these methods are dependent only on the
branching order.
Since there are only three possible unrooted trees for four species, the total number of trees that need be constructed is
only $3\binom{n}{4}$ for n taxa. This step is therefore quite feasible even for large n. For example, with n = 20 there are
only $3\binom{20}{4} = 14535$ trees that need be constructed.
In principle, these quartets should uniquely determine the topology of the tree. For a tree with six taxa A, B, C, D, E and F
(an unrooted six-taxon tree is drawn here) there are $\binom{6}{4} = 15$ quartets: the sets of four taxa {A,B,C,D},
{A,B,C,E}, {A,B,C,F}, {A,B,D,E}, {A,B,D,F}, {A,B,E,F}, and so on.
In practice the various quartets may not agree with each other and some method must be chosen to weight their suggested
topologies. Strimmer and von Haeseler (1996) used a method that weighted the three topologies (1,0,0), (0,1,0) or (0,0,1).
They then chose four taxa at random, and began adding taxa one at a time to this four-taxon tree according to the quartets.
As an example, if there are four taxa (A, B, C, D) initially and if taxon E has a quartet such that ((A,B),(C,E)) then E should
not be placed on a branch leading to A or B. If it has a quartet ((A,D),(B,E)) then it should not be placed on a branch leading
to A or D. Running through all quartets containing E, a score is kept for all branches and the branch point with the minimum
score is chosen as the branch point to place taxon E. The next taxon is chosen and treated in the same way and then the
next taxon.
The order in which the taxa are added and the initial taxa chosen to start the process will critically influence the resulting
tree. To prevent any bias due to the order, this whole process is done multiple times with random choices for the order
of taxa. A majority rule consensus tree is then chosen as the final tree. This also means that a measure of variability is
immediately available in the form of how many times a particular group of taxa branched together. Note that this measure
is not the same as a bootstrap value and does not necessarily have the same statistical properties.
The quartet methods are useful for their comparative speed. A maximum likelihood algorithm can be applied with this
algorithm to problems that would otherwise not be feasible. As a result, Strimmer and von Haeseler were able to show that
this method obtained results as good as neighbor joining when the data was well behaved and results better than neighbor
joining when the data had large variations in branch length (a situation where likelihoods are known to do better). The
method performed only slightly worse than Felsenstein’s complete maximum likelihood method.
Quartet methods are also interesting in their ability to separate each of the individual steps and to then easily permit the
incorporation of improvements in each step. For example, the original quartets can be constructed via any algorithm. The
algorithm known to give the best results for a particular data set can then be chosen (or one known to be most robust under
the broadest variety of circumstances). The construction of the tree from the quartets is a completely separate step that
can be optimized as well. In a subsequent paper Strimmer, Goldman and von Haeseler (1997; Mol. Biol. Evol. 14:210-
211) study the influences of different weights for each quartet and develop a discrete weighting which is both efficient
Figure 9.2: A quartet puzzling tree for eight NAD5 mitochondrial proteins. These are three of the 15 trees that quartet
puzzling considers unresolved for this data.
and improves the accuracy of the trees reconstructed. There are similarly a variety of methods possible to reconstruct
consensus trees from multiple trees.
interested in. If theoretical knowledge of the statistics is available then it should be used in preference to a bootstrap. If the
data itself is biased, then bootstraps tend to exaggerate this bias and again bootstraps should either not be used or corrected
for the suspected bias. The bootstrap is advantageous when there is no knowledge of the true statistical properties such as
when the underlying distribution is unknown. This is the case in phylogenetic studies.
To illustrate, consider the problem of estimating 95% confidence limits on the mean. Suppose we draw 100 samples from
a population with a underlying normal distribution with mean µ = 5 and standard deviation σ = 10 (a graph of this data
is given in Figure 9.3). Due to sample effects, the mean of this data is x̄ = 4.3 and its standard deviation is s = 9.15.
To estimate 95% confidence limits for the mean, we then sample with replacement each value of x until a hundred new
samples have been drawn. The statistic of interest (mean, median, confidence limits, or whatever else is of interest) is
recalculated for this data set and then the whole process is repeated (here we will repeat it ten times but in practice
bootstraps must be repeated with much larger sample numbers; on the order of 1000 or more). The statistics from ten
potential samples from the original n = 100 data set are shown in the accompanying Table 9.2. For each repeated sample
statistics are calculated just as one would traditionally with the original data set. These can then be combined to yield
bootstrapped estimates. Bootstrap statistics allow an alternate estimate of the 95% confidence interval. For example, in
this case, the 10% confidence limit on the mean would be calculated from the values in Table 9.2 and would be 3.0 and the
90% confidence limit on the mean would be 5.5 (of course based on such a limited sample size of 10, neither is a useful
estimate).
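A Python sketch of this resampling scheme follows, with 1000 replicates and the simple percentile method for the 95%
interval (the seed and the simulated sample are of course arbitrary).

    import random
    random.seed(1)

    data = [random.gauss(5, 10) for _ in range(100)]     # the original sample
    means = []
    for _ in range(1000):
        resample = [random.choice(data) for _ in data]   # draw with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    print("sample mean:", sum(data) / len(data))
    print("95% percentile interval:", means[24], means[974])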
Felsenstein first recommended applying the bootstrap method to phylogenies. A useful review of this and other methods
of assessing the reliability of phylogenies is given in J. Felsenstein Annu. Rev. Genet. 22: 521-565, 1988. Again, the idea
is to resample the sequence data (with replacement) site by site to construct a new sequence data set of the same length as
the original and then to estimate a set of bootstrap trees. The original sequences (x) are used to estimate a distance matrix
(D) by some method that measures differences between sequence pairs. This distance matrix is converted into a tree by an
algorithm that connects sequence pairs into an unrooted, bifurcating tree (T ). Alternatively the tree (T ) can be obtained
directly from the sequences by parsimony (or your method of choice). Felsenstein’s method is to randomly sample sites
(columns of sequence set x) from the sequence data with replacement to form a bootstrap data set ($x^{\ast}$; Figure 9.4). The
original algorithm is then applied to this data to yield a bootstrap tree ($T^{\ast}$). Repeated bootstrap samples yield a set of
bootstrap trees $T^1, \ldots, T^n$. These trees are derived from sequences containing representative sites sampled from the actual data.
The assumption of this method is that the sampled sites are independent of one another and representative of what the
evolutionary process would produce if repeated.
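In code, one bootstrap replicate amounts to drawing alignment columns with replacement, as in this short sketch (the toy
alignment is hypothetical; the tree-building step that would follow is whatever method was used on the original data).

    import random
    random.seed(7)

    alignment = ["ACGTACGT",        # one aligned sequence per taxon
                 "ACGTACCT",
                 "ACCTATGT"]
    length = len(alignment[0])
    cols = [random.randrange(length) for _ in range(length)]   # resampled sites
    bootstrap = ["".join(seq[c] for c in cols) for seq in alignment]
    print(bootstrap)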
There are many ways in which the bootstrap set of trees could be used to answer questions about significance. Felsenstein
suggested that the significance of phylogenetic relationships could be assessed from their frequency of occurrence in
the bootstrap set T 1 , T 2 , . . . T n . More specifically, suppose one wanted to know if a subgroup of taxa (called G) were
monophyletic (exclusively comprised of descendants of a common ancestor). We determine the fraction of trees in the
bootstrap set T 1 , T 2 , . . . T n in which G is, in fact, monophyletic (call this FG ). Obviously if FG is small there is little
support form monophyly while if F G is close to 100% we feel that a monophyletic grouping is more likely. Felsenstein
recommended that FG ≥ 95% be considered as significant support of a monophyletic relationship.
Analysis of NAD5 mitochondrial genes illustrates this method of using bootstrap trees to determine phylogenetic signifi-
cance. NAD5 is one of the larger proteins encoded within the mitochondria and hence it is easy to obtain the DNA and to
sequence this gene. Phylogenetic relationships for four diverged examples of the NAD5 sequences were determined. Boot-
strap DNA sequences (PHYLIP: seqboot) were used to determine pairwise distance matrices (PHYLIP: dnadist, Kimura
2-parameter, T s/T v = 2.0) and the Fitch-Margoliash, least squares method (PHYLIP: fitch) for merging taxa was then
Figure 9.5: The consensus of 500 bootstrapped trees based on NAD5 gene sequences. (The unrooted tree drawn here groups
brine shrimp with chicken and bee with nematode; the central branch is supported in 500 of the 500 bootstrap trees.)
used to make a set of 500 bootstrap trees. The consensus of all 500 of these trees grouped the brine shrimp with the chicken
and grouped the bee with the nematode (Figure 9.5). Hence, in this case the support is clear and unequivocal for this one
of the three possible relationships.
Although bootstrap methods are widely used, there is considerable debate over the measurement of the actual level of
statistical validity. Many have criticized that the 95% criterion is too conservative and accept smaller $F_G$ values (e.g.
≥ 80%) as indicating significant monophyletic grouping. Support for this less conservative interpretation of $F_G$ comes
from simulation and theoretical work. However, some considerations in favor of a more conservative approach are the
following.
1. The consensus tree itself is often used to suggest monophyletic groups. Thus, these are not tested a priori. It is
possible that phylogenetically uninformative data may sometimes generate groupings by chance. Since it is not clear
how often this could happen, it is best to be conservative and demand a larger value of $F_G$.
2. The consensus tree effectively tests all possible groups formed from the set of taxa. A few such groups could reach
high $F_G$ just by chance.
3. If the group G is actually monophyletic, the value of $F_G$ estimates the probability that it would appear as mono-
phyletic in sequence data of the type obtained. It is not a confidence interval. It determines the stability of the
phylogenetic relationship if more sequence data of the same type were to be accumulated.
4. Bootstrap statistics are assumed to vary smoothly and continuously in sample space. However, a taxonomic group
can either be monophyletic or not monophyletic. Methods to extend the bootstrap to such discrete, two-valued
statistics have not been developed.
5. The bootstrap method assumes that sites are independent and that a sufficient number have been sampled to give a
complete representation of the evolutionary process.
Finally, the bootstrap trees are only as good as the method that generates them. Again, if the method is biased, the trees will
not be representative examples of evolution. In particular, the trees generated for the NAD5 genes are certainly biased (and
were purposely used to illustrate this point – no, bees and nematodes are not close relatives). The DNA distance method
not only did not account for rate variation among sites (e.g. first, second and third codon positions), but also ignored
large base composition differences between sequence pairs. The substitution model for DNA distance assumes that all
sequences are subject to the same mutational forces leading to equilibrium nucleotide frequencies of 25% for each of the
four nucleotides. This is certainly not the case for the NAD5 genes. Thus the method artificially makes sequences of similar
nucleotide composition (e.g. bee and nematode) closer because the expected number of substitutions is underestimated.
The significant grouping of these taxa is entirely a result of this bias. Indeed, otherwise this would provide strong evidence
that the brine shrimp and the bee do not form a monophyletic grouping commonly known as arthropods. Steel et al. Nature
364: 440-442, 1993 have discussed a randomization test that corrects for such nucleotide biases and can be used to show
that the evidence for the association displayed in Figure 9.5 is due to the nucleotide bias and is not an accurate phylogenetic
reconstruction.
9.7 Warnings
Remember that each of these methods has its advantages and its disadvantages. They provide estimates of what the
phylogenetic history of the sample may be like - they do not provide “truth”. When you run an algorithm for your data set,
consider this simply as the starting point of your analysis.
There are several approaches that can be taken to begin an in depth phylogenetic analysis of your data. These are a few
suggestions but they are not exhaustive - for different data sets additional steps should be taken.
1. The first rule that should be followed is to apply several different algorithms to your data set. Each one will provide
a different picture of the phylogenetic history reflecting the assumptions of the methods.
2. Your data should be bootstrapped or jackknifed to sample your data. These are techniques to create new data sets
either by sampling with replacement from the original set or, in the case of a jackknife, by successively dropping
individual data points. They will help to determine how sensitive the phylogenetic history is to changes in the data
set. The actual statistics for these cases are non-standard and difficult to calculate (i.e. 95% bootstrap support does
not necessarily imply that one should expect a 95% probability that that clade is correct) but it will provide a rough
measure of variability.
3. If the data and tree inference technique were ideal, analysing any two subsets of taxa would yield congruent trees
(i.e., the trees would be identical after pruning taxa absent from one or both trees). Try this and see what happens
for different subsets.
4. In this regard, if the tree changes dramatically when a single OTU is dropped this is usually an indication that that
OTU is causing systematic errors (such as would be caused by a significantly different rate of change).
5. Worry about long unbranched lineages and any subtrees on either side of long branches. Long branches tend to
attract each other !!!
6. Remember that these are gene trees and hence the trees from different genes may or may not be the same. If your
taxa are each sufficiently diverged then the trees should be similar. If not then check for non-orthologous genes,
check for lateral gene transfer or for other events that would cause systematic errors.
7. Always include more than one outgroup taxon. In this way you can check that the outgroups are indeed “out”.
8. If possible choose your outgroup species such that they are evenly spaced on the tree. You will obtain more reliable
information from these. Two outgroups that are closely related to each other will not add much information.
9. Even if you are interested in the relationships of just a few taxa it is best to include as many intermediate taxa as
possible. These will help to highlight the multiple substitutions that confound any analysis.
10. Others have suggested that because large branch lengths confound many methods, one should limit an analysis to
those sequence regions that exclude the most variable positions. (I personally disagree with this rule of thumb but
hey ..!)
– Hennig86, MEGA, Tree Gardener, RA, Nona, PHYLIP, TurboTree, Freqpars, Fitch programs, CAFCA, Phylo win, sog,
gmaes, LVB, GeneTree, TAAR, ARB, DAMBE, MALIGN, POY, DNASEP, SEPAL, Gambit, TNT, GelCompar II,
Bionumerics, TCS

• Distance matrix methods: PHYLIP, PAUP*, MEGA, MacT, ODEN, Fitch programs, ABLE, TREECON, DISPAN,
RESTSITE, NTSYSpc, METREE, TreePack, TreeTree, GDA, Hadtree/Prepare/Trees, GCG Wisconsin Package, SeqPup,
PHYLTEST, Lintre, WET, Phylo win, njbafd, Gambit, gmaes, DENDRON, Molecular Analyst Fingerprinting, BIONJ,
TFPGA, MVSP, SOTA, ARB, BIOSYS-2, Darwin, T-REX, sendbs, nneighbor, DAMBE, weighbor, QR2, DNASIS,
minspnet, PAL, Arlequin, vCEBL, HY-PHY, Vanilla, GelCompar II, Bionumerics, qclust, TCS

• Computation of distances: PHYLIP, PAUP*, RAPDistance, MULTICOMP, MARKOV, RSVP, Microsat, DIPLOMO,
OSA, DISPAN, RESTSITE, NTSYSpc, TREE-PUZZLE, Hadtree/Prepare/Trees, GCG Wisconsin Package, AMP, GCUA,
DERANGE2, POPGENE, TFPGA, REAP, MVSP, SOTA, RSTCALC, Genetix, BIOSYS-2, RAPD-PCR package,
DISTANCE, Darwin, sendbs, K2WuLi, GeneStrut, Arlequin, DAMBE, DnaSP, PAML, puzzleboot, MATRIX, PAL,
Sequencer, Vanilla, GelCompar II, Bionumerics, qclust

• Maximum likelihood and related methods: PHYLIP, PAUP*, fastDNAml, MOLPHY, PAML, Spectrum, SplitsTree,
PLATO, TREE-PUZZLE, Hadtree/Prepare/Trees, SeqPup, Phylo win, PASSML, ARB, Darwin, BAMBE, DAMBE,
Modeltest, TreeCons, VeryfastDNAml, PAL, dnarates, TrExMl, HY-PHY, Vanilla, MEGA, Bionumerics, fastDNAmlRev,
RevDNArates, rate-evolution, MrBayes, CONSEL

• Quartets methods: TREE-PUZZLE, STATGEOM, SplitsTree, PHYLTEST, GEOMETRY, PICA95, Darwin, PhyloQuart,
Willson quartets programs, Gambit

• Artificial-intelligence methods: SOTA

• Invariants (or Evolutionary Parsimony) methods: PHYLIP, PAUP*

• Interactive tree manipulation: MacClade, PHYLIP, PDAP, TreeTool, ARB, WINCLADA, TreeEdit, UO, TreeExplorer,
TreeThief, RadCon, Mavric

• Looking for hybridization or recombination events: PLATO, Bootscanning Package, TOPAL, reticulate, RecPars,
partimatrix, homoplasy test, LARD, Network, TCS

• Bootstrapping and other measures of support: PHYLIP, PAUP*, PARBOOT, ABLE, Random Cladistics, AutoDecay,
TreeRot, RASA, DNA Stacks, OSA, DISPAN, TreeTree, PHYLTEST, Lintre, sog, njbafd, MEGA, PICA95, ModelTest,
TAXEQ2, BIOSYS-2, RAPD-PCR package, TreeCons, BAMBE, DAMBE, puzzleboot, CodonBootstrap, DNASEP, SEPAL,
Gambit, MEAWILK, TrExMl, Sequencer, PAL, PHYCON, MrBayes, CONSEL

• Compatibility analysis: COMPROB, PHYLIP, PICA95, reticulate, partimatrix, SECANT, CLINCH, MEAWILK
9.9 PHYLIP
This is the package of programs distributed by Professor Felsenstein. It is distributed free and Joe is a very friendly
character and can help with whatever problem you might have (but carefully read the documentation before contacting
him). I have reproduced parts of the documentation here but I urge you to get your own copy of the programs so that
Dr. Felsenstein can know how many copies are out there and can update/modify programs etc. Also if you do use these
programs in a publication you must quote Dr. Felsenstein (the same applies for any other program obtained from the
file servers). Remember that the value of a scientist’s work is often measured by quotations and if you use someone’s
programming work you should quote it just as you would quote their experimental work.
The PHYLIP package is distributed for free. Programs are written in a standard subset of “C” and the source code is
provided with the package. You can reach Dr. Felsenstein at [email protected] and the complete package can
be obtained from http://evolution.genetics.washington.edu/phylip.html.
On the following pages you will find extracts of the documentation for the PHYLIP package of programs. The complete
documentation is not reproduced - you should get your own official copy.
PROTPARS
Estimates phylogenies from protein sequences (input using the
standard one-letter code for amino acids) using the parsimony
method, in a variant which counts only those nucleotide changes
that change the amino acid, on the assumption that silent changes
are more easily accomplished.
DNAPARS
Estimates phylogenies by the parsimony method using nucleic acid
sequences. Allows use of the full IUB ambiguity codes, and estimates
ancestral nucleotide states. Gaps treated as a fifth nucleotide
state. Can use 0/1 weights, reconstruct ancestral states, and
infer branch lengths.
DNAMOVE
Interactive construction of phylogenies from nucleic acid sequences,
with their evaluation by parsimony and compatibility and the display
of reconstructed ancestral bases. This can be used to find parsimony
or compatibility estimates by hand.
DNAPENNY
Finds all most parsimonious phylogenies for nucleic acid sequences
by branch-and-bound search. This may not be practical (depending
on the data) for more than 10 or 11 species.
DNACOMP
Estimates phylogenies from nucleic acid sequence data using the
compatibility criterion, which searches for the largest number of
sites which could have all states (nucleotides) uniquely evolved
on the same tree. Compatibility is particularly appropriate when
sites vary greatly in their rates of evolution, but we do not know
in advance which are the less reliable ones.
DNAINVAR
For nucleic acid sequence data on four species, computes Lake’s
and Cavender’s phylogenetic invariants, which test alternative
tree topologies. The program also tabulates the frequencies of
occurrence of the different nucleotide patterns. Lake’s invariants
are the method which he calls "evolutionary parsimony".
DNAML
Estimates phylogenies from nucleotide sequences by maximum
likelihood. The model employed allows for unequal expected
frequencies of the four nucleotides, for unequal rates of
transitions and transversions, and for different (prespecified)
rates of change in different categories of sites, with the program
inferring which sites have which rates. It also allows different
rates of change at known sites.
DNAMLK
Same as DNAML but assumes a molecular clock. The use of the two
programs together permits a likelihood ratio test of the molecular
clock hypothesis to be made.
PROML
Estimates phylogenies from protein amino acid sequences by maximum
likelihood. The PAM or JTTF models can be employed. The program can
allow for different (prespecified) rates of change in different
categories of amino acid positions, with the program inferring
which positions have which rates. It also allows different rates
of change at known sites.
DNADIST
Computes four different distances between species from nucleic
acid sequences. The distances can then be used in the distance
matrix programs. The distances are the Jukes-Cantor formula,
one based on Kimura’s 2-parameter method, Jin and Nei’s distance
which allows for rate variation from site to site, and a maximum
likelihood method using the model employed in DNAML. The latter
method of computing distances can be very slow.
PROTDIST
Computes a distance measure for protein sequences, using maximum
likelihood estimates based on the Dayhoff PAM matrix, Kimura’s 1983
approximation to it, or a model based on the genetic code plus a
constraint on changing to a different category of amino acid. Rate
variation from site to site is also allowed. The distances can be
used in the distance matrix programs.
RESTDIST
Distances calculated from restriction sites data or restriction
fragments data. The restriction sites option is the one to use to
also make distances for RAPDs or AFLPs.
RESTML
Estimation of phylogenies by maximum likelihood using restriction
sites data (not restriction fragments but presence/absence of
individual sites). It employs the Jukes-Cantor symmetrical model
of nucleotide change, which does not allow for differences of rate
between transitions and transversions. This program is very slow.
SEQBOOT
Reads in a data set, and produces multiple data sets from it by
bootstrap resampling. Since most programs in the current version
of the package allow processing of multiple data sets, this can
be used together with the consensus tree program CONSENSE to do
bootstrap (or delete-half-jackknife) analyses with most of the
methods in this package. This program also allows the Archie/Faith
technique of permutation of species within characters.
FITCH
Estimates phylogenies from distance matrix data under the "additive
tree model" according to which the distances are expected to
equal the sums of branch lengths between the species. Uses the
Fitch-Margoliash criterion and some related least squares criteria.
Does not assume an evolutionary clock. This program will be useful
with distances computed from molecular sequences, restriction
sites or fragments distances, with DNA hybridization measurements,
and with genetic distances computed from gene frequencies.
KITSCH
Estimates phylogenies from distance matrix data under the
"ultrametric" model which is the same as the additive tree model
except that an evolutionary clock is assumed. The Fitch-Margoliash
criterion and other least squares criteria are assumed. This program
will be useful with distances computed from molecular sequences,
restriction sites or fragments distances, with distances from DNA
hybridization measurements, and with genetic distances computed
from gene frequencies.
NEIGHBOR
An implementation by Mary Kuhner and John Yamato of Saitou and
Nei’s "Neighbor Joining Method," and of the UPGMA (Average Linkage
clustering) method. Neighbor Joining is a distance matrix method
producing an unrooted tree without the assumption of a clock. UPGMA
does assume a clock. The branch lengths are not optimized by the
least squares criterion but the methods are very fast and thus
can handle much larger data sets.
CONTML
Estimates phylogenies from gene frequency data by maximum
likelihood under a model in which all divergence is due to
genetic drift in the absence of new mutations.
GENDIST
Computes one of three different genetic distance formulas from
gene frequency data. The formulas are Nei’s genetic distance,
the Cavalli-Sforza chord measure, and the genetic distance of
Reynolds et al. The first is appropriate for data in which new
mutations occur in an infinite isoalleles neutral mutation model,
the latter two for a model without mutation and with pure genetic
drift. The distances are written to a file in a format appropriate
for input to the distance matrix programs.
CONTRAST
Reads a tree from a tree file, and a data set with continuous
characters data, and produces the independent contrasts for those
characters, for use in any multivariate statistics package. Will
also produce covariances, regressions and correlations between
characters for those contrasts. Can also correct for within-species
sampling variation when individual phenotypes are available within
a population.
PARS
Multistate discrete-characters parsimony method. Up to 8 states
(as well as "?") are allowed. Cannot do Camin-Sokal or Dollo
Parsimony. Can reconstruct ancestral states, use character weights,
and infer branch lengths.
MIX
Estimates phylogenies by some parsimony methods for discrete
character data with two states (0 and 1). Allows use of the Wagner
parsimony method, the Camin-Sokal parsimony method, or arbitrary
mixtures of these. Also reconstructs ancestral states and allows
weighting of characters (does not infer branch lengths).
MOVE
Interactive construction of phylogenies from discrete character data
with two states (0 and 1). Evaluates parsimony and compatibility
criteria for those phylogenies and displays reconstructed states
throughout the tree. This can be used to find parsimony or
compatibility estimates by hand.
PENNY
Finds all most parsimonious phylogenies for discrete-character data
with two states, for the Wagner, Camin-Sokal, and mixed parsimony
criteria using the branch-and-bound method of exact search. May
be impractical (depending on the data) for more than 10-11 species.
DOLLOP
Estimates phylogenies by the Dollo or polymorphism parsimony
criteria for discrete character data with two states (0 and
1). Also reconstructs ancestral states and allows weighting
of characters. Dollo parsimony is particularly appropriate for
restriction sites data; with ancestor states specified as unknown
it may be appropriate for restriction fragments data.
DOLMOVE
Interactive construction of phylogenies from discrete character
data with two states (0 and 1) using the Dollo or polymorphism
parsimony criteria. Evaluates parsimony and compatibility criteria
for those phylogenies and displays reconstructed states throughout
the tree. This can be used to find parsimony or compatibility
estimates by hand.
DOLPENNY
Finds all most parsimonious phylogenies for discrete-character
data with two states, for the Dollo or polymorphism parsimony
criteria using the branch-and-bound method of exact search. May
be impractical (depending on the data) for more than 10-11 species.
CLIQUE
Finds the largest clique of mutually compatible characters, and
the phylogeny which they recommend, for discrete character data
with two states. The largest clique (or all cliques within a given
size range of the largest one) are found by a very fast branch and
bound search method. The method does not allow for missing data. For
such cases the T (Threshold) option of PARS or MIX may be a useful
alternative. Compatibility methods are particularly useful when
some characters are of poor quality and the rest of good quality,
but when it is not known in advance which ones are which.
FACTOR
Takes discrete multistate data with character state trees and
produces the corresponding data set with two states (0 and 1).
Written by Christopher Meacham. This program was formerly used to
accommodate multistate characters in MIX, but this is less necessary
now that PARS is available.
DRAWGRAM
Plots rooted phylogenies, cladograms, and phenograms in a wide
variety of user-controllable formats. The program is interactive
and allows previewing of the tree on PC or Macintosh graphics
screens, and Tektronix or Digital graphics terminals. Final output
can be to a file formatted for one of the drawing programs, on a
laser printer (such as Postscript or PCL-compatible printers), on
graphics screens or terminals, on pen plotters (Hewlett-Packard or
Houston Instruments) or on dot matrix printers capable of graphics
(Epson, Okidata, Imagewriter, or Toshiba).
DRAWTREE
Similar to DRAWGRAM but plots unrooted phylogenies.
TREEDIST
Computes the Robinson-Foulds symmetric difference distance between
trees, which allows for differences in tree topology (but does
not use branch lengths).
CONSENSE
Computes consensus trees by the majority-rule consensus tree method,
which also allows one to easily find the strict consensus tree. Is
not able to compute the Adams consensus tree. Trees are input in a
tree file in standard nested-parenthesis notation, which is produced
by many of the tree estimation programs in the package. This program
can be used as the final step in doing bootstrap analyses for many
of the methods in the package.
RETREE
Reads in a tree (with branch lengths if necessary) and allows
you to reroot the tree, to flip branches, to change species names
and branch lengths, and then write the result out. Can be used to
convert between rooted and unrooted trees.
When you run most of these programs, a menu will appear offering you
choices of the various options available for that program. The data that the
program reads should be in an input file called (in most cases) "infile". If
there is no such file the programs will ask you for the name of the input file.
Below we describe the input file format, and then the menu.
6 13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC
The first line of the input file contains the number of species and the
number of characters, in free format, separated by blanks (not by
commas). The information for each species follows, starting with a
ten-character species name (which can include punctuation marks and blanks),
and continuing with the characters for that species. In the
discrete-character, DNA and protein sequence programs the characters are each a
single letter or digit, sometimes separated by blanks. In
the continuous-characters programs they are real numbers with decimal points,
separated by blanks:
The conventions about continuing the data beyond one line per species are
different between the molecular sequence programs and the others. The
molecular sequence programs can take the data in "aligned" or "interleaved"
format, with some lines giving the first part of each of the sequences, then
lines giving the next part of each, and so on. Thus the sequences might look
like this:
6 39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC
TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
Note that in these sequences we have a blank every ten sites to make them
easier to read: any such blanks are allowed. The blank line which separates
the two groups of lines (the ones containing sites 1-20 and ones containing
sites 21-39) may or may not be present, but if it is, it should be a line of
zero length and not contain any extra blank characters (this is because of a
limitation of the current versions of the programs). It is important that the
number of sites in each group be the same for all species (i.e., it will not be
possible to run the programs successfully if the first species line contains 20
bases, but the first line for the second species contains 21 bases).
In the sequential format, the character data can run on to a new line at
any time (except in a species name or in the case of continuous character and
distance matrix programs where you cannot go to a new line in the middle of a
real number). Thus it is legal to have:
Archaeopt 001100
1101
or even:
Archaeopt
0011001101
though note that the FULL ten characters of the species name MUST then be
present: in the above case there must be a blank after the "t". In all cases
it is possible to put internal blanks between any of the character values, so
that

Archaeopt 0011 0011 01

is allowed.
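Because the fixed ten-character name field is a common source of input errors, it can help to generate the infile with
a short script rather than by hand. The following is only an illustrative sketch (the species and sequences are made up);
it writes the sequential format described above:

    # write_infile.py -- write aligned sequences in PHYLIP sequential format
    # (hypothetical example data; names are padded/truncated to 10 characters)
    seqs = {
        "Archaeopt": "CGATGCTTACCGC",
        "Hesperornis": "CGTTACTCGTTGT",   # name will be truncated to 10 chars
        "B. virgini": "TAATGTTCGTTGT",
    }
    with open("infile", "w") as out:
        length = len(next(iter(seqs.values())))
        out.write(f" {len(seqs)} {length}\n")
        for name, seq in seqs.items():
            out.write(f"{name[:10]:<10}{seq}\n")   # EXACTLY ten characters, then the data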
If you make an error in the input file, the programs will often detect that
they have been fed an illegal character or illegal numerical value and issue an
error message such as "BAD CHARACTER STATE:", often printing out the bad value,
and sometimes the number of the species and character in which it occurred.
The program will then stop shortly after. One of the things which can lead to
a bad value is the omission of something earlier in the file, or the insertion
of something superfluous, which causes the reading of the file to get out of
synchronization. The program then starts reading things it didn’t expect, and
concludes that they are in error. So if you see this error message, you may
also want to look for the earlier problem that may have led to this.
The other major variation on the input data format is the options
information. Many options are selected using the menu, but a few are selected
by including extra information in the input file. Some options are described
below.
The menu itself is not reproduced here; it ends with the prompt:

Are these settings correct? (type Y or the letter for one to change)
If you want to accept the default settings (they are shown in the above case)
you can simply type "Y" followed by a carriage-return (Enter) character. If
you want to change any of the options, you should type the letter shown to the
left of its entry in the menu. For example, to set a threshold type "T".
Lower-case letters will also work. For many of the options the program will
ask for supplementary information, such as the value of the threshold.
Note the "Terminal type" entry, which you will find on all menus. It
allows you to specify which type of terminal your screen is. The options are
an IBM PC screen, an ANSI standard terminal (such as a DEC VT100), a DEC VT52-
compatible terminal, such as a Zenith Z29, or no terminal type. Choosing "0"
toggles among these four options in cyclical order, changing each time the "0"
option is chosen. If one of them is right for your terminal the screen will be
cleared before the menu is displayed. If none works the "none" option should
probably be chosen. Keep in mind that VT-52 compatible terminals can freeze up
if they receive the screen-clearing commands for the ANSI standard terminal!
If this is a problem it may be helpful to recompile the program, setting the
constants near its beginning so that the program starts up with the VT52 option
set.
The other numbered options control which information the program will
display on your screen or on the output files. The option to "Print
indications of progress of run" will show information such as the names of the
species as they are successively added to the tree, and the progress of global
rearrangements. You will usually want to see these as reassurance that the
program is running and to help you estimate how long it will take. But if you
are running the program "in background" as can be done on multitasking and
multiuser systems such as Unix, and do not have the program running in its own
window, you may want to turn this option off so that it does not disturb your
use of the computer while the program is running.
Most of the programs write their output onto a file called (usually)
"outfile", and a representation of the trees found onto a file called
"treefile".
The exact contents of the output file vary from program to program and
also depend on which menu options you have selected. For many programs, if you
select all possible output information, the output will consist of (1) the name
of the program and its version number, (2) the input information printed out,
(3) a series of phylogenies, some with associated information indicating how
much change there was in each character or on each part of the tree. A typical
tree looks like this:
+-------------------Gibbon
+----------------------------2
! ! +------------------Orang
! +------4
! ! +---------Gorilla
+-----3 +--6
! ! ! +---------Chimp
! ! +----5
--1 ! +-----Human
! !
! +-----------------------------------------------Mouse
!
+------------------------------------------------Bovine
The warning message ("remember: ...") indicates that this is an unrooted tree
(mathematicians still call this a tree, though some systematists unfortunately
use the term "network". This conflicts with standard mathematical usage, which
reserves the name "network" for a completely different kind of graph). The
root of this tree could be anywhere, say on the line leading immediately to
Mouse. As an exercise, see if you can tell whether the following tree is or is
not a different one from the above:
+-----------------------------------------------Mouse
!
+---------4 +------------------Orang
! ! +------3
! ! ! ! +---------Chimp
---6 +----------------------------1 ! +----2
! ! +--5 +-----Human
! ! !
! ! +---------Gorilla
! !
! +-------------------Gibbon
!
+-------------------------------------------Bovine
(it is NOT different). It is IMPORTANT also to realize that the lengths of the
segments of the printed tree may not be significant: some may actually
represent branches of zero length, in the sense that there is no evidence that
the branches are nonzero in length. Some of the diagrams of trees attempt to
print branches approximately proportional to estimated branch lengths, while in
others the lengths are purely conventional and are presented just to make the
topology visible. You will have to look closely at the documentation that
accompanies each program to see what it presents and what is known about the
lengths of the branches on the tree. The above tree attempts to represent
branch lengths approximately in the diagram. But even in those cases, some of
the smaller branches are likely to be artificially lengthened to make the tree
topology clearer. Here is what a tree from DNAPARS looks like, when no attempt
is made to make the lengths of branches in the diagram proportional to
estimated branch lengths:
+--Human
+--5
+--4 +--Chimp
! !
+--3 +-----Gorilla
! !
+--2 +--------Orang
! !
+--1 +-----------Gibbon
! !
--6 +--------------Mouse
!
+-----------------Bovine
Some of the parsimony programs in the package can print out a table of the
number of steps that different characters (or sites) require on the tree. This
table may not be obvious at first; the table itself is not reproduced here. The numbers across the top and down the
side indicate which site is being referred to. Thus site 23 is column "3" of row "20" and has 2 steps in this case.
The trees found are also written to a tree file in the standard nested-parenthesis (Newick) notation, for example:

((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));
In the above tree the first fork separates the lineage leading to Mouse and
Bovine from the lineage leading to the rest. Within the latter group there is
a fork separating Gibbon from the rest, and so on. The entire tree is enclosed
in an outermost pair of parentheses. The tree ends with a semicolon. In some
programs such as DNAML, FITCH, and CONTML, the tree will be completely unrooted, with a three-way split at its base:

(A,(B,(C,D)),(E,F));
The three "monophyletic" groups here are A, (B,C,D), and (E,F). The single
three-way split corresponds to one of the interior nodes of the unrooted tree
(it can be any interior node). The remaining forks are encountered as you move
out from that first node, and each then appears as a two-way split. You should
check the documentation files for the particular programs you are using to see
in which of these forms you can expect the user tree to be. Note that many
of the programs that estimate an unrooted tree produce trees in the treefile in
rooted form! This is done for reasons of arbitrary internal bookkeeping. The
placement of the root is arbitrary.
For programs estimating branch lengths, these are given in the trees in
the tree file as real numbers following a colon, and placed immediately after
the group descended from that branch. Here is a typical tree with branch
lengths:
((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);
Note that the tree may continue to a new line at any time except in the middle
of a name or the middle of a branch length, although in trees written to the
tree file this will only be done after a comma.
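The nested-parenthesis format is simple enough to read with a few lines of code. The following minimal recursive-descent
parser is a sketch only (it assumes a well-formed tree with no quoted labels) and returns each node as a
(children, name, branch length) triple:

    # newick.py -- minimal recursive-descent parser for Newick trees
    def parse(s):
        pos = 0
        def node():
            nonlocal pos
            children = []
            if s[pos] == "(":                     # internal node: parse subtrees
                pos += 1
                children.append(node())
                while s[pos] == ",":
                    pos += 1
                    children.append(node())
                pos += 1                          # consume the closing ")"
            start = pos                           # optional node label
            while s[pos] not in ":,();":
                pos += 1
            name = s[start:pos]
            length = None
            if s[pos] == ":":                     # optional branch length
                pos += 1
                start = pos
                while s[pos] not in ",();":
                    pos += 1
                length = float(s[start:pos])
            return (children, name, length)
        return node()

    tree = parse("((cat:47.1,weasel:18.9):20.6,monkey:75.9);")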
Chapter 10

Pattern Analysis
What is “random”? Intuitively, our idea of randomness is closely connected with homogeneity. Properties of a random
sequence should somehow look the same at different scales. If they don’t, we describe the sequence as “patchy”. All
genomes are complex and patchy. Some examples of DNA sequence heterogeneity are protein-coding regions, introns,
CpG islands and dispersed tandem repeats such as the 171 bp human alpha satellite repeat.
What forces create heterogeneity in DNA sequences? Mutation is often thought of as random. However, it is a complex
process that does not occur uniformly across a genome. The process of replication, for example, may favor the expansion
of repetitive regions by slippage. Transcriptionally active DNA may be subject to different mutational forces than non-
transcribed regions. Regulatory elements may have different compositional requirements than coding regions. Natural
selection is a strong force creating DNA heterogeneity. Protein-coding regions experience complex selection intensities that
vary among different codon positions and near splice junctions. Evolutionary history also affects sequence composition.
Bacterial genomes are a mosaic of resident and horizontally transferred segments. Regions recently acquired from another
organism with different base composition may appear as compositional heterogeneity.
Differences in nucleotide composition are observed within genomes as well as between genomes. Karlin and Brendel
(Science 259: 677-680, 1993) discussed the statistical analysis of DNA patchiness. Base content fluctuates at many
different scales. One example is the large (>100 kb) regions in vertebrate genomes called “isochores” (Bernardi, Annu.
Rev. Genetics 29: 445-476, 1995). Isochores are correlated with the staining properties of vertebrate chromosomes
(Giemsa-positive and -negative bands). They have been revealed by physical analysis of DNA fragments as well as from
DNA sequences (Ikemura et al., Genomics 8: 207-216, 1990). Genes tend to be concentrated in (G+C)-rich regions, but
both coding and non-coding portions are subject to similar influences on composition. DNA sequence analysis of one
isochore boundary indicated that it is sharp (Fukagawa et al., Genomics 25: 184-191; 1995). The origin of isochores is
not clear. Bernardi favors an evolutionary explanation based on composition differences between warm and cold-blooded
animals. (G+C)-rich isochores are prominent in mammals and birds, although gene clustering and composition patchiness
has also been observed in plants. Bernardi suggests that the (G+C)-rich isochores of mammals and birds originated about
200 million years ago from corresponding (G+C)-rich regions in their ancestors.

Figure 10.1: Sequence walk plot of the lambda genome after Karlin and Brendel, 1993.
How can compositional patches be detected and what do they mean? These are questions that are actively pursued but not
satisfactorily answered. Sequence walks are a simple method used to detect patchiness (Karlin and Brendel, 1993). As
position is increased along a DNA sequence, the value of a variable is incremented +1 or −1 depending on a compositional
parameter. Figure 10.1 is a sequence walk plot for the bacteriophage lambda genome where +1 is taken if the position
is A or G (R=purine) and -1 if T or C (Y=pyrimidine) as described by Karlin & Brendel (1993). A randomly shuffled
lambda sequence shows a steady increase in R-Y, while the actual lambda sequence has a patchy distribution of purines
and pyrimidines.
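A sequence walk of this kind is easy to reproduce. The sketch below accumulates +1 for purines and −1 for pyrimidines
along any sequence string; plotting the result gives a figure like 10.1:

    # walk.py -- purine-pyrimidine (R-Y) sequence walk, after Karlin and Brendel
    import itertools
    def ry_walk(seq):
        steps = (+1 if base in "AG" else -1 for base in seq.upper())
        return list(itertools.accumulate(steps))   # cumulative R - Y count

    print(ry_walk("GGCAGCCAATCAC"))   # running R - Y total at each position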
Patchiness can also be visualized using a sliding window approach. Compositional parameters such as the (A+T) fraction
are evaluated within a window that slides along the DNA sequence. Figure 10.2 is an (A+T) plot for the E. coli K12
genome. No unusual features are revealed in spite of the fact that the K12 chromosome contains several horizontally
transferred regions.
Figure 10.2: (A+T) fraction of the E. coli K12 chromosome, window = 100,000 nt
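Sliding-window plots such as Figure 10.2 can be generated the same way. A minimal sketch, computing the (A+T)
fraction in successive non-overlapping windows:

    # at_windows.py -- (A+T) fraction in successive windows of a sequence
    def at_fraction(seq, window=100000):
        # yields (window start, A+T fraction) for each non-overlapping window
        for i in range(0, len(seq) - window + 1, window):
            chunk = seq[i:i + window]
            yield i, (chunk.count("A") + chunk.count("T")) / window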
The dinucleotide signature of Karlin and colleagues is the odds ratio ρXY = fXY/(fX fY), where fXY is the frequency of
the dinucleotide XpY and fX and fY are the frequencies of its component nucleotides. For normalized dinucleotide
frequencies in dsDNA, the forward strand is concatenated with its complement in this calculation. When ρXY > 1.0, XpY
is more frequent than expected from the nucleotide composition, while ρXY < 1.0 indicates under-representation. Karlin
et al. (1998) suggest that ρXY < 0.78 or ρXY > 1.23 in 50 kbp or more of DNA are significant.
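A sketch of the calculation, using the odds-ratio definition above and concatenating the strand with its reverse
complement as described (the single dinucleotide spanning the junction is ignored for simplicity):

    # signature.py -- dinucleotide signature rho_XY = f_XY / (f_X * f_Y)
    def rho(seq, x, y):
        comp = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
        s = seq + comp                                  # both strands, as in the text
        f_xy = sum(1 for i in range(len(s) - 1)
                   if s[i:i + 2] == x + y) / (len(s) - 1)   # overlapping dinucleotides
        f_x, f_y = s.count(x) / len(s), s.count(y) / len(s)
        return f_xy / (f_x * f_y)

    print(rho("CGATGCTTACCGC", "C", "G"))               # rho_CG for a toy sequence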
Genomic signatures may be useful for determining similarity within broad groups of organisms. They may also be able
to detect horizontal transmission of DNA, provided the foreign DNA is from an organism with a different dinucleotide
signature. For example, GpC dinucleotides are over-represented in the E. coli genome but not in some other bacteria such
as Pseudomonas. There are many unexplained peculiarities about dinucleotide frequencies. For example, TpA is almost
universally under-represented in DNA. Although this was observed in the biochemical studies of the 1960s, it has never
been explained. The avoidance of CpG in vertebrate genomes is the one significant signature that has a theoretical basis.
Vertebrates, but not invertebrates, methylate CpG (CpG → 5mCpG). Deamination of 5mC produces T, so that 5mCpG
frequently mutates to TpG (mismatch repair is unable to correct T·G pairs). Presumably, as CpG methylation evolved, the
frequency of CpG dinucleotides decreased through mutation. With an important exception, CpG islands remain where
methylation does not occur (Bird, AP, Nature 321: 209-213, 1986; see Figure 10.3). These unmethylated CpG islands are
found in the 5′ regions of many genes, especially those that are constitutively expressed. Interestingly, these CpG islands
become hypermethylated in many tumors and gene expression is silenced (Esteller, M, Corn, PG, Baylin, SB and Herman,
JG, Cancer Res. 61: 3225-3229, 2001). CpG methylation cannot be the complete story for the wide avoidance of this
dinucleotide because CpG is also under-represented in mitochondrial genomes where it is not methylated.
Chargaff’s rules express the fact that double stranded DNA obeys Watson-Crick base pairing. The two strands of dsDNA
are sometimes labeled “Watson” and “Crick”. Chargaff’s first rules are Ac = Tw, Tc = Aw, Cc = Gw and Gc = Cw, where
the letters represent the molar fraction of a base on one strand (the subscripts denote the Crick and Watson strands).
These rules result from formation of Watson-Crick base pairing between strands and are very precisely obeyed by dsDNA
molecules.

Figure 10.3: CpG islands in a 385 kbp segment of human DNA from chromosome 10 (Accession: AL031601). Dinu-
cleotide signature (ρCG for CpG), window = 1,000 nt.

Less well known are Chargaff’s second rules. These apply only approximately and separately to each of the two strands
of dsDNA. They are: Ac ∼ Tc, Tw ∼ Aw, Cc ∼ Gc and Gw ∼ Cw. Chargaff’s second rules express the fact that
complementary strands are approximately symmetric in nucleotide content. If they are true, then Ac = Aw, Tc = Tw,
Cc = Cw and Gc = Gw. Departures from strand symmetry (Chargaff asymmetry) are expressed by the differences
(A−T)/(A+T) and (G−C)/(G+C) on a single strand.
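These asymmetries reduce to two differences per window; a minimal sketch:

    # skew.py -- single-strand Chargaff asymmetries
    def chargaff_asymmetry(seq):
        # returns the AT skew and GC skew of one strand
        a, t = seq.count("A"), seq.count("T")
        g, c = seq.count("G"), seq.count("C")
        return (a - t) / (a + t), (g - c) / (g + c)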
Strand symmetry originates from identical substitution processes affecting each strand, for example, when the change
Ac → Tc has the same probability as Aw → Tw. Under these circumstances, the number of AT base pairs will approximately
equal the number of TA base pairs (and likewise for GC and CG). However, some mutation processes are known to be
strand asymmetric (Francino and Ochman, Trends Genet. 13: 240-245, 1997). Furthermore, nucleotide substitution is
subject to selection that may depend on information contained in only one strand.
The leading- and lagging-strands are replicated by different mechanisms. The leading-strand is copied by a continuous
process, while the lagging strand is synthesized discontinuously using multiple, short RNA primers. Additional enzymes
are needed to synthesize primers and then later remove them and fill in gaps. Leading- and lagging-strand replication may
involve different polymerases with disparate error rates. As well, the structure of the replication fork exposes the leading-
and lagging-strands to different environments. The lagging-strand is more open as a longer, single-stranded structure,
which could lead to increased DNA damage and repair.
Mutagenesis experiments in E. coli have shown that deletions and replication errors are more frequent on the lagging strand.
Differences depend on the agent inducing replication errors. Excess dTTP causes more errors on the lagging strand, while
excess dCTP makes little difference. In general, it seems that Y → R (pyrimidine → purine) changes are more frequent on
the lagging strand, causing an accumulation of purines.

Figure 10.4: Strand asymmetry for the Euglena gracilis chloroplast chromosome (Accession: X70810) after Morton
(1999). The chromosome is circular and strand asymmetry changes sign quickly at the replication origin and at a point
about 180° from the origin. There are also peaks associated with the open reading frames and the three rRNA operons.
φAT = (fA − fT)/(fA + fT) for window = 1,000 nt.
Replication bias may cause a switch in Chargaff asymmetry across a replication origin because at this point the leading-
and lagging-strands change identity. An example is the Euglena gracilis chloroplast genome as reported by Brian Morton
(Proc. Natl. Acad. Sci. USA. 96: 5123-5128, 1999), see Figure 10.4.
Lobry (Mol. Biol. Evol. 13: 660-665, 1996) analyzed the chromosomes of several bacteria for replication bias. The
expected switch in strand asymmetry occurred across the replication origins. Changes in (G-C)/(G+C) were much more
dramatic than changes in (A-T)/(A+T). The replication effect was partly obscured by protein-coding sequences, which
introduce their own bias (see also the Euglena chromosome in Figure 10.4). Wherever one strand had a higher density of
coding sequences, that strand was found to have G > C and T > A. Contrary to the expectation from mutagenesis, the
lagging-strand accumulated more A and C (instead of A and G).
No evidence has been found for replication bias in eukaryotes. Chargaff asymmetries switch rapidly over short regions of
the chromosome although they are generally higher around protein-coding exons. Apparently, the effect of mutational bias
and/or codon selection obscures the asymmetry (if any!) caused by a replication origin.
Transcription can also introduce Chargaff asymmetry since the two strands may be subject to different mutational effects.
During transcription, the non-template strand is in an open single-stranded conformation that is more sensitive to certain
mutations such as C → T (U) deamination. The template strand, on the other hand, may be subject to transcription-dependent
repair. DNA damage (for example a pyrimidine dimer) can stall the RNA polymerase and promote the action of nucleotide
excision repair. This repair may be error-prone, inducing mutations on the template strand. Or unrepaired damage on the
non-template strand may lead to substitution.
The Shannon-Weaver index of a sequence of length L is

H = −L Σi pi log2 (pi )     (10.3)

where the pi are the frequencies of the individual bases (or amino acids). The units of H are called “bits”. Since
logarithms are additive, the factor L in equation 10.3 can be removed (H/L) to give the average value in bits per
nucleotide (or amino acid) site.

Figure 10.5: Amino acid complexity (Shannon-Weaver information content) in the Saccharomyces cerevisiae nuclear
localization sequence binding protein, Nsr1p (Accession: NP 011675). The dashed line shows the Shannon-Weaver index
for the entire protein sequence; the solid lines connect windows of 10 amino acids.

For a DNA sequence of length L containing four bases, the
maximum entropy occurs when each of the four bases has equal frequency. In this case,
Hmax = −L · 4 · (1/4) log2 (1/4)
So Hmax = 2L bits, or 2 bits per nucleotide site. Each nucleotide site can be represented by a two-bit number (11, 10, 01,
00). This is the maximum complexity of a DNA message. Less complexity is contained in sequences that depart from equal
frequency. At the other extreme is a sequence composed of a single base (pi = 1, H/L = 0). The Shannon-Weaver index
can be regarded as a measure of the complexity of a sequence: H/L = 0 represents a sequence of minimum complexity,
while H/L = 2 bits represents the maximum possible complexity.
One way to think about the Shannon-Weaver index is in terms of uncertainty. Suppose the four bases are equally likely.
The uncertainty of a single base is 2 bits before it is read by a functional device (enzyme). After the base is decoded, its
uncertainty is zero. The information content of the message is the decrease in uncertainty as a result of decoding.
Information theory has been applied to the analysis of DNA and protein sequences in three ways.
1. Analyzing sequence complexity from the Shannon-Weaver indices of smaller DNA fragments (windows) contained
in a long sequence as was done in Figure 10.5.
2. Comparing homologous sites in a set of aligned sequences by means of their information content. That is, determin-
ing the complexity of homologous sites.
3. Examining the pattern of information content of a sequence divided into successively longer words (symbols) con-
sisting of single bases, pairs, triplets and so forth. This is a method to look at clustering of nucleotides and will not
be considered.
Figure 10.6: Nucleotide complexity (Shannon-Weaver information content H/L) of the D. melanogaster ADH gene region
(Accession: Z00030), windows of 100 nucleotides overlapped by 50 nucleotides.
An analysis of the D. melanogaster alcohol dehydrogenase (ADH) gene illustrates the application of information theory to
DNA sequence data (Figure 10.6).
The ADH gene lies within a 20 kb intron of a larger gene, outspread. Generally, maximum complexity is found in exons of
either ADH or outspread. In fact, the existence of the left-most exon in outspread was first deduced from an open reading
frame 5′ to the ADH gene before the outspread gene had been mapped. Figure 10.6 also shows a correlation between
complexity and base composition. In principle, increasing the relative frequency of any of the nucleotides should have the
same effect, to decrease complexity. However, in this region of the Drosophila genome, only increased (A+T) decreases
complexity, while increased (G+C) has the opposite effect. High GC is associated with protein-coding exons while high
AT is associated with non-coding DNA such as introns. Although natural selection produces more constrained messages,
proteins do not usually use highly patterned or repetitive codon choices except where simple amino acid repeats are found
(see Figure 10.5). The Shannon-Weaver index reaches nearly the maximum value of 2 bits per site for the protein-coding
exons of these two Drosophila genes. Regions of repetitive DNA, on the other hand, have low complexity. In the ADH
region of the Drosophila genome, these are associated with AT-rich sequences. It is also interesting that intron DNA
between ADH and outspread exons appears to be organized into sub-regions with different complexities. It remains to
be seen if intronic regions of high complexity and GC content are functional and constrained by natural selection, as are
protein-coding exons, or simply a different kind of neutral DNA.
Figure 10.7: Consensus sequence analyses of E. coli promoters. +1 is the transcriptional start position.
Promoter sequences, in conjunction with other DNA elements and proteins, activate RNA polymerase binding and tran-
scription. E. coli promoter elements are recognized by an RNA polymerase holoenzyme which contains a bound sigma
factor (core enzyme plus sigma factor = holoenzyme). The sigma factor is thought to provide most of the sequence recogni-
tion capability of the holoenzyme. E. coli has a number of different sigma factors, each associated with a specific promoter
consensus sequence (Figure 10.7).
The consensus sequence is defined by majority rule. Analysis of the sigma-70 promoter by Lisser and Margalit (Nucleic
Acids Res. 21: 1507-1516, 1993) revealed the consensus sequence shown in Figure 10.7. A pattern search for the sigma-70
promoter based on the consensus sequence would look for TTGACA (N)15−19 TATAAT.
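Such a pattern is conveniently expressed as a regular expression. A sketch (the pattern and spacer range are taken
directly from the consensus above; the test sequence is invented):

    # sigma70.py -- naive consensus search for the sigma-70 promoter
    import re
    # spacer of 15-19 arbitrary bases between the -35 and -10 boxes
    sigma70 = re.compile("TTGACA[ACGT]{15,19}TATAAT")
    test = "AAATTGACA" + "A" * 17 + "TATAATGGG"      # invented test sequence
    for m in sigma70.finditer(test):
        print(m.start(), m.group())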
A major drawback to using the consensus sequence in pattern matching is that rarely will an actual promoter perfectly match
the consensus sequence. No known sigma-70 promoter matches the consensus sequence at all 12 nucleotides (although
this pattern does occur in the E. coli genome). Thus, searching for the consensus sigma-70 promoter sequence in front of
genes is an exercise in futility. The search must be for “something like” the consensus sequence. But how alike?
Variation found among the promoters of individual E. coli genes is indicated under the majority base in Figure 10.7. Some
sites in the promoter sequence are more conserved than others. The cause of variation, however, is unknown. It could
be due to mutational drift under the influence of selection. There may also be gene-specific effects. For example, genes
requiring lower expression may use “weaker” promoters.
It is possible to take variation into account in a pattern search by defining alternative nucleotides. For example, if the most
frequent alternative to the first T in TTGACA is A, the pattern search could be for (T/A)TGACA. Problems with simple
pattern searches are obvious. The number of possible patterns grows exponentially with alternatives, but not all of them are
equally useful as matches. A pattern with 10 mismatches from the consensus is probably not a promoter, but one with two
mismatches might be. To account for this, pattern-matching programs will allow up to a specified number of mismatches.
Another problem is that there may be no clear alternatives to the consensus nucleotide. This is the case with the E. coli
sigma-70 promoter where minority nucleotides are more-or-less evenly distributed (Table 10.1).
Base T T G A C A T A T A A T
A 0.10 0.06 0.09 0.56 0.21 0.54 0.05 0.76 0.15 0.61 0.56 0.06
C 0.10 0.07 0.12 0.17 0.54 0.13 0.10 0.06 0.11 0.13 0.20 0.07
G 0.10 0.08 0.61 0.11 0.09 0.16 0.08 0.06 0.14 0.14 0.08 0.05
T 0.69 0.79 0.18 0.16 0.16 0.17 0.77 0.12 0.60 0.12 0.15 0.82
Table 10.1: Fractional occurrence of nucleotides at each position for 298 E. coli sigma-70 promoters (Lisser and Margalit,
1993)
Variation can be taken into account quantitatively with a scoring matrix derived from the observed frequencies:

Win = log10 (Fin /pn )     (10.4)

Win is the scoring matrix element at the ith position in the pattern for the nth type of nucleotide (G, C, A, or T); Fin is
the frequency of the nth nucleotide at the ith position among the group of patterns used to derive the consensus sequence;
pn is the probability that the nth nucleotide occurs by chance. For example, among the group of promoters used to derive
the sigma-70 consensus sequence in Figure 10.7, the T at -10 (TATAAT) occurs 82% of the time (Table 10.1). The scoring
element for a T at this position is

WiT = log10 (0.82/0.25) = 0.516     (10.5)

(assuming that T occurs with a frequency of 1/4 in the E. coli genome).
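A candidate site is then scored by summing the Win elements that its bases select. The sketch below does this for the
-10 element alone, with Fin taken from the last six columns of Table 10.1 and pn = 0.25:

    # pssm.py -- score a candidate -10 element with W = log10(F/p), eq. 10.4
    from math import log10
    freqs = [  # the six TATAAT positions of Table 10.1; rows give A, C, G, T
        {"A": 0.05, "C": 0.10, "G": 0.08, "T": 0.77},
        {"A": 0.76, "C": 0.06, "G": 0.06, "T": 0.12},
        {"A": 0.15, "C": 0.11, "G": 0.14, "T": 0.60},
        {"A": 0.61, "C": 0.13, "G": 0.14, "T": 0.12},
        {"A": 0.56, "C": 0.20, "G": 0.08, "T": 0.15},
        {"A": 0.06, "C": 0.07, "G": 0.05, "T": 0.82},
    ]
    def score(site, p=0.25):
        return sum(log10(col[base] / p) for col, base in zip(freqs, site))

    print(score("TATAAT"))   # the consensus scores highest
    print(score("TACGAT"))   # mismatches lower the score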
Scores for DNA patterns can also be obtained using neural network methods. Examples of such techniques are discussed
in Hénaut and Danchin (Analysis and predictions from Escherichia coli sequences, or E. coli in silico. In: E. coli and
Salmonella Vol. II, Chapter 114: 2047-2066, 1992). A computer program is “trained” on examples of good and bad
promoters. Matrix elements are flexible and optimized to discriminate between the good and bad promoters of the
training set. Such methods do not
usually give appreciably better results than the maximum likelihood approach. However, they can be more easily adapted
to include additional information about what makes a good promoter. Many promoters require several proteins to initiate
transcription. These recognize other DNA sequence motifs, usually located near the sigma factor binding site. DNA
curvature is often a factor. Upstream sequences that bend DNA increase the activity of some promoters (Travers, Cell 60:
177-180, 1990). DNA bending depends mainly on runs of A or T since the dinucleotide AA/TT has the largest tilt angle
(Trifonov, CRC Revs. Biochem. 19: 89-106, 1985). DNA curvature can be calculated by accumulating AA and TT pairs.
DNA curvature is more easily incorporated into the analysis of promoter scores by using training methods.
The information content at each position of a set of aligned binding sites can be expressed as

Ri = Hmax − (Hi + e(N))     (10.6)

Ri is the information content of the site. Hmax is the maximum uncertainty, 2 bits if the four bases are equally probable
before the site is decoded (see section 10.5.1). After decoding (e.g., by the RNA polymerase for promoter sequences), the
uncertainty (Hi) is given by equation 10.3 with the pi being nucleotide frequencies calculated from each position of the
aligned example sequences. e(N) is a correction factor to account for the fact that only a finite number of example sequences
(N) are used to estimate the information content of the binding site (see Schneider et al, 1986). Figure 10.9 illustrates the
method by analysis of the E. coli FIS binding site using data from Hengen et al (Nucleic Acids Res. 25: 4994-5002, 1997).
Fis binds to and bends DNA at specific sites. It regulates the transcription of a subset of genes in conjunction with RNA
polymerase, and is also involved in the process of recombination. Fis sites have been identified in a number of genes as
well as the fis gene itself. Some genes (e.g., fis) have a cluster of sites in their promoter region. Figure 10.9 displays
the information content of the FIS binding site as a “sequence logo”, where each consensus nucleotide is given a size
proportional to its information. Hengen et al. analyzed 60 example sequences (30 sites in both directions since the Fis site
is known to be symmetrical) from which I selected 10 for illustration in Figure 10.9. Sequence logos can be constructed at
the internet site: http://weblogo.berkeley.edu/.
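The heights in a logo are easy to compute from the alignment itself. The sketch below implements Ri = 2 − Hi for each
column, omitting the small-sample correction e(N) for clarity (the example alignment is invented, not the FIS data):

    # logo_info.py -- per-position information content of aligned sites
    from math import log2
    def column_information(sites):
        for column in zip(*sites):                     # one alignment column at a time
            n = len(column)
            h = -sum((column.count(b) / n) * log2(column.count(b) / n)
                     for b in set(column))
            yield 2 - h                                # Ri = Hmax - Hi, e(N) ignored

    sites = ["TTGCC", "TTGAC", "TTGAA"]                # toy alignment, not the FIS data
    print(list(column_information(sites)))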
An advantage of sequence logos is that sequence conservation can be quantitatively interpreted as the information that the
decoder (e.g., Fis protein) obtains from potential sites in order to recognize a valid site. For example, the information
content of the two GC base pairs in the FIS binding site is approximately 2 bits, close to the maximum information
available. The FIS protein contacts the major groove of dsDNA at these positions and can obtain information about base
pair identity (e.g., CG vs GC). On the other hand, in the central region of the FIS binding site, the protein contacts the
minor groove and can only distinguish GC from AT pairs, but not their orientation. The information available in this region
is approximately 1 bit.

Figure 10.9: Analysis of ten FIS binding sites. The consensus is shown at the top and the ‘logo’ at the bottom.

Consensus  A A C G C T C A A A A A T T G A C C A A A
fis        T T T G C C G A T T A T T T A C G C A A A
oriC       A C A A C T C A A A A A C T G A A C A A C
rrnB       A A C G G G C A A T A A T T G T T C A G C
tufB       G A T G T T G A A A A A G T G T G C T A A
tyrT       G G C G A T T A A A G A A T A A T C G T T
nrd        A C C G A A T A G A A A A C A A C C A T T
tgt        T G A G C T A A A A A A T T C A T C G A T
aldB       G C T G C G C G A T A A A T C G C C A C A
proP       A A A G G T C A T T A A C T G C C C A A T
hin        A G C G A C T A A A A T T C T T C C T T A
The information content of a binding site can be calculated by summing the information at each position. It is approxi-
mately 9 bits for the Fis consensus sequence (Hengen et al., 1997). This contrasts with a maximum of 21 × 2 = 42 bits
of information available in a 21 bp binding site. The FIS protein uses only a fraction of this information in order to
recognize a site. Nine bits of information is sufficient to allow approximately 16,000 sites to be distinguished in the E. coli
genome [9 = −log2(x/G), where G is the genome nucleotide content, about 8 × 10⁶ nucleotides because each nucleotide
begins a potential site (see Schneider et al., 1986); the number of nucleotides in the E. coli genome was doubled because
the Fis site is symmetric. Solving gives x ≈ 16,000]. More stringent binding site recognition requires that more
information be used by the protein.
The total information of a potential binding site can be calculated using a scoring matrix derived from equation 10.7:

Wbj = 2 + log2 (Fbj ) − e(N)     (10.7)

Wbj is the matrix element for the nucleotide of type b at position j in the pattern. Fbj is the frequency of this nucleotide in
the example set at the same position, and e(N) is a correction for the finite size (N) of the example set. The information
content of a test pattern is obtained by using its sequence in equation 10.7. Hengen et al. (1997) used this approach to scan
the E. coli genome for FIS binding sites. A sliding window of 21 nucleotides was moved along the genome sequence and
the information content of potential sites evaluated. Segments with information above 2 bits were considered potential FIS
sites.
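A sketch of such a scan, assuming a weight matrix w already built (for example with equation 10.7), represented as one
dictionary of matrix elements per position:

    # scan.py -- slide a window along a genome and report high-information sites
    def scan(genome, w, threshold=2.0):
        # w: list of {base: element} dictionaries, one per position of the site
        width = len(w)
        for i in range(len(genome) - width + 1):
            site = genome[i:i + width]
            ri = sum(col[base] for col, base in zip(w, site))  # total information
            if ri > threshold:
                yield i, site, ri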
Chapter 11
Exon Analysis
Locating protein-coding genes is an important goal of genomics. This, together with locating RNA genes and regulatory
elements, is the process of annotation. Annotating DNA is based on three tools: 1) aligning cDNA with genomic DNA,
2) similarity to previously identified genes and 3) theoretical prediction. Annotating the human genome is an ongoing
process. The 3 × 10⁹ bp of DNA is estimated to contain on the order of 3 × 10⁴ genes. Approximately 1 × 10⁴ complete
cDNA sequences have so far been identified. It is likely that complete cDNA sequences will never be obtained for all genes,
so that computational techniques will be necessary to obtain a complete understanding of the genome’s coding potential.
Several complications make open reading frames (ORFs) an imperfect guide to protein-coding exons:

1. Sequencing errors, internal “stop” codons that are removed by editing, and codons for selenocysteine.
2. Spurious ORFs that are not part of any protein-coding gene. The non-coding strand of exons often contains ORFs.
That is, the reverse complements of stop codons (TTA [TAA], CTA [TAG], TCA [TGA]) are often statistically
avoided, creating ORFs on the complementary strand (see the sketch following this list).
3. Intron-exon structure combines several ORFs into a single gene. A codon may even be split across a splice junction,
so that it only appears in phase after the exons are fused.
4. Splicing creates multiple transcripts and multiple proteins. Certain exons may only be used in a subset of transcripts.
The D. melanogaster Adh gene, for example, has different transcripts during larval and adult phases of growth.
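As noted in point 2 above, ORFs can be found on both strands. The sketch below reports, for all six reading frames,
every stop-to-stop stretch of at least min_codons codons; start-codon requirements, splicing and the other complications
listed above are ignored:

    # orfs.py -- naive ORF detection in all six reading frames
    STOPS = {"TAA", "TAG", "TGA"}
    def orfs(seq, min_codons=100):
        # yields (strand, frame, start, stop) for each stop-to-stop stretch
        revcomp = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
        for strand, s in (("+", seq), ("-", revcomp)):
            for frame in range(3):
                start = frame
                for i in range(frame, len(s) - 2, 3):
                    if s[i:i + 3] in STOPS:
                        if (i - start) // 3 >= min_codons:
                            yield strand, frame, start, i
                        start = i + 3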
Figure 11.1: Eukaryotic gene with exon-intron structure; protein-coding regions are gray.
Both of these approaches are combined in annotation software that is used for complete genome annotation. An example
is GenomeScan, used extensively to annotate the human genome sequence (Yeh et al. 2001. Genome Res. 11: 803-816).
Gene prediction combines the location of ORFs with other sequence information to make a model of the entire gene. Data
about possible promoters, transcription initiation (cap sites), translation signals (initiation, termination codons), splice
signals, and transcription terminators are combined to make an inference that rejects unlikely ORFs and includes likely
ORFs in a consistent gene model (Figure 11.1).
Gene prediction algorithms calculate an overall statistic and make a decision as to whether or not to present the model as
a potential gene. Neural network methods are often used in which the algorithm is trained on a set of test genes and learns
what weights should be assigned to the various measures in order to give the best discrimination between valid and invalid
test genes.
The ability of various approaches to predict protein-coding genes was assessed by Fickett and Tung (1992. Nucleic Acids
Res. 20: 6441-6450). They identified several features that are particularly useful.
1. Codon usage. A codon usage vector (frequencies of the 64 possible codons) for a potential exon is compared to
that of a reference set of genes, preferably from the same or a closely related organism. Methods differ in how the
reference set is obtained and how the measure of fit is calculated. Reference sets that incorporate information about
the amino acid composition of the potential gene are superior to those that do not.
2. In-phase words. A vector similar to the codon vector is calculated for longer words (oligonucleotides of length n).
Hexamers have proven useful. These take into account tendencies of codon use to be correlated over short ranges
(e.g., a codon ending in G tends not to be followed by one beginning in G).
3. The presence of STOP codons. Most methods only consider ORFs. However, it is possible to incorporate stop
codons into a measure of amino acid content.
4. Amino acid content. Measures of protein function, such as vectors of amino acids, dipeptides and hydrophobicity,
can be obtained for a potential exon. Like the codon usage vectors, these are compared to a reference set. This,
however, may limit identification to particular types of protein-coding genes.
5. Nucleotide periodicity. Nucleotides do not appear at random in coding sequences (nor in non-coding ones). The
statistical average codon is RNY, leading to a periodicity of 3 nucleotides. Periodicity vectors are calculated for
potential exons (e.g., using Fourier transforms or autocorrelation functions).
Codon use and nucleotide periodicities are interdependent properties of protein-coding regions that influence exon predic-
tion.
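Word-frequency vectors of this kind are simple to compute. The sketch below counts in-phase words in a putative coding
region; word=3 gives the codon usage vector and word=6 gives in-phase hexamer frequencies:

    # codon_usage.py -- frequency vector of in-phase words in a putative exon
    from collections import Counter
    def word_usage(orf, word=3, step=3):
        counts = Counter(orf[i:i + word]
                         for i in range(0, len(orf) - word + 1, step))
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}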
Base Composition. Base composition is a major factor influencing codon usage. Organisms, especially bacteria, have
variable GC content. This alters both the types of amino acids used and the codons used to code for them. As an
example, AAA and AAG both code for lysine. As the genome content of (A+T) increases, proteins tend to use more lysine
and more AAA codons (Figure 11.3).
This trend across genomes is repeated within a genome across different genes, although with much more variability. To
illustrate, E. coli genes that have greater (A+T) content tend to use more AAA codons (Figure 11.4).
Mutational bias is thought to have a major effect in determining overall base composition. Other influences, such as selection
for compact genome size, have also been suggested.
Mutational bias could reflect replication error, repair efficiency, nucleotide pools or other, unidentified factors. The causes
of high, low or intermediate GC content among organisms are not known. Neither are the causes of variation among genes
within a genome. Amino acid composition is an obvious possibility, but even with constant composition, GC content can
vary because of synonymous codon choice. The problem of GC content and codon choice is a chicken-or-egg situation.
They are correlated, but which is driving which and what are the underlying forces?
Codon Position. Codon choice is patterned differently at each of the three codon positions (c1, c2, c3). Figure 11.5 shows
nucleotide choices for E. coli. The average nucleotide frequencies of all genes are to the right of histograms showing
deviations from this average at each position.
In all organisms, G is preferred in the first position. T and, less obviously for E. coli, A are avoided. The second position
is less consistent, but A is often preferred, especially at moderate or high GC content. The third (synonymous) position
shows most clearly the effect of variable GC content. In organisms with high GC content, G and C are preferred in the
third position, but are avoided in organisms with high AT content. In E. coli, which has an even distribution of nucleotides,
G and C are slightly preferred and A slightly avoided.
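These position effects can be tabulated directly: count base frequencies separately at c1, c2 and c3 and subtract the overall frequency of each base, as in this sketch (the input string is a placeholder for a concatenated set of coding sequences).

    from collections import Counter

    def position_deviations(coding_seq):
        """Base frequency at each codon position minus the overall frequency.
        Assumes the sequence length is a multiple of 3 and in frame."""
        overall = Counter(coding_seq)
        total = len(coding_seq)
        by_pos = [Counter(), Counter(), Counter()]
        for i, base in enumerate(coding_seq):
            by_pos[i % 3][base] += 1
        n_codons = total // 3
        return [
            {b: by_pos[p][b] / n_codons - overall[b] / total for b in "ACGT"}
            for p in range(3)
        ]

    for p, devs in enumerate(position_deviations("ATGGCTAAAGATGAACTG" * 5), start=1):
        print("c%d" % p, {b: round(d, 2) for b, d in devs.items()})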
The choice of nucleotide at the second codon position is strongly dependent on the hydrophobicity of the protein because
of a pattern in the universal genetic code. T (U in RNA) at c2 is confined to hydrophobic amino acids, while A at c2 is
confined to hydrophilic ones.
The effect of this bias in the genetic code is clearly seen in the distribution of nucleotides at c2 in the E. coli genome
(Figure 11.6). There is a peak of relatively hydrophobic proteins that prefer T (U in RNA) instead of A.
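The quantity plotted in Figure 11.6 is straightforward to compute per gene: the fraction T/(T+A) among second codon positions, a rough proxy for how hydrophobic the encoded protein is. A minimal sketch (placeholder sequence):

    def t_over_ta_at_c2(seq):
        """T/(T+A) among second codon positions; higher values suggest a
        more hydrophobic protein, given the structure of the genetic code."""
        c2 = [seq[i] for i in range(1, len(seq) - 1, 3)]
        t, a = c2.count("T"), c2.count("A")
        return t / (t + a) if (t + a) else float("nan")

    print(t_over_ta_at_c2("ATGCTGATTGCCAAAGAA"))  # placeholder gene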
The patterns of codon use described above are complex, but they are not taken individually into account by gene prediction
programs. Rather, they create trends in protein-coding regions that are utilized by algorithms as frequency distributions of
“words” (for example, hexamer frequencies).
Figure 11.3: The fraction of all codons that are AAA across genomes with different AT contents.
Figure 11.4: The fraction of codons that are AAA for genes of the E. coli genome as a function of the gene’s (A+T) fraction.
Figure 11.6: The relative content of T to (T+A) at the second position of 3180 E. coli genes.
Many programs are available to build gene models, such as FGENEH, GENMARK, GRAIL and GeneParser. Burset and
Guigó (1996. Genomics 34: 353-367) and Guigó et al. (2000. Genome Res. 10: 1631-1642) compared many of them and
found that their accuracy is often overrated because they have been evaluated on genes similar to the test set used to build
the discrimination functions. Three of the most commonly used programs are summarized below.
GeneFinder is a group of programs for gene identification written by Victor Solovyev’s Computational Genomics Group
at the Sanger Centre (Solovyev V and Salamov A. 1997. Proc Int Conf Intell Syst Mol Biol 5: 294-302).
They were used to predict genes in the Drosophila genome (Solovyev V and Salamov A. 2000. Genome Res. 10: 516-
22). The software can be accessed for testing at the commercial site http://www.softberry.com/berry.phtml. FGENES is
designed to identify exons and piece them together to predict multiple genes on both strands. There is a version, FGENES-M,
that predicts multiple models of a single gene, useful if there are alternate splice forms. FGENESH is a variant using
a Hidden Markov Model (HMM, section 11.2.4). FGENESH+ is a program that uses a protein sequence similar to the
predicted gene product (possibly obtained from BLAST) in conjunction with FGENESH to more accurately predict exon
structure.
FGENES relies on identifying exon donor and acceptor splice sites as described by Solovyev et al. (1994. Nucleic Acids
Res. 22: 5156-5163). Flanking (5' and 3') and internal exons are treated with separate algorithms. The program examines
each ORF that terminates in a GT or begins with AG and calculates a linear discriminant function, z = Σi αi xi, where
the xi are measures of a splice site and the αi are weights. The discriminant function is used to classify an exon as valid if
z is above a critical value determined from the analysis of test (learning) data. The measures in the discriminant function
are triplet nucleotide frequencies at the exon-intron boundaries. Because these are organism dependent, discriminant
function weights must be obtained for each species or from a closely related one.
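The classification step itself is a one-liner. A minimal sketch with invented weights, measures and threshold (real values would come from triplet frequencies at exon-intron boundaries in a training set):

    def discriminant(measures, weights):
        """Linear discriminant z = sum(alpha_i * x_i)."""
        return sum(a * x for a, x in zip(weights, measures))

    alphas = [1.2, -0.5, 0.8]        # invented weights (alpha_i)
    x = [0.9, 0.3, 0.7]              # invented splice-site measures (x_i)
    z_critical = 0.8                 # invented threshold from training data

    z = discriminant(x, alphas)
    print("exon classified as", "valid" if z > z_critical else "invalid", "(z = %.2f)" % z)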
GENIE is a program written by the Computational Biology Group at the University of California, Santa Cruz and the
Genomic Informatics Group at LBNL (Kulp D, Haussler D, Reese MG, Eeckman FH. 1996. Proc. Int. Conf. Intell. Syst.
Mol. Biol. 4: 134-42). It uses a Generalized Hidden Markov Model (HMM, section 11.2.4) to develop gene models. It has
been extensively used to predict genes in the human and fruitfly genomes (Reese MG, Kulp D, Tammana H, Haussler D.
2000. Genome Res. 10: 529-38). The web version of Genie is available through the Berkeley Drosophila Genome Project
(http://www.fruitfly.org/seq_tools/genie.html).
GENSCAN is a program developed by Burge and Karlin (1997. J. Mol. Biol. 268: 78-94). Although designed for human
genes, it has been tested successfully on other vertebrate sequences and plants. It also works for Drosophila. A large,
non-redundant set of human genes (2.58 × 10^6 nucleotides containing 1492 exons and 1254 introns) was used to develop
GENSCAN. GENSCAN is generally regarded as one of the best gene prediction programs and has been extensively used
in the human genome project. It incorporates a number of features to build a model.
1. Transcriptional and translational signals are evaluated by weight matrices. Potential signals are: polyadenylation, cap
site, promoter (both TATA and TATA-less promoters are allowed with variable distance to the cap site), translational
start sites (6 nt prior to start codon) and stop sites (3 nt following stop codon).
2. Splice signals. A modified weight matrix method is used to examine potential splice sites (3 nt in exon, 6 nt in
intron). The modified method takes into account correlations between positions.
3. Exon models. Potential coding portions of exons are evaluated using a Markov model that computes transition
probabilities for hexamers ending at each codon position (a minimal sketch follows this list). Scores depend on the
similarity between the GC content of the training sequences and the sequence being evaluated: GENSCAN uses one
of two sets of expected transition probabilities, generated from training sets having either GC < 43% or GC > 43%.
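A minimal sketch of the fifth-order (hexamer) Markov scoring in item 3: each base is scored by the probability of seeing it after the preceding five bases. The transition table below is invented; GENSCAN estimates such tables per codon position and per GC class from its training genes.

    import math

    # Invented transition probabilities: P(next base | preceding 5-mer).
    transitions = {
        ("ATGGC", "T"): 0.4,
        ("TGGCT", "A"): 0.3,
        ("GGCTA", "A"): 0.5,
    }

    def markov_score(seq, table, default=0.25):
        """Sum of log P(base | previous 5 bases) over the sequence.
        Unseen 6-mers fall back to a uniform default probability."""
        score = 0.0
        for i in range(5, len(seq)):
            p = table.get((seq[i - 5:i], seq[i]), default)
            score += math.log(p)
        return score

    print(markov_score("ATGGCTAA", transitions))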
GENOMESCAN is a related program, similar to GENSCAN but more comprehensive and designed for genome annotation.
It has been used in annotating the human genome. It may be accessed at http://genes.mit.edu/genomescan.html. In the web
version, you are required to input a similar protein sequence (rather than having the program obtain sequences from
BLASTX).