

GlycoPP: A Webserver for Prediction of N- and O-Glycosites in Prokaryotic Protein Sequences


Jagat S. Chauhan1, Adil H. Bhat2, Gajendra P. S. Raghava1*, Alka Rao2*
1 Bioinformatics Centre, Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India, 2 Protein Science and Engineering, Institute of
Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India

Abstract
Glycosylation is one of the most abundant post-translational modifications (PTMs) required for various structure/function
modulations of proteins in a living cell. Although elucidated recently in prokaryotes, this type of PTM is present across all
three domains of life. In prokaryotes, two types of protein-glycan linkage are most widespread, namely N-linked, where a
glycan moiety is attached to the amide group of Asn, and O-linked, where a glycan moiety is attached to the hydroxyl
group of Ser/Thr/Tyr. For their biologically ubiquitous nature, significance, and technology applications, the study of
prokaryotic glycoproteins is a fast emerging area of research. Here we describe new Support Vector Machine (SVM) based
algorithms (models) developed for predicting glycosylated-residues (glycosites) with high accuracy in prokaryotic protein
sequences. The models are based on binary profile of patterns, composition profile of patterns, and position-specific scoring
matrix profile of patterns as training features. The study employs an extensive dataset of 107 N-linked and 116 O-linked
glycosites extracted from 59 experimentally characterized glycoproteins of prokaryotes. This dataset includes validated N-
glycosites from phyla Crenarchaeota, Euryarchaeota (domain Archaea), Proteobacteria (domain Bacteria) and validated O-
glycosites from phyla Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria (domain Bacteria). In view of the current
understanding that glycosylation occurs on folded proteins in bacteria, hybrid models have been developed using
information on predicted secondary structures and accessible surface area in various combinations with training features.
Using these models, N-glycosites and O-glycosites could be predicted with an accuracy of 82.71% (MCC 0.65) and 73.71%
(MCC 0.48), respectively. An evaluation of the best performing models with 28 independent prokaryotic glycoproteins
confirms the suitability of these models in predicting N- and O-glycosites in potential glycoproteins from aforementioned
organisms, with reasonably high confidence. A web server, GlycoPP, implementing these models is freely available at
http://www.imtech.res.in/raghava/glycopp/.

Citation: Chauhan JS, Bhat AH, Raghava GPS, Rao A (2012) GlycoPP: A Webserver for Prediction of N- and O-Glycosites in Prokaryotic Protein Sequences. PLoS
ONE 7(7): e40155. doi:10.1371/journal.pone.0040155
Editor: Joy Marilyn Burchell, King’s College London, United Kingdom
Received March 14, 2012; Accepted June 1, 2012; Published July 9, 2012
Copyright: © 2012 Chauhan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: AR and GPSR acknowledge the financial support from Council of Scientific and Industrial Research (CSIR) and Department of Biotechnology (DBT),
Government of India, respectively. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected] (AR); [email protected] (GPSR)

Introduction

Glycosylation is a recently identified post-translational modification of proteins in prokaryotes: Archaea and Bacteria [1,2]. A glycan moiety is attached enzymatically to a protein by the process of glycosylation. Glycosylation is known to influence biological properties like activity, solubility, folding, conformation, stability, half-life, and/or immunogenicity of different cellular proteins, thereby modulating the structure/function of these proteins for a variety of cellular/extracellular functions in a living cell [3–5]. Owing to their involvement in host-pathogen interactions, immunogenicity and many other important cellular functions, a number of bacterial and archaeal glycoproteins have been characterized experimentally [6–10]. Determination of glycosite(s) is one important aspect of glycoprotein characterization. Analysis of glycosites and their neighboring sequence and structural contexts may also provide important evolutionary insights and an understanding of the acceptor specificities of the protein glycosylating enzymes called glycosyltransferases (GTs) and oligosaccharyltransferases (OSTs) [2]. The experimental characterization of glycosite(s) and the glycoproteins, however, can be difficult, technically demanding, and time-consuming owing to the labile nature of the modification involved as well as the lack of high-sensitivity yet cost-effective methods for glycoprotein detection. Therefore, computational algorithms/models to predict glycosites in protein sequences are very useful in complementing and facilitating such studies. A number of such algorithms have been developed to predict glycosites in eukaryotic glycoproteins using different machine learning tools: Neural Network based (NetOglyc) [11,12], Support Vector Machine (SVM) based (NetNglyc) [13], Ensemble of SVMs (EnsembleGly) [14] and Random Forest based [15]. All these existing tools are trained on eukaryotic glycoprotein sequences. However, in the absence of equivalent methods, these tools are routinely used for analyzing glycoproteomics data and potential glycosites in prokaryotic glycoproteins for both N- and O-type glycosylation [16–19]. In a similar context, Dell and co-workers have discussed the unsuitability of these existing glycosite prediction tools for correctly predicting glycosites (especially O-glycosites) in most families of characterized prokaryotic glycoproteins, including pilins, flagellins, autotransporters and serine-rich proteins [20]. In this study using a dataset of experimentally validated 107 N-linked and

PLoS ONE | www.plosone.org 1 July 2012 | Volume 7 | Issue 7 | e40155


Prokaryotic Glycosylated-Residue Prediction

116 O-linked glycosites from archaeal and bacterial glycoproteins, we have found that these tools (as detailed in Table 1) indeed fail to provide reliable predictions for glycosites in prokaryotic glycoproteins. Furthermore, protein glycosylation in prokaryotes is much more versatile than in eukaryotes in terms of both the mechanisms involved and the types of glycans and linkages present, as discussed in references [20–22] and Table 2. Among archaea, N-glycosylation is believed to be widespread, yet experimental evidence exists only for the phyla Crenarchaeota and Euryarchaeota, where it is mediated by the enzyme AglB and its homologues and the sugar is transferred on to the NX(S/T) (where X ≠ P) acceptor sequon in an en bloc fashion. Similarly, in bacteria N-glycosylation is known and experimentally validated only in a few organisms belonging to the phylum Proteobacteria. In Proteobacteria, both sequentially glycosylated (in the cytoplasm, e.g. Haemophilus influenzae) and en bloc glycosylated (in the periplasm, e.g. Campylobacter jejuni) proteins have been characterized in different organisms. Similarly, experimental data on O-glycosites is available only from four bacterial phyla, namely Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria, out of eleven bacterial phyla where glycoproteins are known to exist. Interesting novel "conserved sequences of amino acids around glycosites (sequons)" like (D/E)X1NX(S/T) (where X1 and X ≠ P) for N-glycosylation and D(S/T)(A/I/L/V/M/T) for O-glycosylation, not yet seen in eukaryotes, have been elucidated within these glycoproteins [23,24]. A tool to map such sequons in amino acid sequence(s) of proteins/proteomes has recently been made available by our group [6]. Further, an analysis of amino acid sequences surrounding N-glycosites of the available archaeal glycoproteins (10 at that time) by Abu-Qarn and co-workers has also shown that archaeal N-glycosites are rarely surrounded by aromatic residues, which are abundant at positions –2 and –1 preceding the glycosylated Asn at position 0 in eukaryotic N-glycosites [25–27]. For these reasons, the development of separate and new algorithms for prediction of glycosites in prokaryotes is of high interest and need [20].

Therefore, in this study we have attempted to analyze the sequence context, predicted secondary structure and surface accessibility of the experimentally verified glycosites in the largest available dataset of 107 N-linked glycosylated residues (N-glycosites) and 116 O-linked glycosylated residues (O-glycosites) from 59 prokaryotic glycoproteins retrieved from our recently published database of experimentally characterized prokaryotic glycoproteins, ProGlycProt [6]. In this study, we have developed a number of SVM models using three types of features, namely binary profile of patterns (BPP), composition profile of patterns (CPP), and PSI-BLAST generated PSSM profile of patterns (PPP), to recognize and differentiate glycosylated sequence contexts from non-glycosylated contexts in prokaryotic glycoproteins. Because the mere presence of a consensus sequon/pattern may not always be sufficient for glycosylation to occur, and because glycosites are predominantly situated on loops/accessible portions of folded proteins in prokaryotes, we have employed predicted secondary structure and surface accessibility features in combination with BPP, CPP and PPP for developing hybrid models (Table 3) [21,28]. The best performing and significantly accurate models were then evaluated against an independent dataset of experimentally validated glycosites and finally implemented via the web server GlycoPP (Figure 1), made available through open access at http://www.imtech.res.in/raghava/glycopp/.

Methods

Dataset Generation
Source of data. The primary set consisted of 39 N-linked and 54 O-linked glycoproteins obtained from the first release (July_2011) of the ProGlycProt database [6]. Because the number of experimentally validated proteins is not very high, all the available N-linked and O-linked glycoprotein entries in the primary dataset have been taken into account for this study. However, entries containing only cysteine-linked (S-linked) glycosites as well as all glyco-engineered proteins/peptides have been excluded from the primary set, resulting in a total of 38 N-linked and 48 O-linked glycoproteins for further consideration. Some of these glycoproteins are N- as well as O-glycosylated. These glycoproteins include a variety of important proteins like S-layer proteins, flagellar proteins, pili/fimbrial proteins, lectins, adhesins, glycosidases, Cytochrome hemoprotein, heparinase, Chondroitinase as well as several known-unknown cytoplasmic, membrane bound and exported proteins (Table 2). These glycoproteins represent all types of known N-glycosylation in prokaryotes, covering organisms from the phyla Crenarchaeota and Euryarchaeota of Archaea and the phylum Proteobacteria of Bacteria. Similarly, this dataset represents all available validated examples of O-glycosylated proteins from four phyla, namely Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria of Bacteria. In Archaea, no experimentally validated data exists for O-glycosites so far.
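The sequons quoted in the Introduction lend themselves to a simple pattern scan. The following is a minimal Python sketch, not part of GlycoPP itself: it reports candidate residues matching NX(S/T) (X ≠ P), the extended (D/E)X1NX(S/T) (X1, X ≠ P) and the O-glycosylation sequon D(S/T)(A/I/L/V/M/T). All function and variable names are hypothetical, and the example sequence is made up. A sequon match only marks a candidate site; it does not imply that the residue is glycosylated.

```python
import re

# Zero-width lookaheads so that overlapping sequons are all reported.
N_SEQUON = re.compile(r"(?=N[^P][ST])")              # N-X-(S/T), X != P
N_SEQUON_EXT = re.compile(r"(?=[DE][^P]N[^P][ST])")  # (D/E)-X1-N-X-(S/T), X1, X != P
O_SEQUON = re.compile(r"(?=D[ST][AILVMT])")          # D-(S/T)-(A/I/L/V/M/T)


def find_sites(sequence, pattern, offset):
    """Return 1-based positions of the candidate glycosylated residue.

    `offset` is the index of that residue within the sequon
    (0 for N-X-S/T, 2 for the extended N-sequon, 1 for the O-sequon).
    """
    seq = sequence.upper()
    return [m.start() + offset + 1 for m in pattern.finditer(seq)]


# Hypothetical toy sequence, for illustration only.
toy = "MDQNATKNPSDTA"
n_sites = find_sites(toy, N_SEQUON, 0)          # N4 matches; N8 is followed by P
n_ext_sites = find_sites(toy, N_SEQUON_EXT, 2)  # D2-Q3-N4-A5-T6
o_sites = find_sites(toy, O_SEQUON, 1)          # D11-T12-A13
print(n_sites, n_ext_sites, o_sites)
```

Positions are reported for the Asn (N-sequons) or the Ser/Thr (O-sequon), which is why each pattern carries its own offset.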

Table 1. An evaluation of performances of some of the well-known models for glycosylation prediction on prokaryotic glycoproteins.

Type of Glycosylation | Prediction Tool | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC
N-linked | NetNglyc (1) | 0.5 | 81.75 | 10.16 | 34.41 | −0.11
N-linked | NetNglyc (1) | 0.6 | 50.79 | 42.68 | 45.43 | −0.06
N-linked | NetNglyc (1) | 0.7 | 15.87 | 76.02 | 55.65 | −0.09
N-linked | NetNglyc (1) | 0.9 | 0.79 | 98.37 | 65.32 | −0.03
N-linked | EnsembleGly (3) | 0.3 | 92.86 | 0.41 | 31.72 | −0.2
N-linked | EnsembleGly (3) | 0.5 | 90.48 | 1.63 | 31.72 | −0.18
N-linked | EnsembleGly (3) | 0.7 | 79.37 | 10.98 | 34.14 | −0.13
N-linked | EnsembleGly (3) | 0.9 | 51.59 | 47.15 | 48.66 | −0.01
O-linked | NetOglyc (2) | 0 | 8.38 | 95.64 | 87.96 | 0.05
O-linked | EnsembleGly (3) | 0 | 9.5 | 93.37 | 86 | 0.03

Footnotes: (1) http://www.cbs.dtu.dk/services/NetNGlyc/; (2) http://www.cbs.dtu.dk/services/NetOGlyc-3.0/; (3) http://turing.cs.iastate.edu/EnsembleGly/.


doi:10.1371/journal.pone.0040155.t001
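The four threshold-dependent measures reported in Table 1 are linked, so any one row can be sanity-checked from the others. The sketch below is an illustrative Python calculation, not the authors' evaluation script: it recovers confusion-matrix fractions from the sensitivity, specificity and accuracy of the NetNglyc row at threshold 0.5 and recomputes the Matthews correlation coefficient on the unit scale (−1 to +1). The helper names are hypothetical, and the recovery step assumes accuracy is the prevalence-weighted mean of sensitivity and specificity.

```python
from math import sqrt


def sensitivity(tp, fn):
    """Percentage of true glycosites predicted as glycosylated."""
    return 100.0 * tp / (tp + fn)


def specificity(tn, fp):
    """Percentage of non-glycosylated sites predicted as non-glycosylated."""
    return 100.0 * tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    """Percentage of all predictions that are correct."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def confusion_from_rates(sn_pct, sp_pct, acc_pct):
    """Recover (tp, fp, tn, fn) as fractions of all sites.

    Uses acc = pos * sn + (1 - pos) * sp to solve for the positive fraction.
    """
    sn, sp, acc = sn_pct / 100.0, sp_pct / 100.0, acc_pct / 100.0
    pos = (acc - sp) / (sn - sp)
    neg = 1.0 - pos
    return pos * sn, neg * (1.0 - sp), neg * sp, pos * (1.0 - sn)


# NetNglyc, threshold 0.5: sensitivity 81.75, specificity 10.16, accuracy 34.41
tp, fp, tn, fn = confusion_from_rates(81.75, 10.16, 34.41)
table1_mcc = mcc(tp, fp, tn, fn)
print(round(table1_mcc, 2))
```

A negative coefficient means the predictions are anti-correlated with the experimental annotation, that is, worse than random guessing on this dataset.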


Table 2. Experimentally characterized glycan linkages at known glycosites of bacteria and archaea.

Sugar linkage | Class | Example glycoproteins

N-LINKED GLYCANS/ARCHAEA
Glc-Asn | Halobacteria | Flagellin, Slg
β-GalNAc-Asn | Halobacteria, Methanococci, Methanobacteria, Thermoprotei | Flagellin, Slg, Cytochrome subunit

N-LINKED GLYCANS/BACTERIA
Bac-Asn | Epsilonproteobacteria | AcrA, PEB3, CgpA, HisJ, ZnuA, jlpA etc.
GlcNAc-Asn | Deltaproteobacteria | HmcA
Hexose-Asn, dihexose-Asn, Glu-Asn, Gal-Asn | Gammaproteobacteria | Adhesins

O-LINKED GLYCANS/BACTERIA
Man-Ser/Thr | Actinobacteria, Flavobacteria, Sphingobacteria | Glycosidases, Cell surface lipoproteins, Secreted antigens, Superoxide dismutase, Heparinase, Chondroitinase etc.
Fucose | Bacteroidia | Putative cell division proteins, exported proteins, outer membrane proteins etc.
β-GalNAc-Ser/Thr | Bacilli | Slg
β-D-Gal-Ser/Thr | Bacilli | Slg, SgsE, SgtA etc.
β-GlcNAc-Ser/Thr, HexNAc | Bacilli | Glycocin F, Flagellin
Bac/DATDH-Ser | Betaproteobacteria | Pilin, CcoP, CycB etc.
FucNAc-Ser | Gammaproteobacteria | Pilin
Rha-Ser/Thr, Deoxyhexose-Ser | Gammaproteobacteria | Flagellin

Footnotes: Detailed information about attached glycans and glycoproteins can be obtained from www.proglycprot.org.
doi:10.1371/journal.pone.0040155.t002

Further, within these glycoproteins, at least 30 N-linked and 40 O-linked glycoproteins have less than 40% sequence similarity to each other as deduced from CD-HIT v 4.0, available at http://www.bioinformatics.org/cd-hit/. From the primary dataset, 59 glycoproteins (16 archaeal and 43 bacterial) with a higher number of characterized glycosites were hand-picked to form the main datasets, whereas the remaining 27 (5 archaeal and 22 bacterial) glycoproteins were used as independent datasets of N- and O-glycosites, separately.

Main datasets. The main datasets represent the training datasets employed in profile generation and, later, machine learning. The datasets contain 28 N-linked (overall sequence similarity less than 70%) and 31 O-linked (overall sequence similarity less than 90%) glycoproteins from prokaryotes. Using CD-HIT it has been deduced that in the main datasets, at least 23 N-linked and 26 O-linked glycoproteins have less than 40% sequence similarity to each other. From this set of glycoproteins, all N- and O-glycosites were retrieved and segregated into separate datasets. All probable or predicted glycosites were excluded. Finally, the main datasets contained 107 N-linked and 116 O-linked well-annotated, unambiguous glycosites derived from 59 experimentally validated prokaryotic glycoproteins. The O-linked glycosites (116) consisted exclusively of bacterial glycosites, because no experimentally validated archaeal O-glycosites are available [6]. To our knowledge, these are the most extensive datasets of experimentally validated prokaryotic glycosites (and glycoproteins), employed to develop the first glycosite prediction models trained on and for prokaryotic protein sequences. These datasets are further divided into two subgroups as follows.

Balanced datasets were derived by taking all positive instances (positive training datasets) and randomly selecting an equal number of negative instances (negative training datasets) across the protein lengths. Balanced datasets are useful in accelerating the machine learning and in avoiding the biases in machine learning that are common with realistic datasets.

Realistic datasets contained all glycosylated/positive (107 N-linked and 116 O-linked) and all non-glycosylated/negative (995 N-linked and 2018 O-linked) sites from the glycoprotein sequences. Performance of SVM on realistic (unbalanced) datasets could provide more confidence in predictions from real-world data, where non-glycosylated residues usually far outnumber glycosylated ones in a protein sequence.

Independent datasets. The independent balanced datasets of 28 (10 N-linked and 17 O-linked) glycoproteins with 19 experimentally validated N-glycosites and 61 O-glycosites (with equivalent numbers of non-glycosylated sites) were used as test datasets in this study for evaluating the models trained on the main datasets. Within these, at least 7 N-linked and 14 O-linked glycoproteins show less than 40% sequence similarity to each other.

Pattern Generation and Feature Calculations
Overlapping, symmetrical sequence patterns 21 residues long, each with the residue of interest at the centre, were constructed according to previous studies [14,15]. A sequence pattern was considered positive if its central residue was glycosylated; otherwise it was assigned as a negative pattern. To generate a pattern corresponding to the terminal residues in a protein sequence of length L, (L-1)/2 dummy residues "X" were added at both termini of the protein [29,30].

Binary profile of patterns (BPP). The fixed-length sequence patterns of 21 residues were converted into binary form according to an existing study [30]. Each residue of a pattern was represented by a vector of dimension 21 (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; Cys by 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), covering the 20 amino acids and one dummy amino acid "X".
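The pattern construction and binary encoding described above can be sketched in a few lines of Python. This is an illustrative reading of the Methods, not the GlycoPP source code. It assumes that the padding uses (21−1)/2 = 10 dummy residues per terminus, so that every residue can sit at the centre of a 21-residue pattern, and that the 21 symbols are ordered alphabetically by one-letter code with "X" last; only the Ala and Cys vectors are given in the text, so the rest of the ordering is an assumption. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed ordering)
ALPHABET = AMINO_ACIDS + "X"          # dummy residue "X" as the 21st symbol
WINDOW = 21                           # pattern length used in the study
FLANK = (WINDOW - 1) // 2             # 10 residues on each side of the centre


def extract_patterns(sequence, targets="N"):
    """Return (1-based position, 21-residue pattern) for each target residue.

    The sequence is padded with FLANK dummy residues at both termini so that
    residues near the ends still yield full-length patterns.
    """
    seq = sequence.upper()
    padded = "X" * FLANK + seq + "X" * FLANK
    return [
        (i + 1, padded[i:i + WINDOW])
        for i, residue in enumerate(seq)
        if residue in targets
    ]


def binary_profile(pattern):
    """One-hot encode a pattern: 21 symbols per residue, 441 values in total."""
    vector = []
    for residue in pattern:
        one_hot = [0] * len(ALPHABET)
        # Any symbol outside the 20 standard residues is treated as "X".
        index = ALPHABET.index(residue if residue in ALPHABET else "X")
        one_hot[index] = 1
        vector.extend(one_hot)
    return vector


# Hypothetical toy peptide: Asn at position 2 becomes the centre of the pattern.
patterns = extract_patterns("MNGS", targets="N")
position, pattern = patterns[0]
features = binary_profile(pattern)
print(position, pattern, len(features))
```

For O-glycosites the same function would be called with `targets="ST"`; whether a pattern is labelled positive or negative then depends only on the experimental annotation of its central residue.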


Table 3. An analysis of experimentally observed secondary structures in prokaryotic glycosites.

Protein name (source organism) | PDB ID | Presence of glycan in structure | Validated glycosites in full-length protein sequence | Position of glycosites in PDB entry sequence | SS

N-Glycosylated Proteins
Tetrabrachion (Staphylothermus marinus) | 1YBK, 1FE6 | - | N44, N605, N641, N685, N708, N1279, N1402 | N44 (N1279*) | H1
Chondroitinase ABC (Proteus vulgaris) | 1HN0 | - | N282, N338, N345, N515, N675, N856, N963 | N282, N963 & N675; N338, N345 & N515; N856 | FR1; H1; B1
PotD (Escherichia coli) | 1POT, 1POY | - | N26, N62 | N26; N62 | FR3; FR1 (at beginning of helix)
AcrA (Campylobacter jejuni) | 2K32, 2K33 (NMR) | Heptasaccharide | N123, N273 | N42 (N123*) | FR1
PEB3 (Campylobacter jejuni) | 2HXW | - | N90 | N90 | FR1 (between helices)
HmcA (Desulfovibrio gigas) | 1Z1N | Trisaccharide (NAG, NAA, any epimer of NAG) | N290 | N261 | FR2 (between beta-sheets)

O-Glycosylated Proteins
Chondroitinase-AC (Pedobacter heparinus) | 1CB8, 1HM2, 1HM3, 1HMU, 1HMW | Tetrasaccharide Man-(Rha)-GlcUA-Xyl | S328, S455 | S328; S455 | FR1 (just after helix); FR1 (between beta-sheets)
Chondroitinase-B (Pedobacter heparinus) | 1DBG, 1DBO, 1OFL, 1OFM | Heptasaccharide galactose-β(1–4)[galactose-α(1–3)](2-O-Me)fucose-β(1–4)xylose-β(1–4)glucuronic acid-α(1–2)[rhamnose-α(1–4)]mannose-α(1- | S234 | S234 | FR1 (between beta-strands)
Heparinase II (Pedobacter heparinus) | 2FUQ, 2FUT | Tetrasaccharide Man-(Rha)-GlcUA-Xyl (xylose-β(1–4)glucuronic acid-α(1–2)[rhamnose-α(1–4)]mannose-α(1- | T134 | T134 | H3
Fimbrial protein (Neisseria gonorrhoeae) | 2HI2, 2HIL, 2PIL, 1AY2 | Disaccharides α-D-galactopyranosyl-(1→3)-2,4-diacetamido-2,4-dideoxy-β-D-glucopyranoside (bacillosamine, Bac); Gal-DADDGlc; and GlcNAc-α1,3-Gal | S70 | S63 | H1 (before helix)
Glycocin F (Lactobacillus plantarum) | 2KUY (NMR) | Two N-Acetylglucosamines | S39 | S18 | FR3
Endo-β-N-acetylglucosaminidase F3 (Flavobacterium meningosepticum) | 1EOM, 1EOK | - | T88 | T49 | FR1

Footnotes: All crystal structures are obtained from www.rcsb.org. All structures are at a resolution of 1.4 Å or above.
Symbols used: - : No sugar detected, *: Corresponding position in full length protein sequence, F: flexible regions with turns/loops/coils/bends or no assigned secondary structure, H: helix, B: beta sheet, 1: Intra domain, 2: Interdomain, 3: no assigned domain.
doi:10.1371/journal.pone.0040155.t003
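The SS column of Table 3 can be tallied to see how many of the structurally resolved glycosites fall in flexible regions. The Python sketch below is an illustrative transcription of that column, not part of the original analysis. It assumes that, where a row lists several glycosite groups and several SS codes, they pair up in the order given (for example, Chondroitinase ABC: N282, N963 and N675 as FR; N338, N345 and N515 as H; N856 as B). Domain digits are dropped, and the dictionary and function names are hypothetical.

```python
from collections import Counter

# SS class per glycosite as positioned in the PDB entry sequence (Table 3).
# FR = flexible region, H = helix, B = beta sheet.
N_SITES = {
    "N44": "H",                                  # Tetrabrachion
    "N282": "FR", "N963": "FR", "N675": "FR",    # Chondroitinase ABC
    "N338": "H", "N345": "H", "N515": "H",
    "N856": "B",
    "N26": "FR", "N62": "FR",                    # PotD
    "N42": "FR",                                 # AcrA
    "N90": "FR",                                 # PEB3
    "N261": "FR",                                # HmcA
}
O_SITES = {
    "S328": "FR", "S455": "FR",                  # Chondroitinase-AC
    "S234": "FR",                                # Chondroitinase-B
    "T134": "H",                                 # Heparinase II
    "S63": "H",                                  # Fimbrial protein
    "S18": "FR",                                 # Glycocin F
    "T49": "FR",                                 # Endo-beta-N-acetylglucosaminidase F3
}


def flexible_percentage(*site_tables):
    """Return (SS counts, total sites, percentage of sites in flexible regions)."""
    counts = Counter(ss for table in site_tables for ss in table.values())
    total = sum(counts.values())
    return counts, total, 100.0 * counts["FR"] / total


counts, total, pct_flexible = flexible_percentage(N_SITES, O_SITES)
print(dict(counts), total, pct_flexible)
```

Under this pairing assumption the tally covers 13 N-glycosites and 7 O-glycosites, 13 of which lie in flexible regions.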

Composition profile of patterns (CPP). The composition profile of a pattern is the percentage frequency of each amino acid in a fixed-length sequence pattern. The fractions of all 20 natural amino acids in the fixed-length sequence patterns were calculated using the following equation [30]:

$$\mathrm{Comp}(i) = \frac{R_i}{N} \times 100$$

where Comp(i) is the percent composition of amino acid residue of type i, R_i is the number of amino acid residues of type i, and N is the total number of residues in the fixed-length sequence pattern.

PSSM profile of patterns (PPP). In addition to compositional information, a PSSM provides important information of evolutionary significance about residue conservation at a given position in a protein sequence. The multiple sequence alignment information in the form of a position-specific scoring matrix (PSSM) has been used here to develop the learning model: each glycosylated protein sequence was first searched against the 'SWISS-PROT' database, followed by generation of alignment profiles or position-specific scoring matrices (PSSM) using the PSI-BLAST v 2.2.20 program (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Three iterations of PSI-BLAST were run for each protein with a cut-off e-value of 0.001. We have normalized each


Figure 1. GlycoPP webserver schema. A flowchart of the methodologies employed for development of the GlycoPP webserver for prediction of N- and O-glycosites in prokaryotic protein sequences.
doi:10.1371/journal.pone.0040155.g001

value to the range 0 to 1 using a sigmoid function, where val is the PSSM score and Val is its normalized value [29–31]:

$$\mathrm{Val} = \frac{1}{1 + (2.7182)^{-\mathrm{val}}}$$

Secondary structure information. For this study, the secondary structure (SS) information (coil/helix/sheet) for each glycosylated residue and its sequence context was obtained using the webserver PSIPRED v 3.21, available at http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/ [32].

Surface accessibility information. The accessible surface area (ASA) is the surface area of a protein that is accessible to another protein or ligand(s). For our analysis, the average accessible surface area values of each amino acid were predicted using Sarpred, available at www.imtech.res.in/raghava/sarpred/ [33].

Support Vector Machine (SVM) Algorithm and Evaluation Models
The SVM is a supervised machine-learning technique based on the structural risk minimization principle [34]. In this study, we have used the freely available SVMlight classifier v 6.01 (http://svmlight.joachims.org/), in which the parameters and kernel functions (linear, polynomial, radial basis function, sigmoid) can be adjusted. The advantage of SVM over other machine learning techniques is that it can be trained on a small dataset (as in this study) with minimum over-optimization. SVM-based approaches have been successfully employed in developing both N- and O-glycosylation prediction tools for mammalian glycoproteins in the past [13,14]. Using different sequence properties like identity and position of residues (BPP), percentage composition of residues (CPP) and residue conservation information (PSSM), along with structural features like secondary structure and surface accessibility, several SVM classifiers have been trained and optimized for this study. Our group has previously used one or more of these features successfully in predicting GTP-interacting residues and mannose-interacting residues, in predicting cyclin protein sequences, and in identifying conformational B-cell epitopes from primary sequences of proteins [29,30,35,36]. In this study, a 5-fold cross-validation procedure has been used to develop the prediction model, where five subsets were constructed randomly from the main datasets. At any given time, the models were trained on four sets of the training dataset and the performance was measured on the remaining fifth set. This process was repeated five times in such a way that each set was used once for testing. The final performance was obtained by averaging the performances of all five sets. The models thus obtained were evaluated for performance using the threshold-dependent parameters sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC), as well as the threshold-independent parameter Area Under Curve (AUC).

Evaluation parameters employed in this study are described briefly below:

Sensitivity is the percentage of glycosites that are correctly predicted as glycosylated:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100$$

Specificity is the percentage of non-glycosylated sites that are correctly predicted as non-glycosylated:

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \times 100$$

Accuracy is the percentage of correct predictions out of the total number of predictions:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100$$

Matthews correlation coefficient (MCC) is a measure of both sensitivity and specificity. MCC values range from −1 to +1; a value of 0 indicates completely random prediction and a value of 1 indicates perfect prediction:

$$\mathrm{MCC} = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{[TP + FP][TP + FN][TN + FP][TN + FN]}}$$

where TP is the number of true positives, FN of false negatives, TN of true negatives, and FP of false positives.

Threshold selection is an important criterion for checking the consistency of prediction results. In our study, we have varied the threshold in the range of –1 to +1; normally we selected "0" as the default threshold to achieve a balance between sensitivity and specificity.

Area Under Curve (AUC), a threshold-independent parameter, describes the inherent trade-off between sensitivity and specificity. Receiver Operating Characteristic (ROC) plots were drawn between TP rate (sensitivity) and FP rate (1-specificity) using R-package v 2.14.1 (http://www.r-project.org/) to calculate AUC values. Finally, the best performing models in terms of accuracy and MCC values were validated using an independent dataset of prokaryotic glycoproteins for final implementation at the GlycoPP webserver (Figure 1).

Results

Prediction Performance of Some of the Existing Tools on Prokaryotic Glycoproteins
In order to evaluate the suitability of models trained on eukaryotic glycoproteins for predicting glycosites in prokaryotic proteins, the proteins of the main datasets were run on three of the well-known prediction tools for prediction of N- and O-glycosites. Against the experimentally validated glycoproteins of prokaryotes, the performances of these tools were found to be very poor and are detailed in Table 1. From this, we conclude that methods trained using eukaryotic glycoproteins are not optimal for prediction of potential glycosites in bacterial and archaeal proteins. This also suggests that the sequence or structural contexts around prokaryotic glycosites could be different from what is known in


eukaryotic glycosites. This is logical, as several OSTs with novel mechanisms of sugar transfer on to acceptor proteins are now known in bacteria as well as archaea. This prompted us to develop a number of new algorithms to recognize and differentiate glycosylated and unglycosylated sequence contexts of known glycosites of archaeal and bacterial proteins representing the aforementioned six different phyla. These algorithms are trained using different input features and are described in this study.

Sequence Context of Prokaryotic Glycosites
In an attempt to understand the general preferences for different amino acids around prokaryotic glycosites as well as the differences from the corresponding sequences in eukaryotes, we have generated a number of one-sample and two-sample weblogos (http://weblogo.berkeley.edu/ and http://www.twosamplelogo.org/) for N- and O-glycosites of archaeal and bacterial glycoproteins in an organism-specific, phylum-specific and domain-specific manner, respectively. The interesting existing knowledge as well as our statistically significant observations for the purposes of a prediction model are discussed here briefly. Similar to eukaryotic glycoproteins, the minimal sequon NX(S/T) (where X ≠ P) is essential for N-glycosylation in prokaryotic glycoproteins, for example in all archaeal glycoproteins (Figure S1) [25], in the HmcA protein of Desulfovibrio [37], and in the adhesin protein HMW1 of Haemophilus influenzae and Actinobacillus pleuropneumoniae (where glycosylation is sequential and mediated by a novel cytoplasmic glycosyltransferase, HMW1C of family GT41, Figure S2) [38]. However, as known already, the sequon is extended as (D/E)X1NX(S/T) (where X1 and X ≠ P), though not stringently, in the case of PglB (OST of Campylobacter) mediated en bloc N-glycosylation in Campylobacter and Helicobacter (Figure S2) [39]. As discussed before, the first-ever defined sequon D(S/T)(A/I/L/V/M/T) for O-glycosites is indeed conserved across available glycoproteins from three representative classes, Bacteroidia, Flavobacteria and Sphingobacteria, of the phylum Bacteroidetes (Figure S3). Further, the two-sample logos comparing prokaryotic and eukaryotic N- and O-glycosites clearly illustrate the differences in the amino acid preferences around these glycosites (Figure 2), indicating the necessity of an independent prediction tool for prokaryotes. With respect to the glycosylated Asn (taken as position 0), positions -1 and -2 have previously been stated to be enriched in aromatic amino acids in eukaryotic N-glycosites [25–27]. However, in prokaryotic N-glycosites we instead observe a marked preference for polar residues like Asp/Glu/Thr/Asn and lysine at different positions preceding the glycosylated Asn (Logo C, Figure 2). Similarly, at positions -2 and -6 the occurrence of polar residues is higher around the NX(S/T) motif in validated N-glycosites of prokaryotes than in an equal number of randomly selected NX(S/T) motifs with unglycosylated Asn from prokaryotic glycoproteins (Logo A, Figure 2). An analysis of eukaryotic N-glycosites by Petrescu and co-workers had suggested a preference for small hydrophobic residue at positions +1 and large hydrophobic residue at +3 in prokaryotic O-glycosites, apart from this general presence of small hydrophobic amino acids around 10 residues on either side of glycosylated Ser/Thr residues (at position 0), a marked preference for negatively charged Asp at -1, which in fact is a part of the potential sequon for O-glycosites in Bacteroidetes, is observed (Logo B, Figure 2).

Structural Features of Prokaryotic Glycosites
A previous statistical analysis of all available crystal structures of eukaryotic glycoproteins by Petrescu et al. had suggested that the probability of finding N-glycosites was higher at positions where there was a secondary structure change [27]. Upon analysis of 12 eukaryotic glycoproteins, Julenius et al. had also concluded that O-glycosites mainly occurred in the coil regions of mucin-type O-glycosylated proteins [12]. The secondary structure and surface accessibility of a residue are therefore considered important criteria in prediction of glycosites in eukaryotes. Some of the existing eukaryotic glycosite prediction models have employed these features successfully [11,12]. Unlike eukaryotic N-glycosylation, which is a co-translational event, glycosylation is considered a true post-translational modification in bacteria, where the folding state of a polypeptide/protein could dictate the availability of a sequon/site for glycan attachment on to a protein [21]. Although limited in number, most of the X-ray crystal structures and NMR structures of bacterial glycoproteins (as listed at http://www.proglycprot.org/CrystalStructure.aspx) show that the glycosylated residues are indeed primarily located in surface-exposed flexible loops/turns/bends that should then be accessible to bacterial OSTs/GTs. The structural contexts for 20 glycosites (13 N- and 7 O-glycosites) extracted from available structures of N- and O-glycoproteins of Archaea and Bacteria reveal that at least 65% (13 out of 20) of these glycosites are located in the aforementioned flexible regions and primarily in intra-domain regions (Table 3). Incidentally, at least three of the N-glycosylated proteins, namely PotD of Escherichia coli and AcrA and PEB3 of Campylobacter jejuni, are glycosylated (in vitro/in vivo) by the OST of Campylobacter (PglB), which has previously been shown to transfer sugars post-translationally to locally flexible structures in folded proteins [21,28,40]. Similarly, the glycosylated Ser/Thr residues in Endo-β-N-acetylglucosaminidase F3 (Flavobacterium meningosepticum), Chondroitinase-AC and Chondroitinase-B (Pedobacter heparinus) lie in similar loops/bends in their respective crystal structures (Table 3).

Our analysis of the predicted secondary structure indicates that 55.92% of the validated glycosites are found in coils, 15.51% in helix and 28.57% in sheets, whereas non-validated glycosites or their sequence contexts are found correspondingly less in coil (47.23%), more in helix (24.42%) and almost equally in sheets (28.35%). Similarly, 17.36% of validated O-glycosites are situated in helix, 62.63% in coils and 20.4% in sheets, in contrast to non-validated O-glycosites, which are found more often in helix (22.99%), less in coils (53.98%) and almost equally in sheets (23.03%), respectively (Figure 3). Similarly, predicted surface
eukaryotes, previously [27]. Similarly, in case of prokaryotic accessibility profile of glycosylated Asn residues suggest them to
glycoproteins hydrophobic residues are though present at +1 be much more surface accessible than the corresponding non-
position yet preference for large or small residues are not very clear glycosylated sequence contexts as shown in Figure 4. The
(Figure 2, Figure S1), [25]. Furthermore, increased instances of Pro glycosylated Ser/Thr are again, more accessible (80%) compared
near the glycosylated residues are not observed in bacterial and to the non-glycosylated residues (60%). To summarize, most of the
archaeal glycoproteins as found in eukaryotic glycoproteins. Instead prokaryotic glycosites (both N as well as O) indeed seems to be
Pro is one of the significantly depleted amino acids at +4 and +5 present in flexible and exposed regions. Further, not only the
positions here [27]. Likewise, sequence surrounding all prokaryotic central glycosylated-residues but also their surrounding residues
O-glycosites (Figure 2) is different in having higher instances of Gly, are highly accessible and surface exposed.
Ala, Val and a significant depletion of Pro at almost all positions
(except in mannosylated glycoproteins of Mycobacterium spp, Figure Prediction Performance of SVM Using Balanced Datasets
S2), [8] in comparison to the eukaryotic mucin type O-glycosites SVM models based on BPP, CPP and PSSM profiles are well
that are rich in Ser, Ala and Pro (Logo D, Figure 2), [12]. In recognized for their notable performances in predicting a variety


Figure 2. Sequence contexts of prokaryotic glycosites. Two-sample weblogos depicting enriched and depleted amino acids around prokaryotic N-glycosites (logo A) and prokaryotic O-glycosites (logo B) in comparison to the percentage of these amino acids around non-glycosylated prokaryotic N-glycosites and O-glycosites, respectively. Similarly, logos C and D provide an assessment of probabilities of amino acids around prokaryotic N- and O-glycosites in comparison to probabilities around eukaryotic N- and O-glycosites, respectively. The datasets of eukaryotic N- and O-glycosites used for generation of the weblogos were obtained from SWISS-PROT (2011 release).
doi:10.1371/journal.pone.0040155.g002
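
The sequons quoted in the text, NX(S/T) with X ≠ P, the extended PglB sequon (D/E)X1NX(S/T), and the Bacteroidetes O-glycosylation sequon D(S/T)(A/I/L/V/M/T), can be written as regular expressions. The following is a minimal sketch for locating candidate sites by motif alone; it is not the GlycoPP method (which uses SVM models), and the dictionary labels and the function name `find_sequons` are hypothetical names chosen for illustration.

```python
import re

# Sequon patterns as quoted in the text. The labels are illustrative only.
# Lookahead groups are used so that overlapping motifs are all reported.
SEQUONS = {
    "N_minimal": re.compile(r"(?=(N[^P][ST]))"),
    "N_extended_PglB": re.compile(r"(?=([DE][^P]N[^P][ST]))"),
    "O_Bacteroidetes": re.compile(r"(?=(D[ST][AILVMT]))"),
}

# Offset of the candidate glycosylated residue from the start of each motif.
SITE_OFFSET = {"N_minimal": 0, "N_extended_PglB": 2, "O_Bacteroidetes": 1}


def find_sequons(sequence):
    """Return {label: [(1-based site position, matched motif), ...]}."""
    seq = sequence.upper()
    hits = {}
    for label, pattern in SEQUONS.items():
        hits[label] = [
            (m.start() + SITE_OFFSET[label] + 1, m.group(1))
            for m in pattern.finditer(seq)
        ]
    return hits


if __name__ == "__main__":
    # Toy sequence, not taken from the paper's dataset.
    for label, found in find_sequons("MDQNATKDSAP").items():
        print(label, found)
```

A motif match only marks a candidate: as the text notes, many NX(S/T) motifs carry an unglycosylated Asn, which is why sequence context around the site is used as a training feature.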

of motifs and interactions in biomolecules and have been used effectively in the past for glycosite prediction as well [29–31,35,36]. Accordingly, we generated several SVM models using BPP, CPP and PPP profiles as input features. The performance measures were calculated at different thresholds of SVM scores ranging from -1.0 to 1.0, and the best performing thresholds were selected for further optimization. N-glycosites were best predicted by the SVM model developed using the BPP profile, achieving 79.91% accuracy and 0.60 MCC (Table 4), whereas O-glycosites were best predicted by the SVM model developed using PPP, with 74.57% accuracy and 0.49 MCC (Table 5). As discussed before, sequence features, namely secondary structure (SS) and accessible surface area (ASA), could play an important role in correct prediction of sites of glycosylation in a protein. Therefore, we developed prediction models with these features in the following three combinations: (i) composition profile of patterns with either secondary structure or surface accessibility or both, (ii) binary profile of patterns with either secondary structure or surface accessibility or both, and (iii) PPP with either secondary structure or surface accessibility or both. In general, inclusion of SS and ASA profiles in prediction models helped improve predictions (Table 4, Table 5, Figure S4). The hybrid model of BPP+ASA proved as good as BPP+SS+ASA, improving the maximum MCC of prediction from 0.60 to 0.65 and accuracy of prediction from 79.91% to 82.24% for N-glycosites (Table 4). Similarly, predictions of O-glycosites improved slightly using the hybrid model (based on a combination of PPP+SS+ASA profiles), giving an MCC of 0.48 and an accuracy of 73.71% in comparison with the 73.28% accuracy and 0.47 MCC derived from PPP alone (Table 5).

Prediction Performance of SVM Using Realistic Datasets
For any machine learning technique, learning is easiest when positive and negative instances are equal in number. In glycoproteins, however, the negative instances in a protein sequence can greatly outnumber the positive instances. Therefore, in order to judge the accuracy and applicability of our SVM prediction schemes on the realistic datasets of users, we calculated in parallel the performances of the aforementioned SVM models using realistic datasets. As was seen for models optimized with balanced datasets, BPP based SVM models performed better on realistic datasets and could achieve a maximum MCC of 0.48 and 0.51 with accuracy values of 82.03% and 86.39% for prediction of N-glycosites using solo feature based and hybrid (BPP+ASA) models, respectively. Similarly, O-glycosites could also be predicted with reasonably high accuracy of 70.24% and 89.69% (with corresponding maximum MCC values of 0.19 and 0.50) using CPP and CPP+ASA based models, respectively. Surprisingly, with realistic datasets, predictions for O-glycosites were better with CPP based models, in contrast to the PPP based models that fared well on balanced datasets (Table 5, Table S1, Figure S4). In fact, inclusion of surface accessibility features in combination with CPP in the O-glycosites prediction scheme could enhance the maximum MCC


Figure 3. Predicted secondary structures around prokaryotic glycosites. Average percentage of secondary structures predicted in and
around N-glycosites (panel A) and O-glycosites (panel B) in prokaryotic glycoproteins. The graph indicates a general likelihood of locating a
glycosylated residue in coils/turns in a protein.
doi:10.1371/journal.pone.0040155.g003

Figure 4. Predicted Surface accessibility of prokaryotic glycosites. Average percentage of exposed and buried residues predicted in and
around N-glycosites (panel A) and O-glycosites (panel B) in prokaryotic glycoproteins. The graph suggests higher accessibility of glycosylated residues
on surface of a protein in comparison to non-glycosylated ones.
doi:10.1371/journal.pone.0040155.g004


Table 4. Combined performance statistics of SVM employing solo features and hybrid approaches in predicting N-glycosites (using balanced dataset).

Feature       Sensitivity (%)  Specificity (%)  Accuracy (%)  MCC    AUC
CPP           59.81            64.49            62.15         0.24   0.65019
CPP+SS        63.55            69.16            66.36         0.33   0.68731
CPP+ASA       71.03            69.16            70.09         0.40   0.77203
CPP+SS+ASA    70.09            67.29            68.69         0.37   0.71159
BPP           79.44            80.37            79.91         0.60   0.88322
BPP+SS        82.24            80.37            81.31         0.63   0.88453
BPP+ASA       84.11            81.31            82.71         0.65   0.89807
BPP+SS+ASA    84.11            80.37            82.24         0.65   0.88497
PPP           76.42            69.81            73.11         0.46   0.76833
PPP+SS        75.70            71.03            73.36         0.47   0.78880
PPP+ASA       77.57            71.03            74.30         0.49   0.78636
PPP+SS+ASA    75.70            71.96            73.83         0.48   0.79334

Footnotes: BPP- Binary profile of patterns, CPP- Composition profile of patterns, PPP- PSSM profile of patterns, MCC- Matthews correlation coefficient, AUC- Area under curve, SS- secondary structure and ASA- Accessible surface area.
doi:10.1371/journal.pone.0040155.t004

value of prediction by 2.5 fold (Table S1, Figure S4), indicating that surface accessibility alone could indeed be a useful criterion in the glycosite prediction models discussed here. Further, the poorer performance of SVM observed with realistic datasets than with balanced datasets is attributable to the inherent learning biases of realistic datasets. However, overall our SVM models optimized with realistic datasets fared reasonably well in predicting both N- and O-glycosites from realistic datasets (Table S1).

Prediction Performance on Independent Datasets
Finally, the performance of the best-optimized models discussed before (Table 4, Table 5) was evaluated and compared with the performances of NetOGlyc v 3.0, NetNGlyc and EnsembleGly against an independent set of experimentally verified prokaryotic glycoproteins (Table 6). Models developed and discussed in this study not only provided reasonably high accuracy (86.84% for N-glycosites with a maximum MCC of 0.74, and 76.23% for O-glycosites with a maximum MCC of 0.53) but also convincingly outperformed at least three of the well-known existing glycosite prediction tools in the context of prokaryotic glycosite prediction, as detailed in Table 6.

Description of Web-server
The overall best performing models described in Table 6 are implemented in the form of a web-server, GlycoPP, available freely at http://www.imtech.res.in/raghava/glycopp/. The common gateway interface of GlycoPP is written using CGI/PERL script. This server allows for prediction of N- and O-glycosites in prokaryotic protein sequences. Predictions can be performed by the users at any of the user-defined thresholds ranging from -1.0

Table 5. Combined performance statistics of SVM classifiers employing solo features and hybrid approaches in predicting O-glycosites (using balanced dataset).

Feature       Sensitivity (%)  Specificity (%)  Accuracy (%)  MCC    AUC
CPP           68.10            72.41            70.26         0.41   0.74071
CPP+SS        70.69            71.55            71.12         0.42   0.75743
CPP+ASA       67.24            75.00            71.12         0.42   0.75780
CPP+SS+ASA    72.41            75.00            73.71         0.47   0.76955
BPP           66.38            67.24            66.81         0.34   0.73023
BPP+SS        69.83            68.10            68.97         0.38   0.74160
BPP+ASA       77.59            61.21            69.40         0.39   0.71143
BPP+SS+ASA    65.52            72.41            68.97         0.38   0.73766
PPP           75.00            71.55            73.28         0.47   0.81250
PPP+SS        73.28            73.28            73.28         0.47   0.76806
PPP+ASA       74.14            71.55            72.84         0.46   0.77341
PPP+SS+ASA    77.59            69.83            73.71         0.48   0.76925

Footnotes: BPP- Binary profile of patterns, CPP- Composition profile of patterns, PPP- PSSM profile of patterns, MCC- Matthews correlation coefficient, AUC- Area under curve, SS- secondary structure and ASA- Accessible surface area.
doi:10.1371/journal.pone.0040155.t005
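
The text states that performance was calculated at SVM score thresholds from -1.0 to 1.0 and that the best performing thresholds were kept. A minimal sketch of such a sweep is given below. It assumes that "best" means maximum MCC and that the step size is 0.1; neither is confirmed by this section, and the function names and toy scores are hypothetical.

```python
import math


def mcc_at_threshold(scores, labels, threshold):
    """MCC when every score >= threshold is called a glycosite (label 1)."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(scores, labels, start=-1.0, stop=1.0, step=0.1):
    """Scan thresholds from start to stop and return the one with highest MCC.

    Ties are resolved in favour of the lowest threshold. The step of 0.1 is
    an assumption made for this sketch.
    """
    n_steps = int(round((stop - start) / step))
    candidates = [round(start + i * step, 10) for i in range(n_steps + 1)]
    return max(candidates, key=lambda t: mcc_at_threshold(scores, labels, t))


if __name__ == "__main__":
    # Toy SVM decision values and labels, not data from the paper.
    scores = [-0.8, -0.4, -0.1, 0.2, 0.3, 0.9]
    labels = [0, 0, 0, 1, 1, 1]
    t = best_threshold(scores, labels)
    print(t, mcc_at_threshold(scores, labels, t))
```

Selecting a threshold on the same data used for evaluation gives optimistic figures, which is one reason an independent dataset is also reported.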


Table 6. Comparative performances of existing well-known glycosylation prediction tools and GlycoPP models on independent dataset of prokaryotic glycoproteins.

Prediction of N-glycosites

Models (Threshold)  NetNglyc¹ (0.5)  EnsembleGly³ (0.7)  GlycoPP-BPP (-0.1)  GlycoPP-CPP (0.3)  GlycoPP-PPP (-0.2)  GlycoPP-BPP+ASA
Sensitivity (%)     88.89            94.44               89.47               68.42              78.95               89.47
Specificity (%)     25.00            11.36               73.68               73.68              73.68               84.21
Accuracy (%)        43.55            35.48               81.58               71.05              76.32               86.84
MCC                 0.15             0.09                0.64                0.42               0.53                0.74

Prediction of O-glycosites

Models (Threshold)  NetOGlyc² (0.1)  EnsembleGly³ (0.3)  GlycoPP-BPP (0.2)   GlycoPP-CPP (0.2)  GlycoPP-PPP (0)     GlycoPP-PPP+ASA
Sensitivity (%)     100.00           6.67                72.13               72.55              77.05               81.97
Specificity (%)     3.19             93.05               73.77               68.18              70.49               70.49
Accuracy (%)        8.27             88.28               72.95               70.36              73.77               76.23
MCC                 0.04             -0.00               0.46                0.41               0.48                0.53

Footnotes: 1: http://www.cbs.dtu.dk/services/NetNGlyc/, 2: http://www.cbs.dtu.dk/services/NetOGlyc-3.0/, 3: http://turing.cs.iastate.edu/EnsembleGly/, BPP- Binary profile of patterns, CPP- Composition profile of patterns, PPP- PSSM profile of patterns, MCC- Matthews correlation coefficient, AUC- Area under curve, SS- secondary structure and ASA- Accessible surface area.
doi:10.1371/journal.pone.0040155.t006
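
The web-server description states that input is accepted as single or multiple sequences in FASTA format, and the models operate on sequence patterns around candidate residues. The sketch below illustrates one way such input could be turned into fixed-length windows centred on Asn or Ser/Thr. It is not the server's CGI/PERL code; the function names, the flank width of 10 residues and the padding character `X` are all assumptions made for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}; headers keep their full line."""
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def candidate_windows(sequence, residues="N", flank=10, pad="X"):
    """Return (1-based position, window) for each candidate residue.

    Windows have length 2 * flank + 1. Positions near the termini are padded
    with `pad` so that every window has the same length (an assumed scheme).
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, amino_acid in enumerate(sequence):
        if amino_acid in residues:
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    # Toy record, not a sequence from the paper's dataset.
    fasta = ">demo_protein toy example\nMDQNAT\nKDSAP\n"
    for name, seq in parse_fasta(fasta).items():
        print(name, candidate_windows(seq, residues="N", flank=2))
        print(name, candidate_windows(seq, residues="ST", flank=2))
```

Each window would then be encoded (for example as a binary, composition or PSSM profile) before being scored by a trained model and compared with the user-chosen threshold.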

to 1.0 for optimizing SVM scores. Input is acceptable as single or multiple sequences in standard FASTA format.

Discussion
In this study, we have developed new SVM based glycosite prediction models trained on, and applicable at least to, N- and/or O-glycosylated proteins belonging to six different archaeal and bacterial phyla, namely Crenarchaeota, Euryarchaeota, Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. The overall best performing models are implemented at the GlycoPP webserver, available freely to users (Figure 1). Our approach is similar to the existing models employed successfully for in silico identification of glycosites in eukaryotic glycoproteins [13,14]. The webserver allows users to identify probable sites of N- and O-glycosylation in proteins belonging to the bacteria or archaea described above, or to similar organisms, much more confidently than is possible with existing tools of a similar nature. In this study, we observed that BPP models (containing single sequence information) were more efficient in discriminating N-glycosylated and non-glycosylated sequences, irrespective of their training on balanced or realistic datasets, owing to the presence of a defined consensus sequon NX(S/T) in all N-glycosites. In the case of O-glycosylation, by contrast, multiple sequence information based PPP models performed better as the sole classifying feature. Possibly, given the lack of a defined consensus sequon for most O-glycosites (except in phylum Bacteroidetes) [6,24], PSSM derived profiles are more informative and useful for O-glycosite prediction. In our study, average surface accessibility emerged as a more useful criterion than secondary structure around glycosylated residues in most of our hybrid prediction approaches. The tool in its existing form would be useful for both single protein and proteome scale analysis. However, users are encouraged to supplement these results with other complementary evidence, such as the presence of signal peptides, transmembrane domains, sub-cellular localization of the proteins, presence of certain OSTs or GTs in the genome of the organism to indicate the likely type and mode of glycosylation, known glycosylation in a close homologue, and available experimental data on type of linkages, attached sugars etc., for best interpretation of the results obtained and also to decipher their biological significance. The datasets used in this study are currently the largest and the most extensive available, yet inclusion of more validated sequences or features may further enhance the prediction accuracy in future.
Further, the preliminary information gleaned from various organism-, phylum- and domain-specific weblogos of prokaryotic glycoproteins suggests that the sequence context of bacterial and archaeal N-glycosites not only differs from eukaryotic ones but may also vary between archaea and bacteria (Figure S1). In view of the understanding that the archaeal OST could be evolutionarily closer to the eukaryotic OST [41], it may be beneficial to develop prediction tools separately for archaea and bacteria in future, when sufficient experimental data is available. Similarly, the approach could be extended to different phyla under domain Bacteria, where novel sequons for N- and O-glycosites seem to be conserved within a phylum. For example, the preference for an acidic residue at the -2 position in the sequon for N-glycosylation among epsilonproteobacteria like Campylobacter and Helicobacter, and the novel O-glycosylation sequon D(S/T)(A/I/L/V/M/T) in phylum Bacteroidetes (Figures S2, S3), indicate that glycan and/or acceptor sequence specificities of OSTs/GTs could be conserved within a close group of bacteria and archaea. Therefore, in future it will be desirable to develop tools where predictions could be made taking into account the glycan and/or acceptor sequence specificities of such individual protein glycosyltransferases of prokaryotes. However, as most of the OSTs involved in en bloc N- and O-glycosylation in both archaea and bacteria, including AglB, PglB, PglL and their homologues, have been shown to have relaxed glycan specificity, the correlation between acceptor sequence specificity and glycan specificity of these enzymes may not be straightforward (Table 2) [37,42,43]. In this context, it could be speculated that in prokaryotes a complex inter-play of available biosynthesis machinery of certain precursor sugars, corresponding


glycans, presence of certain OSTs/GTs along with their fine tuned specificities or subtle preferences towards given glycans and/or acceptor sequences may define protein glycosylation under given conditions.

Supporting Information
The various datasets used in this study are available in downloadable format at http://www.imtech.res.in/raghava/glycopp/suppli.html.

Figure S1 Weblogos for archaeal N-glycosites (panel A) and bacterial N-glycosites (panel B).
(TIF)

Figure S2 Weblogos depicting two sequons for bacterial N-glycosites: (D/E)X1NX(S/T) in Campylobacter (panel A) and NX(S/T) in Haemophilus (panel B). Panel D represents the typical eukaryotic mucin-like sequence context around O-glycosites of mycobacterial glycoproteins, whereas the context of O-glycosites in Campylobacter is Ser, Gly rich, as shown in panel C.
(TIF)

Figure S3 Conserved sequon D(S/T)(A/I/L/V/M/T) at O-glycosites in glycoproteins belonging to all major representatives: Bacteroides (panel A), Flavobacterium (panel B) and Pedobacter (panel C) of phylum Bacteroidetes (panel D).
(TIF)

Figure S4 ROC plots for various hybrid models for prediction of N-glycosites (panel A & B) and O-glycosites (panel C & D) using balanced datasets and realistic datasets, respectively. The Area Under Curve (AUC) depicts relative trade-offs between true positives and false positives.
(TIF)

Table S1 Combined prediction performance of SVM employing solo features and hybrid approaches (using realistic datasets).
(DOC)

Acknowledgments
JSC and AHB are thankful to Council of Scientific and Industrial Research (CSIR) for their Senior and Junior research fellowships, respectively.

Author Contributions
Conceived and designed the experiments: AR GPSR. Performed the experiments: JSC. Analyzed the data: AR GPSR. Contributed reagents/materials/analysis tools: AHB. Wrote the paper: JSC AHB AR.

References
1. Messner P (2004) Prokaryotic glycoproteins: unexplored but important. Journal of Bacteriology 186: 2517–2519.
2. Abu-Qarn M, Eichler J, Sharon N (2008) Not just for Eukarya anymore: protein glycosylation in Bacteria and Archaea. Curr Opin Struct Biol 18: 544–550.
3. Lechner J, Wieland F (1989) Structure and biosynthesis of prokaryotic glycoproteins. Annu Rev Biochem 58: 173–194.
4. Upreti RK, Kumar M, Shankar V (2003) Bacterial glycoproteins: functions, biosynthesis and applications. Proteomics 3: 363–379.
5. Varki A (1993) Biological roles of oligosaccharides: all of the theories are correct. Glycobiology 3: 97–130.
6. Bhat AH, Mondal H, Chauhan JS, Raghava GP, Methi A, et al. (2012) ProGlycProt: a repository of experimentally characterized prokaryotic glycoproteins. Nucleic Acids Res 40: D388–393.
7. Benz I, Schmidt MA (2002) Never say never again: protein glycosylation in pathogenic bacteria. Molecular Microbiology 45: 267–276.
8. Dobos KM, Khoo KH, Swiderek KM, Brennan PJ, Belisle JT (1996) Definition of the full extent of glycosylation of the 45-kilodalton glycoprotein of Mycobacterium tuberculosis. Journal of Bacteriology 178: 2498–2506.
9. Roy K, Hamilton D, Ostmann MM, Fleckenstein JM (2009) Vaccination with EtpA glycoprotein or flagellin protects against colonization with enterotoxigenic Escherichia coli in a murine model. Vaccine 27: 4601–4608.
10. Jennings MP, Jen FE, Roddam LF, Apicella MA, Edwards JL (2011) Neisseria gonorrhoeae pilin glycan contributes to CR3 activation during challenge of primary cervical epithelial cells. Cell Microbiol 13: 885–896.
11. Hansen JE, Lund O, Tolstrup N, Gooley AA, Williams KL, et al. (1998) NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconj J 15: 115–130.
12. Julenius K, Molgaard A, Gupta R, Brunak S (2005) Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites. Glycobiology 15: 153–164.
13. Gupta R, Brunak S (2002) Prediction of glycosylation across the human proteome and the correlation to protein function. Pac Symp Biocomput: 310–322.
14. Caragea C, Sinapov J, Silvescu A, Dobbs D, Honavar V (2007) Glycosylation site prediction using ensembles of Support Vector Machine classifiers. BMC Bioinformatics 8: 438.
15. Hamby SE, Hirst JD (2008) Prediction of glycosylation sites using random forests. BMC Bioinformatics 9: 500.
16. Hanna ES, Roque-Barreira MC, Bernardes ES, Panunto-Castelo A, Sousa MV, et al. (2007) Evidence for glycosylation on a DNA-binding protein of Salmonella enterica. Microb Cell Fact 6: 11.
17. Herrmann JL, Delahay R, Gallagher A, Robertson B, Young D (2000) Analysis of post-translational modification of mycobacterial proteins using a cassette expression system. FEBS Lett 473: 358–62.
18. Balonova L, Hernychova L, Mann BF, Link M, Bilkova Z, et al. (2010) Multimethodological approach to identification of glycoproteins from the proteome of Francisella tularensis, an intracellular microorganism. J Proteome Res 9: 1995–2005.
19. Ghoshal A, Mukhopadhyay S, Demine R, Forgber M, Jarmalavicius S, et al. (2009) Detection and characterization of a sialoglycosylated bacterial ABC-type phosphate transporter protein from patients with visceral leishmaniasis. Glycoconj J 26: 675–89.
20. Dell A, Galadari A, Sastre F, Hitchen P (2010) Similarities and differences in the glycosylation mechanisms in prokaryotes and eukaryotes. Int J Microbiol 2010: 148–178.
21. Nothaft H, Szymanski CM (2010) Protein glycosylation in bacteria: sweeter than ever. Nat Rev Microbiol 8: 765–778.
22. Marino K, Bones J, Kattla JJ, Rudd PM (2010) A systematic approach to protein glycosylation analysis: a path through the maze. Nat Chem Biol 6: 713–723.
23. Kowarik M, Young NM, Numao S, Schulz BL, Hug I, et al. (2006) Definition of the bacterial N-glycosylation site consensus sequence. Embo Journal 25: 1957–1966.
24. Fletcher CM, Coyne MJ, Comstock LE (2011) Theoretical and experimental characterization of the scope of protein O-glycosylation in Bacteroides fragilis. Journal of Biological Chemistry 286: 3219–3226.
25. Abu-Qarn M, Eichler J (2007) An analysis of amino acid sequences surrounding archaeal glycoprotein sequons. Archaea 2: 73–81.
26. Ben-Dor S, Esterman N, Rubin E, Sharon N (2004) Biases and complex patterns in the residues flanking protein N-glycosylation sites. Glycobiology 14: 95–101.
27. Petrescu AJ, Milac AL, Petrescu SM, Dwek RA, Wormald MR (2004) Statistical analysis of the protein environment of N-glycosylation sites: implications for occupancy, structure, and folding. Glycobiology 14: 103–114.
28. Kowarik M, Numao S, Feldman MF, Schulz BL, Callewaert N, et al. (2006) N-linked glycosylation of folded proteins by the bacterial oligosaccharyltransferase. Science 314: 1148–1150.
29. Chauhan JS, Mishra NK, Raghava GP (2010) Prediction of GTP interacting residues, dipeptides and tripeptides in a protein from its evolutionary information. BMC Bioinformatics 11: 301.
30. Agarwal S, Mishra NK, Singh H, Raghava GP (2011) Identification of mannose interacting residues using local composition. PLoS One 6: e24039.
31. Kumar M, Gromiha MM, Raghava GP (2008) Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins 71: 189–194.
32. McGuffin LJ, Bryson K, Jones DT (2000) The PSIPRED protein structure prediction server. Bioinformatics 16: 404–405.
33. Garg A, Kaur H, Raghava GP (2005) Real value prediction of solvent accessibility in proteins using multiple sequence alignment and secondary structure. Proteins 61: 318–324.
34. Joachims T (1999) Making large-Scale SVM Learning Practical. In: Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press.
35. Kalita MK, Nandal UK, Pattnaik A, Sivalingam A, Ramasamy G, et al. (2008) CyclinPred: a SVM-based method for predicting cyclin protein sequences. PLoS One 3: e2605.
36. Ansari HR, Raghava GP (2010) Identification of conformational B-cell Epitopes in an antigen from its primary sequence. Immunome Res 6: 6.


37. Ielmini MV, Feldman MF (2011) Desulfovibrio desulfuricans PglB homolog possesses oligosaccharyltransferase activity with relaxed glycan specificity and distinct protein acceptor sequence requirements. Glycobiology 21: 734–742.
38. Choi KJ, Grass S, Paek S, St Geme JW, 3rd, Yeo HJ (2010) The Actinobacillus pleuropneumoniae HMW1C-like glycosyltransferase mediates N-linked glycosylation of the Haemophilus influenzae HMW1 adhesin. PLoS One 5: e15888.
39. Jervis AJ, Langdon R, Hitchen P, Lawson AJ, Wood A, et al. (2010) Characterization of N-linked protein glycosylation in Helicobacter pullorum. Journal of Bacteriology 192: 5228–5236.
40. Rangarajan ES, Bhatia S, Watson DC, Munger C, Cygler M, et al. (2007) Structural context for protein N-glycosylation in bacteria: The structure of PEB3, an adhesin from Campylobacter jejuni. Protein Sci 16: 990–995.
41. Maita N, Nyirenda J, Igura M, Kamishikiryo J, Kohda D (2010) Comparative structural biology of eubacterial and archaeal oligosaccharyltransferases. Journal of Biological Chemistry 285: 4941–4950.
42. Faridmoayer A, Fentabil MA, Haurat MF, Yi W, Woodward R, et al. (2008) Extreme substrate promiscuity of the Neisseria oligosaccharyl transferase involved in protein O-glycosylation. J Biol Chem 283: 34596–34604.
43. Calo D, Kaminski L, Eichler J (2010) Protein glycosylation in Archaea: sweet and extreme. Glycobiology 20: 1065–1076.
