PhosphoSVM: Prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine

Yongchao Dou, Bo Yao, Chi Zhang

Research output: Contribution to journal › Article

59 Citations (Scopus)

Abstract

Phosphorylation is one of the most essential post-translational modifications in eukaryotes. Studies on kinases and their substrates are important for understanding cellular signaling networks. Because of the cost in time and labor associated with large-scale wet-bench experiments, computational prediction of phosphorylation sites has become important, and many computational tools have been developed in recent decades. The prediction tools can be grouped into two categories: kinase-specific and non-kinase-specific tools. With more kinases being discovered by new sequencing technologies, accurate non-kinase-specific prediction tools are highly desirable for whole-genome annotation in a wider variety of species. In this manuscript, a support vector machine is used to combine eight different sequence-level scoring functions to predict phosphorylation sites. The attributes used in this work, including Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent accessible area, overlapping properties, averaged cumulative hydrophobicity, and k-nearest neighbor, yielded better results than the attributes previously used by other similar methods. This method achieved AUC values of 0.8405/0.8183/0.7383 for serine (S), threonine (T), and tyrosine (Y) phosphorylation sites, respectively, in animals with tenfold cross-validation. The model trained on the animal phosphorylation sites was also applied to a plant phosphorylation site dataset as an independent test. The AUC values for the independent test dataset were 0.7761/0.6652/0.5958 for S/T/Y phosphorylation sites, which compared favorably with those of several existing methods. A web server based on our method was constructed for public use. The server, trained model, and all datasets used in the current study are available at http://sysbio.unl.edu/PhosphoSVM.
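As a rough illustration of two of the attributes named in the abstract (Shannon entropy and relative entropy) and of the AUC metric it reports, the sketch below computes them for sequence windows centered on candidate S/T/Y residues. This is not the authors' implementation: the window half-width, the composition-based definition of both entropies, the uniform background distribution, and all function names are assumptions made for the example, and the paper's actual scoring functions may be defined differently.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def candidate_windows(sequence, half_width=3, targets="STY"):
    """Yield (position, window) for each S/T/Y residue; '-' pads the ends.

    half_width is a hypothetical parameter, not a value from the paper.
    """
    padded = "-" * half_width + sequence + "-" * half_width
    for i, residue in enumerate(sequence):
        if residue in targets:
            yield i, padded[i:i + 2 * half_width + 1]


def shannon_entropy(window):
    """Shannon entropy (bits) of the window's residue composition."""
    counts = Counter(r for r in window if r in AMINO_ACIDS)  # skip padding
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def relative_entropy(window, background=None):
    """Kullback-Leibler divergence (bits) of the window from a background.

    Assumes a uniform background over the 20 amino acids unless one is given.
    """
    if background is None:
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(r for r in window if r in AMINO_ACIDS)
    total = sum(counts.values())
    return sum(
        (n / total) * math.log2((n / total) / background[r])
        for r, n in counts.items()
    )


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct pair; labels are 1 (site) or 0 (non-site).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    for pos, win in candidate_windows("MKSPTAYLS"):
        print(pos, win, round(shannon_entropy(win), 3), round(relative_entropy(win), 3))
```

In a pipeline like the one the abstract describes, per-window values such as these would be among the features passed to a support vector machine, and the classifier's decision scores would be evaluated with `auc` inside a cross-validation loop. That wiring is omitted here because the abstract does not specify it.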

Original language: English (US)
Pages (from-to): 1459-1469
Number of pages: 11
Journal: Amino Acids
Volume: 46
Issue number: 6
DOIs: 10.1007/s00726-014-1711-5
State: Published - Jun 2014

Fingerprint

Phosphorylation
Support vector machines
Proteins
Phosphotransferases
Entropy
Area Under Curve
Animals
Servers
Cell signaling
Secondary Protein Structure
Threonine
Post Translational Protein Processing
Hydrophobicity
Support Vector Machine
Eukaryota
Hydrophobic and Hydrophilic Interactions
Serine
Tyrosine
Animal Models
Genes

Keywords

  • Non-kinase-specific tool
  • Phosphorylation site prediction
  • Support vector machine

ASJC Scopus subject areas

  • Biochemistry
  • Clinical Biochemistry
  • Organic Chemistry
  • Medicine(all)

Cite this

PhosphoSVM: Prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. / Dou, Yongchao; Yao, Bo; Zhang, Chi.

In: Amino Acids, Vol. 46, No. 6, 06.2014, p. 1459-1469.


@article{4c3e083a4db64e77a46f691bb6d15a27,
title = "PhosphoSVM: Prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine",
abstract = "Phosphorylation is one of the most essential post-translational modifications in eukaryotes. Studies on kinases and their substrates are important for understanding cellular signaling networks. Because of the cost in time and labor associated with large-scale wet-bench experiments, computational prediction of phosphorylation sites becomes important and many computational tools have been developed in the recent decades. The prediction tools can be grouped into two categories: kinase-specific and non-kinase-specific tools. With more kinases being discovered by the new sequencing technologies, accurate non-kinase-specific prediction tools are highly desirable for whole-genome annotation in a wider variety of species. In this manuscript, a support vector machine is used to combine eight different sequence level scoring functions to predict phosphorylation sites. The attributes used by this work, including Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent accessible area, overlapping properties, averaged cumulative hydrophobicity, and k-nearest neighbor, were able to obtain better results than the previously used attributes by other similar methods. This method achieved AUC values of 0.8405/0.8183/0.7383 for serine (S), threonine (T), and tyrosine (Y) phosphorylation sites, respectively, in animals with a tenfold cross-validation. The model trained by the animal phosphorylation sites was also applied to a plant phosphorylation site dataset as an independent test. The AUC values for the independent test dataset were 0.7761/0.6652/0.5958 for S/T/Y phosphorylation sites, which compared favorably with those of several existing methods. A web server based on our method was constructed for public use. The server, trained model, and all datasets used in the current study are available at http://sysbio.unl.edu/PhosphoSVM .",
keywords = "Non-kinase-specific tool, Phosphorylation site prediction, Support vector machine",
author = "Yongchao Dou and Bo Yao and Chi Zhang",
year = "2014",
month = "6",
doi = "10.1007/s00726-014-1711-5",
language = "English (US)",
volume = "46",
pages = "1459--1469",
journal = "Amino Acids",
issn = "0939-4451",
publisher = "Springer Wien",
number = "6",
}

TY - JOUR
T1 - PhosphoSVM
T2 - Prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine
AU - Dou, Yongchao
AU - Yao, Bo
AU - Zhang, Chi
PY - 2014/6
Y1 - 2014/6

N2 - Phosphorylation is one of the most essential post-translational modifications in eukaryotes. Studies on kinases and their substrates are important for understanding cellular signaling networks. Because of the cost in time and labor associated with large-scale wet-bench experiments, computational prediction of phosphorylation sites becomes important and many computational tools have been developed in the recent decades. The prediction tools can be grouped into two categories: kinase-specific and non-kinase-specific tools. With more kinases being discovered by the new sequencing technologies, accurate non-kinase-specific prediction tools are highly desirable for whole-genome annotation in a wider variety of species. In this manuscript, a support vector machine is used to combine eight different sequence level scoring functions to predict phosphorylation sites. The attributes used by this work, including Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent accessible area, overlapping properties, averaged cumulative hydrophobicity, and k-nearest neighbor, were able to obtain better results than the previously used attributes by other similar methods. This method achieved AUC values of 0.8405/0.8183/0.7383 for serine (S), threonine (T), and tyrosine (Y) phosphorylation sites, respectively, in animals with a tenfold cross-validation. The model trained by the animal phosphorylation sites was also applied to a plant phosphorylation site dataset as an independent test. The AUC values for the independent test dataset were 0.7761/0.6652/0.5958 for S/T/Y phosphorylation sites, which compared favorably with those of several existing methods. A web server based on our method was constructed for public use. The server, trained model, and all datasets used in the current study are available at http://sysbio.unl.edu/PhosphoSVM .


KW - Non-kinase-specific tool
KW - Phosphorylation site prediction
KW - Support vector machine
UR - http://www.scopus.com/inward/record.url?scp=84901445385&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84901445385&partnerID=8YFLogxK
U2 - 10.1007/s00726-014-1711-5
DO - 10.1007/s00726-014-1711-5
M3 - Article
C2 - 24623121
AN - SCOPUS:84901445385
VL - 46
SP - 1459
EP - 1469
JO - Amino Acids
JF - Amino Acids
SN - 0939-4451
IS - 6
ER -