Predicting functional family of novel enzymes irrespective of sequence similarity: A statistical learning approach

L. Y. Han, C. Z. Cai, Z. L. Ji, Z. W. Cao, J. Cui, Y. Z. Chen

Research output: Contribution to journal › Article

74 Citations (Scopus)

Abstract

The function of a protein that has no sequence homolog of known function is difficult to assign on the basis of sequence similarity. The same problem may arise for homologous proteins of different functions if one is newly discovered and the other is the only known protein of similar sequence. It is therefore desirable to explore methods that are not based on sequence similarity. One approach is to assign a protein to a functional family, which provides a useful hint about its function. Several groups have employed a statistical learning method, support vector machines (SVMs), to predict protein functional family directly from sequence, irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a level useful for functional family assignment. However, the capability of SVMs to assign distantly related proteins and homologous proteins of different functions has not been critically and adequately assessed. Here, SVM is tested for functional family assignment on two groups of enzymes. The first consists of 50 enzymes that have no homolog of known function in a PSI-BLAST search of protein databases. The second contains eight pairs of homologous enzymes of different families. SVM correctly assigns 72% of the enzymes in the first group and 62% of the enzyme pairs in the second group, suggesting that it is potentially useful for facilitating functional study of novel proteins. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
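
The abstract does not say which features are derived from each sequence before classification. As a minimal sketch of the general idea of predicting "directly from sequence irrespective of sequence similarity", the Python below maps a protein sequence to a fixed-length amino-acid composition vector, so that proteins of any length can be compared by a classifier such as an SVM without aligning them. The composition descriptor, the function name `composition_vector`, and the example sequences are illustrative assumptions, not the descriptors used by SVMProt.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper for illustration: non-standard characters are ignored.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two made-up sequences of different lengths map to vectors of equal length,
# which is what lets a classifier treat them uniformly without an alignment.
short_vec = composition_vector("MKTAYIAKQR")
long_vec = composition_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(len(short_vec), len(long_vec))  # prints: 20 20
```

Vectors like these, labeled by functional family, could then serve as training input to any SVM implementation; richer descriptors would follow the same fixed-length pattern.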

Original language: English (US)
Pages (from-to): 6437-6444
Number of pages: 8
Journal: Nucleic Acids Research
Volume: 32
Issue number: 21
DOI: 10.1093/nar/gkh984
State: Published - Dec 1 2004

Fingerprint

Learning
Enzymes
Proteins
Protein Databases
Sequence Homology
Software
Support Vector Machine

ASJC Scopus subject areas

  • Genetics

Cite this

Predicting functional family of novel enzymes irrespective of sequence similarity: A statistical learning approach. / Han, L. Y.; Cai, C. Z.; Ji, Z. L.; Cao, Z. W.; Cui, J.; Chen, Y. Z.

In: Nucleic acids research, Vol. 32, No. 21, 01.12.2004, p. 6437-6444.

Han, L. Y.; Cai, C. Z.; Ji, Z. L.; Cao, Z. W.; Cui, J.; Chen, Y. Z. / Predicting functional family of novel enzymes irrespective of sequence similarity: A statistical learning approach. In: Nucleic Acids Research. 2004; Vol. 32, No. 21. pp. 6437-6444.
@article{63c0f5b8a5084382a64b0ed716fbce66,
title = "Predicting functional family of novel enzymes irrespective of sequence similarity: A statistical learning approach",
abstract = "The function of a protein that has no sequence homolog of known function is difficult to assign on the basis of sequence similarity. The same problem may arise for homologous proteins of different functions if one is newly discovered and the other is the only known protein of similar sequence. It is desirable to explore methods that are not based on sequence similarity. One approach is to assign functional family of a protein to provide useful hint about its function. Several groups have employed a statistical learning method, support vector machines (SVMs), for predicting protein functional family directly from sequence irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a level useful for functional family assignment. But its capability for assignment of distantly related proteins and homologous proteins of different functions has not been critically and adequately assessed. Here SVM is tested for functional family assignment of two groups of enzymes. One consists of 50 enzymes that have no homolog of known function from PSI-BLAST search of protein databases. The other contains eight pairs of homologous enzymes of different families. SVM correctly assigns 72{\%} of the enzymes in the first group and 62{\%} of the enzyme pairs in the second group, suggesting that it is potentially useful for facilitating functional study of novel proteins. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.",
author = "Han, {L. Y.} and Cai, {C. Z.} and Ji, {Z. L.} and Cao, {Z. W.} and Cui, J. and Chen, {Y. Z.}",
year = "2004",
month = "12",
day = "1",
doi = "10.1093/nar/gkh984",
language = "English (US)",
volume = "32",
pages = "6437--6444",
journal = "Nucleic Acids Research",
issn = "0305-1048",
publisher = "Oxford University Press",
number = "21",

}

TY - JOUR

T1 - Predicting functional family of novel enzymes irrespective of sequence similarity

T2 - A statistical learning approach

AU - Han, L. Y.

AU - Cai, C. Z.

AU - Ji, Z. L.

AU - Cao, Z. W.

AU - Cui, J.

AU - Chen, Y. Z.

PY - 2004/12/1

Y1 - 2004/12/1

AB - The function of a protein that has no sequence homolog of known function is difficult to assign on the basis of sequence similarity. The same problem may arise for homologous proteins of different functions if one is newly discovered and the other is the only known protein of similar sequence. It is desirable to explore methods that are not based on sequence similarity. One approach is to assign functional family of a protein to provide useful hint about its function. Several groups have employed a statistical learning method, support vector machines (SVMs), for predicting protein functional family directly from sequence irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a level useful for functional family assignment. But its capability for assignment of distantly related proteins and homologous proteins of different functions has not been critically and adequately assessed. Here SVM is tested for functional family assignment of two groups of enzymes. One consists of 50 enzymes that have no homolog of known function from PSI-BLAST search of protein databases. The other contains eight pairs of homologous enzymes of different families. SVM correctly assigns 72% of the enzymes in the first group and 62% of the enzyme pairs in the second group, suggesting that it is potentially useful for facilitating functional study of novel proteins. A web version of our software, SVMProt, is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.

UR - http://www.scopus.com/inward/record.url?scp=13444251475&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=13444251475&partnerID=8YFLogxK

U2 - 10.1093/nar/gkh984

DO - 10.1093/nar/gkh984

M3 - Article

C2 - 15585667

AN - SCOPUS:13444251475

VL - 32

SP - 6437

EP - 6444

JO - Nucleic Acids Research

JF - Nucleic Acids Research

SN - 0305-1048

IS - 21

ER -