Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis

George K. Acquaah-Mensah, Sonia M. Leach, Chittibabu Guda

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Naïve Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.
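As a rough illustration of the kind of sequence-derived features and evaluation measures the abstract mentions, the sketch below computes the proportion of hydrophobic, neutral, and polar residues in a sequence, and a one-vs-rest precision and sensitivity for a single compartment. The three-way residue grouping, the function names, and the example labels are assumptions made for illustration; the abstract does not specify the grouping or the exact feature vectors the authors used, and this is not their C4.5 pipeline.

```python
from collections import Counter

# Assumed three-way hydrophobicity grouping of the 20 standard amino acids.
# Illustrative only: the abstract does not state which grouping was used.
GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def group_proportions(sequence):
    """Return the fraction of residues falling in each assumed group.

    Letters outside the 20 standard residues (e.g. X) count toward the
    sequence length but toward no group, so fractions may sum to less than 1.
    """
    residues = [r for r in sequence.upper() if r.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    counts = Counter(residues)
    total = len(residues)
    return {
        name: sum(counts[r] for r in members) / total
        for name, members in GROUPS.items()
    }


def precision_and_sensitivity(true_labels, predicted_labels, compartment):
    """One-vs-rest precision and sensitivity (recall) for one compartment."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == compartment and p == compartment for t, p in pairs)
    fp = sum(t != compartment and p == compartment for t, p in pairs)
    fn = sum(t == compartment and p != compartment for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity
```

Proportions like these could serve as a few entries of a per-protein feature vector passed to any classifier; averaging the per-compartment precision and sensitivity over all compartments gives summary figures of the kind reported in the abstract.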

Original language: English (US)
Pages (from-to): 120-133
Number of pages: 14
Journal: Genomics, Proteomics and Bioinformatics
Volume: 4
Issue number: 2
DOI: 10.1016/S1672-0229(06)60023-5
State: Published - May 1 2006

Fingerprint

Exploratory Data Analysis
Learning systems
Machine Learning
Amino acids
Proteins
Protein
Amino Acid Sequence
Amino Acids
Proportion
Neutral Amino Acids
Classifiers
Bayes Classifier
Molecular Sequence Annotation
Human
Decision Trees
Protein Databases
Neural Networks (Computer)
Plasma Membrane
Mitochondrial Proteins
Tree Algorithms

Keywords

  • Decision Tree
  • Exploratory Data Analysis
  • Machine Learning
  • subcellular localization

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Genetics
  • Computational Mathematics

Cite this

Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis. / Acquaah-Mensah, George K.; Leach, Sonia M.; Guda, Chittibabu.

In: Genomics, Proteomics and Bioinformatics, Vol. 4, No. 2, 01.05.2006, p. 120-133.

@article{f1cdcdba423f4cfdbc2d3549d6eb90c9,
title = "Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis",
abstract = "Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Na{\"i}ve Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.",
keywords = "Decision Tree, Exploratory Data Analysis, Machine Learning, subcellular localization",
author = "Acquaah-Mensah, {George K.} and Leach, {Sonia M.} and Guda, Chittibabu",
year = "2006",
month = "5",
day = "1",
doi = "10.1016/S1672-0229(06)60023-5",
language = "English (US)",
volume = "4",
pages = "120--133",
journal = "Genomics Proteomics Bioinformatics",
issn = "1672-0229",
publisher = "Beijing Genomics Institute",
number = "2",

}

TY - JOUR

T1 - Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis

AU - Acquaah-Mensah, George K.

AU - Leach, Sonia M.

AU - Guda, Chittibabu

PY - 2006/5/1

Y1 - 2006/5/1

AB - Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Naïve Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.

KW - Decision Tree

KW - Exploratory Data Analysis

KW - Machine Learning

KW - subcellular localization

UR - http://www.scopus.com/inward/record.url?scp=33747365645&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33747365645&partnerID=8YFLogxK

U2 - 10.1016/S1672-0229(06)60023-5

DO - 10.1016/S1672-0229(06)60023-5

M3 - Article

VL - 4

SP - 120

EP - 133

JO - Genomics Proteomics Bioinformatics

JF - Genomics Proteomics Bioinformatics

SN - 1672-0229

IS - 2

ER -