Alignment behaviors of short peptides provide a roadmap for functional profiling of metagenomic data

Research output: Contribution to journal › Article

Abstract

Background: Functional assignments for short-read metagenomic data pose a significant computational challenge due to perceived unpredictability of alignment behavior and the inability to infer useful functional information from translated protein-fragments/peptides. To address this problem, we have examined the predictability of short peptide alignments by systematically studying alignment behavior of large sets of short peptides generated from well-characterized proteins as well as hypothetical proteins in the KEGG database.

Results: Using test sets of peptides modeling the length and phylogenetic distributions of short-read metagenomic data, we observed that peptides from well-characterized proteins had indistinguishable alignments to proteins from the same orthologous family and proteins from different families. Nonetheless, the patterns contained remarkable phylogenetic and structural signals, with alignments of even very short peptides naturally restricted to their orthologous family and/or proteins having similar structural folds. In stark contrast, peptides from "hypothetical proteins" had only sparse hit patterns with low frequencies and much lower identities. By weighting the structure-driven alignments and filtering peptides with behaviors similar to those derived from "hypothetical proteins", we demonstrate that the accuracy of abundance predictions of protein families is dramatically improved.

Conclusions: Evolutionary processes have dispersed protein folds across multiple protein families, precluding accurate functional assignment to short peptides, whose alignment behavior is non-random and driven by structure. Algorithms that filter sparse peptides and weight hit patterns of peptides from "known space" dramatically improve quantification of functions from diverse mixtures of peptides and should substantially improve applications of metagenomic analyses requiring accurate quantitative measures of functional families.

Original language: English (US)
Article number: 1080
Journal: BMC Genomics
Volume: 16
Issue number: 1
DOI: 10.1186/s12864-015-2272-z
State: Published - Dec 21 2015

Fingerprint

Metagenomics
Peptides
Proteins
Peptide Fragments

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

Alignment behaviors of short peptides provide a roadmap for functional profiling of metagenomic data. / Sinha, Rohita; Clarke, Jennifer; Benson, Andrew K.

In: BMC Genomics, Vol. 16, No. 1, 1080, 21.12.2015.


@article{c5f8d355afbf47da9fdb326e573ffd14,
title = "Alignment behaviors of short peptides provide a roadmap for functional profiling of metagenomic data",
abstract = "Background: Functional assignments for short-read metagenomic data pose a significant computational challenge due to perceived unpredictability of alignment behavior and the inability to infer useful functional information from translated protein-fragments/peptides. To address this problem, we have examined the predictability of short peptide alignments by systematically studying alignment behavior of large sets of short peptides generated from well-characterized proteins as well as hypothetical proteins in the KEGG database. Results: Using test sets of peptides modeling the length and phylogenetic distributions of short-read metagenomic data, we observed that peptides from well-characterized proteins had indistinguishable alignments to proteins from the same orthologous family and proteins from different families. Nonetheless, the patterns contained remarkable phylogenetic and structural signals, with alignments of even very short peptides naturally restricted to their orthologous family and/or proteins having similar structural folds. In stark contrast, peptides from {"}hypothetical proteins{"} had only sparse hit patterns with low frequencies and much lower identities. By weighting the structure-driven alignments and filtering peptides with behaviors similar to those derived from {"}hypothetical proteins{"}, we demonstrate that the accuracy of abundance predictions of protein families is dramatically improved. Conclusions: Evolutionary processes have dispersed protein folds across multiple protein families, precluding accurate functional assignment to short peptides, whose alignment behavior is non-random and driven by structure. Algorithms that filter sparse peptides and weight hit patterns of peptides from {"}known space{"} dramatically improve quantification of functions from diverse mixtures of peptides and should substantially improve applications of metagenomic analyses requiring accurate quantitative measures of functional families.",
author = "Sinha, Rohita and Clarke, Jennifer and Benson, {Andrew K.}",
year = "2015",
month = "12",
day = "21",
doi = "10.1186/s12864-015-2272-z",
language = "English (US)",
volume = "16",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",
}

TY - JOUR

T1 - Alignment behaviors of short peptides provide a roadmap for functional profiling of metagenomic data

AU - Sinha, Rohita

AU - Clarke, Jennifer

AU - Benson, Andrew K.

PY - 2015/12/21

Y1 - 2015/12/21

N2 - Background: Functional assignments for short-read metagenomic data pose a significant computational challenge due to perceived unpredictability of alignment behavior and the inability to infer useful functional information from translated protein-fragments/peptides. To address this problem, we have examined the predictability of short peptide alignments by systematically studying alignment behavior of large sets of short peptides generated from well-characterized proteins as well as hypothetical proteins in the KEGG database. Results: Using test sets of peptides modeling the length and phylogenetic distributions of short-read metagenomic data, we observed that peptides from well-characterized proteins had indistinguishable alignments to proteins from the same orthologous family and proteins from different families. Nonetheless, the patterns contained remarkable phylogenetic and structural signals, with alignments of even very short peptides naturally restricted to their orthologous family and/or proteins having similar structural folds. In stark contrast, peptides from "hypothetical proteins" had only sparse hit patterns with low frequencies and much lower identities. By weighting the structure-driven alignments and filtering peptides with behaviors similar to those derived from "hypothetical proteins", we demonstrate that the accuracy of abundance predictions of protein families is dramatically improved. Conclusions: Evolutionary processes have dispersed protein folds across multiple protein families, precluding accurate functional assignment to short peptides, whose alignment behavior is non-random and driven by structure. Algorithms that filter sparse peptides and weight hit patterns of peptides from "known space" dramatically improve quantification of functions from diverse mixtures of peptides and should substantially improve applications of metagenomic analyses requiring accurate quantitative measures of functional families.


UR - http://www.scopus.com/inward/record.url?scp=84953716701&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84953716701&partnerID=8YFLogxK

U2 - 10.1186/s12864-015-2272-z

DO - 10.1186/s12864-015-2272-z

M3 - Article

C2 - 26691573

AN - SCOPUS:84953716701

VL - 16

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - 1

M1 - 1080

ER -