Biological sequence simulation for testing complex evolutionary hypotheses

Indel-seq-gen version 2.0

Cory L. Strope, Kevin Abel, Stephen D. Scott, Etsuko Moriyama

Research output: Contribution to journal › Article

37 Citations (Scopus)

Abstract

Sequence simulation is an important tool for validating biological hypotheses and for testing bioinformatics and molecular evolutionary methods. Hypothesis testing relies on the representational ability of the sequence simulation method. Simple hypotheses are testable through simulation of random, homogeneously evolving sequence sets. However, testing complex hypotheses, such as those involving local similarities, requires simulation of sequence evolution under heterogeneous models. To this end, we previously introduced indel-Seq-Gen version 1.0 (iSGv1.0; indel, insertion/deletion). iSGv1.0 allowed heterogeneous protein evolution and motif conservation as well as insertion and deletion constraints in subsequences. Despite these advances, neither iSGv1.0 nor other currently available sequence simulation methods are sufficient for complex hypothesis testing. indel-Seq-Gen version 2.0 (iSGv2.0) aims at simulating the evolution of highly divergent DNA sequences and protein superfamilies. iSGv2.0 improves upon iSGv1.0 through the addition of lineage-specific evolution, motif conservation using PROSITE-like regular expressions, indel tracking, subsequence-length constraints, and coding and noncoding DNA evolution. Furthermore, we formalize the sequence representation used for iSGv2.0 and uncover a flaw in the modeling of indels used in current state-of-the-art methods, which biases simulation results for hypotheses involving indels. We fix this flaw in iSGv2.0 by using a novel discrete stepping procedure. Finally, we present an example simulation of the calycin-superfamily sequences and compare the performance of iSGv2.0 with that of iSGv1.0 and a random model of sequence evolution.
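
To illustrate what motif conservation with PROSITE-like regular expressions can look like in practice, the following is a minimal sketch and not the iSGv2.0 implementation. The function names `prosite_to_regex` and `motif_conserved` are hypothetical, and only a subset of PROSITE syntax is handled (residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` anchors). A simulator could use such a check to reject a proposed change that destroys a conserved motif.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; supports only a subset of the syntax.
    """
    pattern = pattern.rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as "(2)" or "(2,4)".
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = match.groups()
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # "[...]" classes and single residues are already valid regex syntax.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def motif_conserved(pattern: str, sequence: str) -> bool:
    """Return True if the sequence still contains a match for the motif."""
    return re.search(prosite_to_regex(pattern), sequence) is not None
```

For example, `motif_conserved("C-x(2)-C", "AACGGCAA")` holds, whereas replacing the second cysteine (`"AACGGAAA"`) breaks the motif, so a simulator built on this check would discard that substitution.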

Original language: English (US)
Pages (from-to): 2581-2593
Number of pages: 13
Journal: Molecular Biology and Evolution
Volume: 26
Issue number: 11
DOI: 10.1093/molbev/msp174
State: Published - Nov 1 2009

Keywords

  • Domains
  • Indels
  • Motifs
  • Protein superfamily
  • Sequence simulation

ASJC Scopus subject areas

  • Medicine(all)
  • Ecology, Evolution, Behavior and Systematics
  • Molecular Biology
  • Genetics

Cite this

Biological sequence simulation for testing complex evolutionary hypotheses: Indel-seq-gen version 2.0. / Strope, Cory L.; Abel, Kevin; Scott, Stephen D.; Moriyama, Etsuko.

In: Molecular Biology and Evolution, Vol. 26, No. 11, 01.11.2009, p. 2581-2593.

@article{f0443bce972e4e0fab9d09e32c15895d,
title = "Biological sequence simulation for testing complex evolutionary hypotheses: Indel-seq-gen version 2.0",
keywords = "Domains, Indels, Motifs, Protein superfamily, Sequence simulation",
author = "Strope, {Cory L.} and Kevin Abel and Scott, {Stephen D.} and Etsuko Moriyama",
year = "2009",
month = "11",
day = "1",
doi = "10.1093/molbev/msp174",
language = "English (US)",
volume = "26",
pages = "2581--2593",
journal = "Molecular Biology and Evolution",
issn = "0737-4038",
publisher = "Oxford University Press",
number = "11",

}

TY - JOUR

T1 - Biological sequence simulation for testing complex evolutionary hypotheses

T2 - Indel-seq-gen version 2.0

AU - Strope, Cory L.

AU - Abel, Kevin

AU - Scott, Stephen D.

AU - Moriyama, Etsuko

PY - 2009/11/1

Y1 - 2009/11/1

KW - Domains

KW - Indels

KW - Motifs

KW - Protein superfamily

KW - Sequence simulation

UR - http://www.scopus.com/inward/record.url?scp=73349096292&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=73349096292&partnerID=8YFLogxK

U2 - 10.1093/molbev/msp174

DO - 10.1093/molbev/msp174

M3 - Article

VL - 26

SP - 2581

EP - 2593

JO - Molecular Biology and Evolution

JF - Molecular Biology and Evolution

SN - 0737-4038

IS - 11

ER -