Optimal data partitioning and a test case for ray-finned fishes (Actinopterygii) based on ten nuclear loci

Chenhong Li, Guoqing Lu, Guillermo Ortí

Research output: Contribution to journal (Article)

158 Citations (Scopus)

Abstract

Data partitioning, the combined phylogenetic analysis of homogeneous blocks of data, is a common strategy used to accommodate heterogeneities in complex multilocus data sets. Variation in evolutionary rates and substitution patterns among sites is typically addressed by partitioning data by gene, codon position, or both. Excessive partitioning of the data, however, could lead to overparameterization; therefore, it seems critical to define the minimum number of partitions necessary to improve the overall fit of the model. We propose a new method, based on cluster analysis, to find an optimal partitioning strategy for multilocus protein-coding data sets. A heuristic exploration of alternative partitioning schemes, based on Bayesian and maximum likelihood (ML) criteria, is shown here to produce an optimal number of partitions. We tested this method using sequence data of 10 nuclear genes collected from 52 ray-finned fish (Actinopterygii) and four tetrapods. The concatenated sequences included 7995 nucleotide sites maximally split into 30 partitions defined a priori based on gene and codon position. Our results show that a model based on only 10 partitions defined by cluster analysis performed better than partitioning by both gene and codon position. Alternative data partitioning schemes are also shown to affect the topologies resulting from phylogenetic analysis, especially when Bayesian methods are used, suggesting that overpartitioning may be of major concern. The phylogenetic relationships among the major clades of ray-finned fish were assessed using the best data-partitioning schemes under ML and Bayesian methods. Some significant results include the monophyly of "Holostei" (Amia and Lepisosteus), the sister-group relationships between (1) esociforms and salmoniforms and (2) osmeriforms and stomiiforms, the polyphyly of Perciformes, and a close relationship of cichlids and atherinomorphs.
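The partitioning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which quantities are clustered, which linkage rule is used, or how schemes are scored. The gene names, the per-block rate values, the choice of average linkage, and the use of BIC as the selection criterion are all assumptions made here for illustration only.

```python
import math
from itertools import product


def a_priori_blocks(genes, codon_positions=(1, 2, 3)):
    """Label every gene-by-codon-position block, e.g. 'gene01_pos1'."""
    return [f"{g}_pos{p}" for g, p in product(genes, codon_positions)]


def euclidean(a, b):
    """Euclidean distance between two equal-length parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nested_schemes(features):
    """Average-linkage agglomerative clustering (an assumed choice).

    `features` maps a block label to a tuple of parameter estimates.
    Returns a list of schemes; scheme i holds len(features) - i clusters.
    """
    clusters = [[name] for name in sorted(features)]
    schemes = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(
                    euclidean(features[a], features[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        schemes.append([list(c) for c in clusters])
    return schemes


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion; lower values indicate better fit."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


# Hypothetical inputs: 10 placeholder gene names and made-up relative rates
# that differ mainly by codon position (NOT values from the paper).
genes = [f"gene{i:02d}" for i in range(1, 11)]
blocks = a_priori_blocks(genes)  # 10 genes x 3 codon positions = 30 blocks
base_rate = {1: 0.5, 2: 0.3, 3: 2.0}
features = {
    f"{g}_pos{p}": (base_rate[p] + 0.01 * k,)
    for k, g in enumerate(genes)
    for p in (1, 2, 3)
}
schemes = nested_schemes(features)
ten_partition_scheme = schemes[len(blocks) - 10]  # scheme with 10 clusters
```

In a real analysis each candidate scheme would be fitted with phylogenetic software and the resulting log-likelihood and parameter count passed to a criterion such as `bic`, with the 7995 sites of the concatenated alignment as `n_sites`. The sketch only produces the nested candidate schemes, from the 30 a priori blocks down to a single partition.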

Original language: English (US)
Pages (from-to): 519-539
Number of pages: 21
Journal: Systematic Biology
Volume: 57
Issue number: 4
DOI: 10.1080/10635150802206883
State: Published - Aug 1 2008

Keywords

  • Actinopterygii
  • Cluster analysis
  • Data partitioning
  • Holostei
  • Nuclear loci
  • Phylogenetics
  • Ray-finned fish

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Genetics

Cite this

Optimal data partitioning and a test case for ray-finned fishes (Actinopterygii) based on ten nuclear loci. / Li, Chenhong; Lu, Guoqing; Ortí, Guillermo.

In: Systematic Biology, Vol. 57, No. 4, 01.08.2008, p. 519-539.

@article{26f71b29349c43df96b1ee19a19cf4e1,
title = "Optimal data partitioning and a test case for ray-finned fishes (Actinopterygii) based on ten nuclear loci",
keywords = "Actinopterygii, Cluster analysis, Data partitioning, Holostei, Nuclear loci, Phylogenetics, Ray-finned fish",
author = "Chenhong Li and Guoqing Lu and Guillermo Ort{\'i}",
year = "2008",
month = "8",
day = "1",
doi = "10.1080/10635150802206883",
language = "English (US)",
volume = "57",
pages = "519--539",
journal = "Systematic Biology",
issn = "1063-5157",
publisher = "Oxford University Press",
number = "4",

}

TY - JOUR

T1 - Optimal data partitioning and a test case for ray-finned fishes (Actinopterygii) based on ten nuclear loci

AU - Li, Chenhong

AU - Lu, Guoqing

AU - Ortí, Guillermo

PY - 2008/8/1

Y1 - 2008/8/1

KW - Actinopterygii

KW - Cluster analysis

KW - Data partitioning

KW - Holostei

KW - Nuclear loci

KW - Phylogenetics

KW - Ray-finned fish

UR - http://www.scopus.com/inward/record.url?scp=47249090871&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=47249090871&partnerID=8YFLogxK

U2 - 10.1080/10635150802206883

DO - 10.1080/10635150802206883

M3 - Article

C2 - 18622808

AN - SCOPUS:47249090871

VL - 57

SP - 519

EP - 539

JO - Systematic Biology

JF - Systematic Biology

SN - 1063-5157

IS - 4

ER -