Model-based clustering with certainty estimation

Implication for clade assignment of influenza viruses

Shunpu Zhang, Zhong Li, Kevin Beland, Guoqing Lu

Research output: Contribution to journal › Article

Abstract

Background: Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. Open issues remain, including how to cluster molecular sequences accurately and, in particular, how to evaluate the certainty of clustering results.

Results: We present a model-based clustering method for analyzing molecular sequences, describe a subset bootstrap scheme for evaluating the certainty of the clusters, and show an intuitive way to examine clusters using 3D visualization. We applied this approach to influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for highly pathogenic H5N1 avian influenza, which agrees with previous findings. The certainty that a given sequence is correctly assigned to a cluster was 1.0 for all sequences, and the certainty for a given cluster was also very high (0.92-1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated, and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The cluster certainties, however, varied from 0.40 to 0.98; this variation is likely attributable to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method are all higher than those calculated with the standard bootstrap method, suggesting that our bootstrap scheme is applicable to the estimation of clustering certainty.

Conclusions: We formulated a clustering analysis approach that includes certainty estimation and 3D visualization of sequence data. We analyzed two sets of influenza A HA sequences, and the results indicate that our approach is applicable to clustering analysis of influenza viral sequences.
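The abstract reports three kinds of certainty (per sequence, per cluster, and overall) but does not spell out how the subset bootstrap computes them. As a rough illustration only, the sketch below estimates such values by reclustering random subsets of the data and checking how often each item lands in a group dominated by its original cluster. Everything here is an assumption for illustration, not the authors' method: a toy k-means stands in for model-based clustering, subsets are drawn without replacement at a fixed fraction, replicate clusters are matched to reference clusters by majority vote, cluster and overall certainty are plain means of the item values, and all function names and parameters are hypothetical.

```python
import random
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Toy k-means (stand-in for model-based clustering); one label per point."""
    srt = sorted(points)
    step = len(srt) / k
    # Deterministic start: evenly spaced picks from the sorted points.
    cents = [srt[int(step * j + step / 2)] for j in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, cents[j])) for p in points]
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else cents[j])
        if new == cents:
            break
        cents = new
    return labels


def clustering_certainty(points, k, n_rep=200, frac=0.8, seed=0):
    """Estimate item, cluster, and overall certainty by reclustering subsets."""
    rng = random.Random(seed)
    ref = kmeans(points, k)              # reference clustering on all data
    n = len(points)
    m = max(k, int(frac * n))            # subset size (assumed fraction)
    hits, seen = [0] * n, [0] * n
    for _ in range(n_rep):
        idx = rng.sample(range(n), m)    # subset drawn without replacement
        rep = kmeans([points[i] for i in idx], k)
        # Match each replicate cluster to the reference label most of its
        # members carry (majority vote; an assumption of this sketch).
        majority = {}
        for c in set(rep):
            votes = Counter(ref[i] for i, r in zip(idx, rep) if r == c)
            majority[c] = votes.most_common(1)[0][0]
        for i, r in zip(idx, rep):
            seen[i] += 1
            hits[i] += majority[r] == ref[i]
    # Items never sampled get None rather than a made-up value.
    item = [h / s if s else None for h, s in zip(hits, seen)]
    by_cluster = {}
    for lab, c in zip(ref, item):
        if c is not None:
            by_cluster.setdefault(lab, []).append(c)
    cluster = {lab: sum(v) / len(v) for lab, v in by_cluster.items()}
    scored = [c for c in item if c is not None]
    return {"item": item, "cluster": cluster, "overall": sum(scored) / len(scored)}


# Synthetic, well-separated 2-D points standing in for embedded sequences.
data_rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(10)]
result = clustering_certainty(points, k=3)
print(result["cluster"], result["overall"])
```

With real sequence data, the points would come from a low-dimensional embedding of pairwise sequence distances (the keywords mention multidimensional scaling), and heterogeneous clusters would be expected to score lower than tight ones, in line with the spread of cluster certainties reported for the H7 data.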

Original language: English (US)
Article number: 287
Journal: BMC Bioinformatics
Volume: 17
Issue number: 1
DOI: 10.1186/s12859-016-1147-x
State: Published - Jul 21 2016

Fingerprint

Model-based Clustering
Influenza
Hemagglutinins
Orthomyxoviridae
Viruses
Virus
Cluster Analysis
Assignment
Visualization
Viral Hemagglutinins
Human Influenza
Clustering
Clustering Analysis
3D Visualization
Bootstrap Method
Bootstrap
Influenza in Birds
Influenza A virus
Sequence Homology
Sequence Analysis

Keywords

  • Bootstrap
  • Certainty
  • Influenza A hemagglutinin (HA)
  • Model-based clustering
  • Multidimensional scaling

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Medicine(all)
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Model-based clustering with certainty estimation: Implication for clade assignment of influenza viruses. / Zhang, Shunpu; Li, Zhong; Beland, Kevin; Lu, Guoqing.

In: BMC Bioinformatics, Vol. 17, No. 1, 287, 21.07.2016.

@article{29115059a9264420968f595f42592c13,
title = "Model-based clustering with certainty estimation: Implication for clade assignment of influenza viruses",
keywords = "Bootstrap, Certainty, Influenza A hemagglutinin (HA), Model-based clustering, Multidimensional scaling",
author = "Shunpu Zhang and Zhong Li and Kevin Beland and Guoqing Lu",
year = "2016",
month = "7",
day = "21",
doi = "10.1186/s12859-016-1147-x",
language = "English (US)",
volume = "17",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Model-based clustering with certainty estimation

T2 - Implication for clade assignment of influenza viruses

AU - Zhang, Shunpu

AU - Li, Zhong

AU - Beland, Kevin

AU - Lu, Guoqing

PY - 2016/7/21

Y1 - 2016/7/21

KW - Bootstrap

KW - Certainty

KW - Influenza A hemagglutinin (HA)

KW - Model-based clustering

KW - Multidimensional scaling

UR - http://www.scopus.com/inward/record.url?scp=84978795065&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84978795065&partnerID=8YFLogxK

U2 - 10.1186/s12859-016-1147-x

DO - 10.1186/s12859-016-1147-x

M3 - Article

VL - 17

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 287

ER -