Classification of breast cancer patients using somatic mutation profiles and machine learning approaches

Suleyman Vural, Xiaosheng Wang, Chittibabu Guda

Research output: Contribution to journal (Article)

14 Citations (Scopus)

Abstract

Background: The high degree of heterogeneity observed in breast cancers makes it very difficult to classify patients into distinct clinical subgroups and consequently limits the ability to devise effective therapeutic strategies. Several classification strategies based on ER/PR/HER2 expression or on the expression profiles of a panel of genes have helped, but such methods often produce misleading results because gene expression is dynamic. In contrast, somatic DNA mutations are relatively stable and lead to the initiation and progression of many sporadic cancers. Hence, in this study we explore the use of gene mutation profiles to classify, characterize and predict the subgroups of breast cancers.

Results: We analyzed the whole exome sequencing data from 358 ethnically similar breast cancer patients in The Cancer Genome Atlas (TCGA) project. Somatic and non-synonymous single nucleotide variants identified from each patient were assigned a quantitative score (C-score) that represents the extent of negative impact on the gene function. Using these scores with the non-negative matrix factorization method, we clustered the patients into three subgroups. By comparing the clinical stage of patients, we identified an early-stage-enriched and a late-stage-enriched subgroup. Comparison of the mutation scores of the early- and late-stage-enriched subgroups identified 358 genes that carry significantly higher mutation rates in the late-stage-enriched subgroup. Functional characterization of these genes revealed important functional gene families that carry a heavy mutational load in the late-stage-enriched subgroup of patients. Finally, using the identified subgroups, we also developed a supervised classification model to predict the stage of the patients.

Conclusions: This study demonstrates that gene mutation profiles can be effectively used with unsupervised machine-learning methods to identify clinically distinguishable breast cancer subgroups. The classification model developed in this study could provide a reasonable prediction of a cancer patient's stage based solely on their mutation profile. This study represents the first use of only somatic mutation profile data to identify and predict breast cancer subgroups, and this generic methodology can also be applied to other cancer datasets.
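The clustering step described in the Results can be illustrated with a small sketch. This is not the authors' implementation: the patient-by-gene score matrix below is synthetic, the function name `nmf_cluster` is hypothetical, the factorization uses plain multiplicative updates, and assigning each patient to the factor with the largest loading is one common convention assumed here for illustration.

```python
import numpy as np


def nmf_cluster(X, k, n_iter=200, seed=0):
    """Factor a non-negative matrix X (patients x genes) as W @ H.

    Uses multiplicative updates that minimise the Frobenius error.
    Returns W, H, a cluster label per patient, and the error history.
    """
    rng = np.random.default_rng(seed)
    n_patients, n_genes = X.shape
    # Strictly positive random initialisation keeps the updates well defined.
    W = rng.random((n_patients, k)) + 1e-3
    H = rng.random((k, n_genes)) + 1e-3
    eps = 1e-9  # guards against division by zero
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(float(np.linalg.norm(X - W @ H)))
    # Assumption: a patient belongs to the factor with the largest loading.
    labels = W.argmax(axis=1)
    return W, H, labels, errors


# Synthetic stand-in for a C-score matrix: 30 patients, 40 genes,
# with roughly 20% of entries carrying a positive score (assumed values).
rng = np.random.default_rng(42)
scores = rng.random((30, 40)) * 10.0
mutated = rng.random((30, 40)) < 0.2
X = np.where(mutated, scores, 0.0)

W, H, labels, errors = nmf_cluster(X, k=3)
print("patients per subgroup:", np.bincount(labels, minlength=3))
```

With real data, the resulting subgroup labels could then be cross-tabulated against clinical stage to look for early- or late-stage enrichment, as the abstract describes.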

Original language: English (US)
Article number: 62
Journal: BMC Systems Biology
Volume: 10
DOI: 10.1186/s12918-016-0306-z
State: Published - Aug 26 2016

Fingerprint

Breast Cancer
Learning systems
Machine Learning
Mutation
Genes
Subgroup
Breast Neoplasms
Gene
Cancer
Neoplasms
Predict
Exome
Classify
Profile
Atlases
Mutation Rate
Nucleotides
Factorization
Non-negative Matrix Factorization
Factorization Method

Keywords

  • Breast cancer classification
  • Breast cancer subtypes
  • Cancer stage prediction
  • Gene mutation profiles
  • TCGA
  • Unsupervised and supervised machine learning
  • Whole exome sequencing data analysis

ASJC Scopus subject areas

  • Structural Biology
  • Modeling and Simulation
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Classification of breast cancer patients using somatic mutation profiles and machine learning approaches. / Vural, Suleyman; Wang, Xiaosheng; Guda, Chittibabu.

In: BMC Systems Biology, Vol. 10, 62, 26.08.2016.

@article{f95991e5ba2e41f59d5638abf95484cf,
title = "Classification of breast cancer patients using somatic mutation profiles and machine learning approaches",
abstract = "Background: The high degree of heterogeneity observed in breast cancers makes it very difficult to classify patients into distinct clinical subgroups and consequently limits the ability to devise effective therapeutic strategies. Several classification strategies based on ER/PR/HER2 expression or on the expression profiles of a panel of genes have helped, but such methods often produce misleading results because gene expression is dynamic. In contrast, somatic DNA mutations are relatively stable and lead to the initiation and progression of many sporadic cancers. Hence, in this study we explore the use of gene mutation profiles to classify, characterize and predict the subgroups of breast cancers. Results: We analyzed the whole exome sequencing data from 358 ethnically similar breast cancer patients in The Cancer Genome Atlas (TCGA) project. Somatic and non-synonymous single nucleotide variants identified from each patient were assigned a quantitative score (C-score) that represents the extent of negative impact on the gene function. Using these scores with the non-negative matrix factorization method, we clustered the patients into three subgroups. By comparing the clinical stage of patients, we identified an early-stage-enriched and a late-stage-enriched subgroup. Comparison of the mutation scores of the early- and late-stage-enriched subgroups identified 358 genes that carry significantly higher mutation rates in the late-stage-enriched subgroup. Functional characterization of these genes revealed important functional gene families that carry a heavy mutational load in the late-stage-enriched subgroup of patients. Finally, using the identified subgroups, we also developed a supervised classification model to predict the stage of the patients. Conclusions: This study demonstrates that gene mutation profiles can be effectively used with unsupervised machine-learning methods to identify clinically distinguishable breast cancer subgroups. The classification model developed in this study could provide a reasonable prediction of a cancer patient's stage based solely on their mutation profile. This study represents the first use of only somatic mutation profile data to identify and predict breast cancer subgroups, and this generic methodology can also be applied to other cancer datasets.",
keywords = "Breast cancer classification, Breast cancer subtypes, Cancer stage prediction, Gene mutation profiles, TCGA, Unsupervised and supervised machine learning, Whole exome sequencing data analysis",
author = "Suleyman Vural and Xiaosheng Wang and Chittibabu Guda",
year = "2016",
month = "8",
day = "26",
doi = "10.1186/s12918-016-0306-z",
language = "English (US)",
volume = "10",
journal = "BMC Systems Biology",
issn = "1752-0509",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Classification of breast cancer patients using somatic mutation profiles and machine learning approaches

AU - Vural, Suleyman

AU - Wang, Xiaosheng

AU - Guda, Chittibabu

PY - 2016/8/26

Y1 - 2016/8/26

AB - Background: The high degree of heterogeneity observed in breast cancers makes it very difficult to classify patients into distinct clinical subgroups and consequently limits the ability to devise effective therapeutic strategies. Several classification strategies based on ER/PR/HER2 expression or on the expression profiles of a panel of genes have helped, but such methods often produce misleading results because gene expression is dynamic. In contrast, somatic DNA mutations are relatively stable and lead to the initiation and progression of many sporadic cancers. Hence, in this study we explore the use of gene mutation profiles to classify, characterize and predict the subgroups of breast cancers. Results: We analyzed the whole exome sequencing data from 358 ethnically similar breast cancer patients in The Cancer Genome Atlas (TCGA) project. Somatic and non-synonymous single nucleotide variants identified from each patient were assigned a quantitative score (C-score) that represents the extent of negative impact on the gene function. Using these scores with the non-negative matrix factorization method, we clustered the patients into three subgroups. By comparing the clinical stage of patients, we identified an early-stage-enriched and a late-stage-enriched subgroup. Comparison of the mutation scores of the early- and late-stage-enriched subgroups identified 358 genes that carry significantly higher mutation rates in the late-stage-enriched subgroup. Functional characterization of these genes revealed important functional gene families that carry a heavy mutational load in the late-stage-enriched subgroup of patients. Finally, using the identified subgroups, we also developed a supervised classification model to predict the stage of the patients. Conclusions: This study demonstrates that gene mutation profiles can be effectively used with unsupervised machine-learning methods to identify clinically distinguishable breast cancer subgroups. The classification model developed in this study could provide a reasonable prediction of a cancer patient's stage based solely on their mutation profile. This study represents the first use of only somatic mutation profile data to identify and predict breast cancer subgroups, and this generic methodology can also be applied to other cancer datasets.

KW - Breast cancer classification

KW - Breast cancer subtypes

KW - Cancer stage prediction

KW - Gene mutation profiles

KW - TCGA

KW - Unsupervised and supervised machine learning

KW - Whole exome sequencing data analysis

UR - http://www.scopus.com/inward/record.url?scp=84983631327&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84983631327&partnerID=8YFLogxK

U2 - 10.1186/s12918-016-0306-z

DO - 10.1186/s12918-016-0306-z

M3 - Article

VL - 10

JO - BMC Systems Biology

JF - BMC Systems Biology

SN - 1752-0509

M1 - 62

ER -