The impact of clustering methods for cross-validation, choice of phenotypes, and genotyping strategies on the accuracy of genomic predictions

Johnna L. Baller, Jeremy T. Howard, Stephen D. Kachman, Matthew L. Spangler

Research output: Contribution to journal › Article

Abstract

For genomic predictors to be of use in genetic evaluation, their predicted accuracy must be a reliable indicator of their utility, and thus unbiased. The objective of this paper was to evaluate the accuracy of prediction of genomic breeding values (GBV) using different clustering strategies and response variables. Red Angus genotypes (n = 9,763) were imputed to a reference 50K panel. The influence of clustering method [k-means, k-medoids, principal component (PC) analysis on the numerator relationship matrix (A) and the identical-by-state genomic relationship matrix (G) as both data and covariance matrices, and random] and response variable [deregressed estimated breeding values (DEBV) and adjusted phenotypes] was evaluated for cross-validation. The GBV were estimated using a Bayes C model for all traits. Traits for DEBV included birth weight (BWT), marbling (MARB), rib-eye area (REA), and yearling weight (YWT). Adjusted phenotypes included BWT, YWT, and ultrasonically measured intramuscular fat percentage and REA. Prediction accuracies were estimated as the genetic correlation between GBV and the associated response variable from a bivariate animal model. A simulation mimicking a cattle population, replicated 5 times, was conducted to quantify differences between true and estimated accuracies. The simulation used the same clustering methods and response variables, with the addition of 2 genotyping strategies (random and top 25% of individuals) and forward validation. The prediction accuracies were estimated similarly, and true accuracies were estimated as the correlation between the residuals of a bivariate model including true breeding value (TBV) and GBV. Using the adjusted Rand index, random clusters were clearly different from relationship-based clustering methods. In both real and simulated data, random clustering consistently led to the largest estimates of accuracy, while no method was consistently associated with more or less bias than other methods. In simulation, random genotyping led to higher estimated accuracies than selection of the top 25% of individuals. Interestingly, random genotyping seemed to overpredict true accuracy while selective genotyping tended to underpredict accuracy. When forward-in-time validation was used, DEBV led to less biased estimates of GBV accuracy. Results suggest the highest, least biased GBV accuracies are associated with random genotyping and DEBV.
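The abstract compares cluster assignments with the adjusted Rand index. As a rough illustration of what that index measures, the following sketch computes it for two labelings in plain Python. The function name, the 500-animal family structure, and the random seed are assumptions made for this example only; they are not taken from the paper, and this is not the authors' code.

```python
from collections import Counter
from math import comb
import random


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie form) between two labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items")

    # Number of item pairs that share a cluster in both partitions,
    # in partition A, and in partition B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical data: 500 animals in 5 families of 100 (illustration only).
family = [i // 100 for i in range(500)]

# Random folds of the same sizes, with membership unrelated to family.
rng = random.Random(1)
random_fold = family[:]
rng.shuffle(random_fold)

print(adjusted_rand_index(family, family))       # identical partitions give 1.0
print(adjusted_rand_index(family, random_fold))  # unrelated partitions give a value near 0
```

A value near 1 means two clustering methods group the same animals together; a value near 0 means they agree no more than chance would predict, which is the sense in which random clusters differ from relationship-based ones.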

Original language: English (US)
Pages (from-to): 1534-1549
Number of pages: 16
Journal: Journal of Animal Science
Volume: 97
Issue number: 4
DOI: 10.1093/jas/skz055
State: Published - Apr 3 2019

Fingerprint

breeding value
genotyping
Breeding
Cluster Analysis
Phenotype
genomics
phenotype
prediction
methodology
Ribs
Birth Weight
ribs
yearlings
birth weight
eyes
Red Angus
Weights and Measures
marbling
Principal Component Analysis
intramuscular fat

Keywords

  • beef cattle
  • bias
  • genomic prediction
  • simulation

ASJC Scopus subject areas

  • Food Science
  • Animal Science and Zoology
  • Genetics

Cite this

The impact of clustering methods for cross-validation, choice of phenotypes, and genotyping strategies on the accuracy of genomic predictions. / Baller, Johnna L.; Howard, Jeremy T.; Kachman, Stephen D.; Spangler, Matthew L.

In: Journal of animal science, Vol. 97, No. 4, 03.04.2019, p. 1534-1549.

@article{74fcc48d34f742d58b7848b346165a0d,
title = "The impact of clustering methods for cross-validation, choice of phenotypes, and genotyping strategies on the accuracy of genomic predictions",
keywords = "beef cattle, bias, genomic prediction, simulation",
author = "Baller, {Johnna L.} and Howard, {Jeremy T.} and Kachman, {Stephen D.} and Spangler, {Matthew L.}",
year = "2019",
month = "4",
day = "3",
doi = "10.1093/jas/skz055",
language = "English (US)",
volume = "97",
pages = "1534--1549",
journal = "Journal of Animal Science",
issn = "0021-8812",
publisher = "American Society of Animal Science",
number = "4",

}

TY - JOUR

T1 - The impact of clustering methods for cross-validation, choice of phenotypes, and genotyping strategies on the accuracy of genomic predictions

AU - Baller, Johnna L.

AU - Howard, Jeremy T.

AU - Kachman, Stephen D.

AU - Spangler, Matthew L.

PY - 2019/4/3

Y1 - 2019/4/3

KW - beef cattle

KW - bias

KW - genomic prediction

KW - simulation

UR - http://www.scopus.com/inward/record.url?scp=85064119780&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064119780&partnerID=8YFLogxK

U2 - 10.1093/jas/skz055

DO - 10.1093/jas/skz055

M3 - Article

C2 - 30721970

AN - SCOPUS:85064119780

VL - 97

SP - 1534

EP - 1549

JO - Journal of Animal Science

JF - Journal of Animal Science

SN - 0021-8812

IS - 4

ER -