Regular, median and Huber cross-validation: A computational comparison

Chi Wai Yu, Bertrand Clarke

Research output: Contribution to journal, Article

Abstract

We present a new technique for comparing models using a median form of cross-validation and least median of squares estimation (MCV-LMS). Rather than minimizing the sum of squares of residual errors, we minimize the median of the squared residual errors. We compare this with a robustified form of cross-validation using the Huber loss function and robust coefficient estimators (HCV). Through extensive simulations we find that for linear models MCV-LMS outperforms HCV for data that is representative of the data generator when the tails of the noise distribution are heavy enough and asymmetric enough. We also find that MCV-LMS is often better able to detect the presence of small terms. Otherwise, HCV typically outperforms MCV-LMS for 'good' data. MCV-LMS also outperforms HCV in the presence of enough severe outliers. One of MCV-LMS and HCV also generally gives better model selection for linear models than the conventional version of cross-validation with least squares estimators (CV-LS) when the tails of the noise distribution are heavy or asymmetric or when the coefficients are small and the data is representative. CV-LS only performs well when the tails of the error distribution are light and symmetric and the coefficients are large relative to the noise variance. Outside of these contexts and the contexts noted above, HCV outperforms CV-LS and MCV-LMS. We illustrate CV-LS, HCV, and MCV-LMS via numerous simulations to map out when each does best on representative data and then apply all three to a real dataset from econometrics that includes outliers. Statistical Analysis and Data Mining published by Wiley Periodicals, Inc.
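
The three criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the k-fold splitting, the random-subset approximation to least median of squares, the iteratively reweighted least squares fit for the Huber estimator, the tuning constant 1.345, and all function names (`fit_lms`, `fit_huber`, `cv_score`) are assumptions made for illustration only.

```python
import numpy as np

def fit_ls(X, y):
    # Ordinary least squares coefficients.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_lms(X, y, n_trials=500, seed=0):
    # Approximate least median of squares (assumed scheme): fit through
    # random p-point subsets, keep the fit with the smallest median
    # squared residual over all points.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta

def huber_loss(r, delta=1.345):
    # Quadratic for small residuals, linear for large ones.
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def fit_huber(X, y, delta=1.345, n_iter=50):
    # Huber M-estimate via iteratively reweighted least squares,
    # with a MAD-based residual scale (an assumed choice).
    beta = fit_ls(X, y)
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r / scale)
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def cv_score(X, y, fit, loss, aggregate, k=5, seed=0):
    # k-fold cross-validation: fit on the training folds, apply `loss`
    # to held-out residuals, then combine all losses with `aggregate`.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    losses = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta = fit(X[train], y[train])
        losses.append(loss(y[test] - X[test] @ beta))
    return aggregate(np.concatenate(losses))

# Synthetic data with a few gross outliers (illustrative only).
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(80), rng.normal(size=80)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=80)
y[:8] += 40.0

print("CV-LS  :", cv_score(X, y, fit_ls, np.square, np.mean))
print("MCV-LMS:", cv_score(X, y, fit_lms, np.square, np.median))
print("HCV    :", cv_score(X, y, fit_huber, huber_loss, np.mean))
```

In a model selection setting, each candidate set of columns of `X` would be scored with the same criterion and the lowest-scoring candidate kept. Scores from different criteria are on different scales, so they are compared across models within one criterion, not across criteria.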

Original language: English (US)
Pages (from-to): 14-33
Number of pages: 20
Journal: Statistical Analysis and Data Mining
Volume: 8
Issue number: 1
Publication status: Published - Feb 1 2015

Keywords

  • Cross-validation
  • Heavy-tailed errors
  • Model selection
  • Outliers
  • Robustness
  • Skewness
  • Sparsity

ASJC Scopus subject areas

  • Analysis
  • Information Systems
  • Computer Science Applications
