Regular, median and Huber cross-validation: A computational comparison

Chi Wai Yu, Bertrand Clarke

Research output: Contribution to journal › Article

Abstract

We present a new technique for comparing models using a median form of cross-validation and least median of squares estimation (MCV-LMS). Rather than minimizing the sum of the squared residual errors, we minimize their median. We compare this with a robustified form of cross-validation that uses the Huber loss function and robust coefficient estimators (HCV). Through extensive simulations, we find that for linear models MCV-LMS outperforms HCV on data representative of the data generator when the tails of the noise distribution are sufficiently heavy and asymmetric. We also find that MCV-LMS is often better able to detect the presence of small terms. Otherwise, HCV typically outperforms MCV-LMS on 'good' data. MCV-LMS also outperforms HCV in the presence of enough severe outliers. Either MCV-LMS or HCV also generally gives better model selection for linear models than the conventional version of cross-validation with least squares estimators (CV-LS) when the tails of the noise distribution are heavy or asymmetric, or when the coefficients are small and the data are representative. CV-LS performs well only when the tails of the error distribution are light and symmetric and the coefficients are large relative to the noise variance. Outside of these contexts and the contexts noted above, HCV outperforms both CV-LS and MCV-LMS. We illustrate CV-LS, HCV, and MCV-LMS via numerous simulations to map out when each does best on representative data, and then apply all three to a real dataset from econometrics that includes outliers.
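To make the three scoring rules concrete, here is a minimal Python sketch (not the authors' code) contrasting them on held-out folds: CV-LS scores a fold by the mean squared residual, MCV by the median squared residual, and HCV by the mean Huber loss. For brevity it fits every candidate model by ordinary least squares, whereas the paper pairs MCV with least median of squares estimation and HCV with robust coefficient estimators; the function names, fold scheme, and the Huber tuning constant c = 1.345 are illustrative assumptions.

import numpy as np

def huber_loss(r, c=1.345):
    """Huber loss; the tuning constant c is an assumed conventional value."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def cv_score(X, y, cols, rule="ls", k=5, seed=0):
    """K-fold CV score of the submodel using columns `cols`.
    rule: 'ls'    -> mean squared residual (CV-LS)
          'med'   -> median squared residual (MCV; LMS fitting omitted)
          'huber' -> mean Huber loss (HCV; robust fitting omitted)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Xtr, Xte = X[train][:, cols], X[fold][:, cols]
        # Simplification: OLS fit regardless of the scoring rule.
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        r = y[fold] - Xte @ beta
        if rule == "ls":
            scores.append(np.mean(r**2))
        elif rule == "med":
            scores.append(np.median(r**2))
        else:
            scores.append(np.mean(huber_loss(r)))
    return float(np.mean(scores))

# Toy use: compare a 2-term and a 3-term linear model under heavy-tailed
# (t with 2 df) noise; lower score is better under each rule.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.standard_t(df=2, size=n)
for rule in ("ls", "med", "huber"):
    print(rule, cv_score(X, y, [0, 1], rule), cv_score(X, y, [0, 1, 2], rule))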

Original language: English (US)
Pages (from-to): 14-33
Number of pages: 20
Journal: Statistical Analysis and Data Mining
Volume: 8
Issue number: 1
DOI: 10.1002/sam.11254
State: Published - Feb 1, 2015

Keywords

  • Cross-validation
  • Heavy-tailed errors
  • Model selection
  • Outliers
  • Robustness
  • Skewness
  • Sparsity

ASJC Scopus subject areas

  • Analysis
  • Information Systems
  • Computer Science Applications

Cite this

Yu, C. W., & Clarke, B. (2015). Regular, median and Huber cross-validation: A computational comparison. Statistical Analysis and Data Mining, 8(1), 14-33. https://doi.org/10.1002/sam.11254