On the limits of clustering in high dimensions via cost functions

Hoyt A. Koepke, Bertrand S. Clarke

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

This paper establishes a negative result for clustering: above a certain ratio of random noise to nonrandom information, it is impossible for a large class of cost functions to distinguish between two partitions of a data set. In particular, it is shown that as the dimension increases, the ability to distinguish an accurate partitioning from an inaccurate one is lost unless the informative components are both sufficiently numerous and sufficiently informative. We examine squared-error cost functions in detail. More generally, it is seen that the VC-dimension is an essential hypothesis for the class of cost functions to satisfy for an impossibility proof to be feasible. Separately, we provide bounds on the probabilistic behavior of cost functions that show how rapidly the ability to distinguish two clusterings decays. In two examples, one simulated and one with genomic data, bounds on the ability of squared-error and other cost functions to distinguish between two partitions are computed. Thus, one should not rely on clustering results alone for high-dimensional, low-sample-size data, and one should do feature selection.
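The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method and does not reproduce its bounds; it only assumes a toy setting (two Gaussian clusters, a few informative coordinates, and a growing number of pure-noise coordinates) and compares the squared-error cost of the true partition with that of a randomly shuffled one. The function names, the shift of 2.0, and the sample size of 40 are all illustrative assumptions.

```python
import random


def squared_error_cost(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        cost += sum((p[j] - centroid[j]) ** 2 for p in members for j in range(dim))
    return cost


def relative_gap(n_noise_dims, n_points=40, n_info_dims=2, shift=2.0, seed=0):
    """Relative cost difference between a shuffled and the true partition.

    Assumed toy model: informative coordinates are N(+shift, 1) or
    N(-shift, 1) depending on the cluster; noise coordinates are N(0, 1).
    """
    rng = random.Random(seed)
    true_labels = [i % 2 for i in range(n_points)]
    points = []
    for label in true_labels:
        center = shift if label == 1 else -shift
        info = [rng.gauss(center, 1.0) for _ in range(n_info_dims)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise_dims)]
        points.append(info + noise)
    # An "inaccurate" partition: same cluster sizes, labels shuffled.
    wrong_labels = true_labels[:]
    rng.shuffle(wrong_labels)
    cost_true = squared_error_cost(points, true_labels)
    cost_wrong = squared_error_cost(points, wrong_labels)
    return (cost_wrong - cost_true) / cost_true


noise_dims = [0, 20, 200, 2000]
gaps = [relative_gap(d) for d in noise_dims]
for d, g in zip(noise_dims, gaps):
    print(f"noise dims = {d:5d}   relative cost gap = {g:.4f}")
```

In this toy model the relative gap shrinks roughly in proportion to the share of informative coordinates, so with enough noise coordinates the two partitions have nearly the same cost. That is consistent with the abstract's recommendation to do feature selection before clustering such data.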

Original language: English (US)
Pages (from-to): 30-53
Number of pages: 24
Journal: Statistical Analysis and Data Mining
Volume: 4
Issue number: 1
DOI: 10.1002/sam.10095
State: Published - Feb 1 2011

Fingerprint

Dimension Function
Cost functions
Higher Dimensions
Cost Function
Clustering
Partition
VC Dimension
Random Noise
Error function
Inaccurate
Feature Selection
Genomics
Feature extraction
Partitioning
Sample Size
High-dimensional
Decay

Keywords

  • Clustering impossibility
  • Cost function
  • High dimensions
  • VC-dimension

ASJC Scopus subject areas

  • Analysis
  • Information Systems
  • Computer Science Applications

Cite this

On the limits of clustering in high dimensions via cost functions. / Koepke, Hoyt A.; Clarke, Bertrand S.

In: Statistical Analysis and Data Mining, Vol. 4, No. 1, 01.02.2011, p. 30-53.

@article{f84250996d6e45d883ed219e06d80289,
title = "On the limits of clustering in high dimensions via cost functions",
abstract = "This paper establishes a negative result for clustering: above a certain ratio of random noise to nonrandom information, it is impossible for a large class of cost functions to distinguish between two partitions of a data set. In particular, it is shown that as the dimension increases, the ability to distinguish an accurate partitioning from an inaccurate one is lost unless the informative components are both sufficiently numerous and sufficiently informative. We examine squared error cost functions in detail. More generally, it is seen that the VC-dimension is an essential hypothesis for the class of cost functions to satisfy for an impossibility proof to be feasible. Separately, we provide bounds on the probabilistic behavior of cost functions that show how rapidly the ability to distinguish two clusterings decays. In two examples, one simulated and one with genomic data, bounds on the ability of squared-error and other cost functions to distinguish between two partitions are computed. Thus, one should not rely on clustering results alone for high dimensional low sample size data and one should do feature selection.",
keywords = "Clustering impossibility, Cost function, High dimensions, VC-dimension",
author = "Koepke, {Hoyt A.} and Clarke, {Bertrand S.}",
year = "2011",
month = "2",
day = "1",
doi = "10.1002/sam.10095",
language = "English (US)",
volume = "4",
pages = "30--53",
journal = "Statistical Analysis and Data Mining",
issn = "1932-1872",
publisher = "John Wiley and Sons Inc.",
number = "1",

}

TY - JOUR

T1 - On the limits of clustering in high dimensions via cost functions

AU - Koepke, Hoyt A.

AU - Clarke, Bertrand S.

PY - 2011/2/1

Y1 - 2011/2/1

AB - This paper establishes a negative result for clustering: above a certain ratio of random noise to nonrandom information, it is impossible for a large class of cost functions to distinguish between two partitions of a data set. In particular, it is shown that as the dimension increases, the ability to distinguish an accurate partitioning from an inaccurate one is lost unless the informative components are both sufficiently numerous and sufficiently informative. We examine squared error cost functions in detail. More generally, it is seen that the VC-dimension is an essential hypothesis for the class of cost functions to satisfy for an impossibility proof to be feasible. Separately, we provide bounds on the probabilistic behavior of cost functions that show how rapidly the ability to distinguish two clusterings decays. In two examples, one simulated and one with genomic data, bounds on the ability of squared-error and other cost functions to distinguish between two partitions are computed. Thus, one should not rely on clustering results alone for high dimensional low sample size data and one should do feature selection.

KW - Clustering impossibility

KW - Cost function

KW - High dimensions

KW - VC-dimension

UR - http://www.scopus.com/inward/record.url?scp=79551701149&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79551701149&partnerID=8YFLogxK

U2 - 10.1002/sam.10095

DO - 10.1002/sam.10095

M3 - Article

AN - SCOPUS:79551701149

VL - 4

SP - 30

EP - 53

JO - Statistical Analysis and Data Mining

JF - Statistical Analysis and Data Mining

SN - 1932-1872

IS - 1

ER -