A Bayesian criterion for cluster stability

Hoyt Koepke, Bertrand S Clarke

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

We present a technique for evaluating and comparing how clusterings reveal structure inherent in the data set. Our technique is based on a criterion evaluating how much point-to-cluster distances may be perturbed without affecting the membership of the points. Although similar to some existing perturbation methods, our approach distinguishes itself in five ways. First, the strength of the perturbations is indexed by a prior distribution controlling how close to boundary regions a point may be before it is considered unstable. Second, our approach is exact in that we integrate over all the perturbations; in practice, this can be done efficiently for well-chosen prior distributions. Third, we provide a rigorous theoretical treatment of the approach, showing that it is consistent for estimating the correct number of clusters. Fourth, it yields a detailed picture of the behavior and structure of the clustering. Finally, it is computationally tractable and easy to use, requiring only a point-to-cluster distance matrix as input. In a simulation study, we show that it outperforms several existing methods in terms of recovering the correct number of clusters. We also illustrate the technique in three real data sets.
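The paper integrates exactly over all perturbations for well-chosen priors; as a rough illustration of the underlying idea only (not the authors' exact computation), a Monte Carlo sketch might perturb each point-to-cluster distance with multiplicative noise drawn from a prior and record how often each point keeps its original cluster membership. The function name `stability_scores` and the log-normal choice of prior below are illustrative assumptions.

```python
import random

def stability_scores(dist, n_draws=2000, sigma=0.3, seed=0):
    """Monte Carlo estimate of per-point assignment stability.

    dist: list of per-point lists of point-to-cluster distances.
    Each draw multiplies every distance by independent log-normal
    noise (a stand-in for the paper's prior over perturbations) and
    checks whether the nearest cluster changes. Returns, per point,
    the fraction of draws that preserve the original membership.
    """
    rng = random.Random(seed)
    base = [row.index(min(row)) for row in dist]   # unperturbed assignments
    hits = [0] * len(dist)
    for _ in range(n_draws):
        for i, row in enumerate(dist):
            noisy = [d * rng.lognormvariate(0.0, sigma) for d in row]
            if noisy.index(min(noisy)) == base[i]:
                hits[i] += 1
    return [h / n_draws for h in hits]

# Toy example: two well-separated points and one near a boundary.
d = [[0.1, 2.0],    # clearly cluster 0
     [2.0, 0.1],    # clearly cluster 1
     [1.0, 1.05]]   # near the decision boundary
scores = stability_scores(d)
```

Points far from any decision boundary score near 1, while the boundary point scores much lower, matching the abstract's notion that a point is unstable when small perturbations can flip its membership.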

Original language: English (US)
Pages (from-to): 346-374
Number of pages: 29
Journal: Statistical Analysis and Data Mining
Volume: 6
Issue number: 4
DOIs: 10.1002/sam.11176
State: Published - Aug 1 2013

Keywords

  • Bayesian
  • Clustering
  • Consistency
  • Heatmap
  • Stability

ASJC Scopus subject areas

  • Analysis
  • Information Systems
  • Computer Science Applications

Cite this

A Bayesian criterion for cluster stability. / Koepke, Hoyt; Clarke, Bertrand S.

In: Statistical Analysis and Data Mining, Vol. 6, No. 4, 01.08.2013, p. 346-374.

@article{4326ed954225454486ac26ac1f575146,
title = "A {Bayesian} criterion for cluster stability",
abstract = "We present a technique for evaluating and comparing how clusterings reveal structure inherent in the data set. Our technique is based on a criterion evaluating how much point-to-cluster distances may be perturbed without affecting the membership of the points. Although similar to some existing perturbation methods, our approach distinguishes itself in five ways. First, the strength of the perturbations is indexed by a prior distribution controlling how close to boundary regions a point may be before it is considered unstable. Second, our approach is exact in that we integrate over all the perturbations; in practice, this can be done efficiently for well-chosen prior distributions. Third, we provide a rigorous theoretical treatment of the approach, showing that it is consistent for estimating the correct number of clusters. Fourth, it yields a detailed picture of the behavior and structure of the clustering. Finally, it is computationally tractable and easy to use, requiring only a point-to-cluster distance matrix as input. In a simulation study, we show that it outperforms several existing methods in terms of recovering the correct number of clusters. We also illustrate the technique in three real data sets.",
keywords = "Bayesian, Clustering, Consistency, Heatmap, Stability",
author = "Koepke, Hoyt and Clarke, {Bertrand S}",
year = "2013",
month = "8",
day = "1",
doi = "10.1002/sam.11176",
language = "English (US)",
volume = "6",
pages = "346--374",
journal = "Statistical Analysis and Data Mining",
issn = "1932-1872",
publisher = "John Wiley and Sons Inc.",
number = "4",
}

TY - JOUR

T1 - A Bayesian criterion for cluster stability

AU - Koepke, Hoyt

AU - Clarke, Bertrand S

PY - 2013/8/1

Y1 - 2013/8/1

AB - We present a technique for evaluating and comparing how clusterings reveal structure inherent in the data set. Our technique is based on a criterion evaluating how much point-to-cluster distances may be perturbed without affecting the membership of the points. Although similar to some existing perturbation methods, our approach distinguishes itself in five ways. First, the strength of the perturbations is indexed by a prior distribution controlling how close to boundary regions a point may be before it is considered unstable. Second, our approach is exact in that we integrate over all the perturbations; in practice, this can be done efficiently for well-chosen prior distributions. Third, we provide a rigorous theoretical treatment of the approach, showing that it is consistent for estimating the correct number of clusters. Fourth, it yields a detailed picture of the behavior and structure of the clustering. Finally, it is computationally tractable and easy to use, requiring only a point-to-cluster distance matrix as input. In a simulation study, we show that it outperforms several existing methods in terms of recovering the correct number of clusters. We also illustrate the technique in three real data sets.

KW - Bayesian

KW - Clustering

KW - Consistency

KW - Heatmap

KW - Stability

UR - http://www.scopus.com/inward/record.url?scp=84880990208&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880990208&partnerID=8YFLogxK

U2 - 10.1002/sam.11176

DO - 10.1002/sam.11176

M3 - Article

VL - 6

SP - 346

EP - 374

JO - Statistical Analysis and Data Mining

JF - Statistical Analysis and Data Mining

SN - 1932-1872

IS - 4

ER -