A General Hybrid Clustering Technique

Saeid Amiri, Bertrand S. Clarke, Jennifer L. Clarke, Hoyt Koepke

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Here, we propose a clustering technique for general clustering problems including those that have nonconvex clusters. For a given desired number of clusters K, we use three stages to find clusters. The first stage uses a hybrid clustering technique to produce a series of clusterings of various sizes (randomly selected). The key step in this stage is to find a K-means clustering using (Formula presented.) clusters where (Formula presented.) and then join these small clusters by using single linkage clustering. The second stage stabilizes the result of stage one by reclustering via the “membership matrix” under Hamming distance to generate a dendrogram. The third stage is to cut the dendrogram to get (Formula presented.) clusters where (Formula presented.) and then prune back to K to give a final clustering. A variant on our technique also gives a reasonable estimate for K_T, the true number of clusters. We provide arguments to justify the steps in the stages of our methods and we provide examples involving simulated and published data to compare our technique with other techniques. An R library, GHC, implementing our method is available from GitHub.
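The three stages described in the abstract can be illustrated with a minimal Python sketch. It is written from the abstract alone and is not the GHC R library or the authors' implementation. The function names (`kmeans`, `single_linkage`, `hybrid_clustering`, `ghc_sketch`) are hypothetical, the range from which the K-means sizes are drawn (3K to 6K) is an assumption, the use of single linkage for the stage-two dendrogram is an assumption, and the stage-three step of cutting above K and pruning back is simplified to a direct cut at K.

```python
import math
import random


def kmeans(points, k, rng, iters=25):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = rng.sample(points, k)  # initial centroids are data points
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


def single_linkage(n_items, dist, k):
    """Merge items into k groups, always joining the two groups whose
    closest pair of members is nearest (single linkage)."""
    groups = [[i] for i in range(n_items)]
    while len(groups) > k:
        a, b = min(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda ab: min(dist(i, j)
                               for i in groups[ab[0]] for j in groups[ab[1]]),
        )
        groups[a].extend(groups.pop(b))  # b > a, so index a is unaffected
    labels = [0] * n_items
    for g, members in enumerate(groups):
        for i in members:
            labels[i] = g
    return labels


def hybrid_clustering(points, k, k_small, rng):
    """Key step of stage one: K-means with k_small > k clusters, then join
    the small clusters by single linkage on their centroids."""
    small_labels, centroids = kmeans(points, k_small, rng)
    merged = single_linkage(
        k_small, lambda i, j: math.dist(centroids[i], centroids[j]), k)
    return [merged[lab] for lab in small_labels]


def ghc_sketch(points, k, n_runs=10, seed=0):
    rng = random.Random(seed)
    n = len(points)
    # Stage 1: several hybrid clusterings with randomly chosen K-means sizes.
    # The range 3k..6k is an assumption made for this sketch.
    runs = [hybrid_clustering(points, k, min(n, rng.randint(3 * k, 6 * k)), rng)
            for _ in range(n_runs)]
    # Stage 2: each point's row of the membership matrix is its label in every
    # run; the Hamming distance counts runs that separate two points.
    ham = [[sum(r[i] != r[j] for r in runs) for j in range(n)] for i in range(n)]
    # Stage 3 (simplified): cut the dendrogram directly at k clusters; the
    # cut-above-K-then-prune step from the abstract is omitted here.
    return single_linkage(n, lambda i, j: ham[i][j], k)


# Toy data: two tight, well-separated blobs of 15 points each.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(15)]
points += [(100 + data_rng.gauss(0, 1), 100 + data_rng.gauss(0, 1)) for _ in range(15)]
labels = ghc_sketch(points, k=2)
```

On this toy input the first 15 points and the last 15 points end up in different clusters. The quadratic-or-worse loops are kept for readability; a practical version would rely on an optimized K-means and hierarchical clustering routine.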

Original language: English (US)
Pages (from-to): 540-551
Number of pages: 12
Journal: Journal of Computational and Graphical Statistics
Volume: 28
Issue number: 3
DOI: 10.1080/10618600.2018.1546593
State: Published - Jul 3 2019

Keywords

  • Consistency
  • K-mean
  • Lifetime
  • Outlier
  • Single linkage
  • t-SNE

ASJC Scopus subject areas

  • Statistics and Probability
  • Discrete Mathematics and Combinatorics
  • Statistics, Probability and Uncertainty

Cite this

A General Hybrid Clustering Technique. / Amiri, Saeid; Clarke, Bertrand S.; Clarke, Jennifer L.; Koepke, Hoyt.

In: Journal of Computational and Graphical Statistics, Vol. 28, No. 3, 03.07.2019, p. 540-551.

Amiri, Saeid ; Clarke, Bertrand S. ; Clarke, Jennifer L. ; Koepke, Hoyt. / A General Hybrid Clustering Technique. In: Journal of Computational and Graphical Statistics. 2019 ; Vol. 28, No. 3. pp. 540-551.
@article{436032b0e7ab40048be55b1854d214ef,
title = "A General Hybrid Clustering Technique",
keywords = "Consistency, K-mean, Lifetime, Outlier, Single linkage, t-SNE",
author = "Saeid Amiri and Clarke, {Bertrand S.} and Clarke, {Jennifer L.} and Hoyt Koepke",
year = "2019",
month = "7",
day = "3",
doi = "10.1080/10618600.2018.1546593",
language = "English (US)",
volume = "28",
pages = "540--551",
journal = "Journal of Computational and Graphical Statistics",
issn = "1061-8600",
publisher = "American Statistical Association",
number = "3",

}

TY - JOUR

T1 - A General Hybrid Clustering Technique

AU - Amiri, Saeid

AU - Clarke, Bertrand S.

AU - Clarke, Jennifer L.

AU - Koepke, Hoyt

PY - 2019/7/3

Y1 - 2019/7/3

KW - Consistency

KW - K-mean

KW - Lifetime

KW - Outlier

KW - Single linkage

KW - t-SNE

UR - http://www.scopus.com/inward/record.url?scp=85073203174&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85073203174&partnerID=8YFLogxK

U2 - 10.1080/10618600.2018.1546593

DO - 10.1080/10618600.2018.1546593

M3 - Article

AN - SCOPUS:85073203174

VL - 28

SP - 540

EP - 551

JO - Journal of Computational and Graphical Statistics

JF - Journal of Computational and Graphical Statistics

SN - 1061-8600

IS - 3

ER -