Genetic algorithm classifier system for semi-supervised learning

L. Dee Miller, Leen-Kiat Soh, Stephen Scott

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Real-world datasets often contain large numbers of unlabeled data points because obtaining labels carries additional cost. Semi-supervised learning (SSL) algorithms use both labeled and unlabeled data points for training, which can result in higher classification accuracy on these datasets. Traditional SSLs generally assign tentative labels to the unlabeled data points on the basis of the smoothness assumption that neighboring points should have the same label. When this assumption is violated, unlabeled points are mislabeled, injecting noise into the final classifier. An alternative SSL approach is cluster-then-label (CTL), which partitions all the data points (labeled and unlabeled) into clusters and builds a classifier from those clusters. CTL relies on the less restrictive cluster assumption that data points in the same cluster should have the same label. As we show, this allows CTLs to achieve higher classification accuracy on many datasets where the cluster assumption holds but the smoothness assumption does not. However, cluster configuration problems (e.g., irrelevant features, an insufficient number of clusters, and incorrectly shaped clusters) can violate the cluster assumption. We propose a new CTL framework that uses a genetic algorithm (GA) to evolve classifiers free of these cluster configuration problems (e.g., the GA removes irrelevant attributes, adjusts the number of clusters, and changes the shape of the clusters). We demonstrate that a CTL based on this framework achieves accuracy comparable to or higher than both traditional SSLs and existing CTLs on 12 University of California, Irvine machine learning datasets.
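
The cluster-then-label idea and the GA-driven search over cluster configurations can be illustrated with a small sketch. The code below is purely illustrative and is not the GACS framework or experimental setup from the paper: it assumes scikit-learn's KMeans as the underlying clusterer and a toy chromosome consisting of a binary feature mask plus a cluster count, scored by cluster-then-label accuracy on the labeled points alone. All names (ctl_predict, fitness, evolve_ctl) and parameter choices are hypothetical.

import numpy as np
from sklearn.cluster import KMeans


def ctl_predict(X, y, labeled, feat_mask, k, seed=0):
    """Cluster-then-label: cluster ALL points on the selected features,
    give each cluster the majority label of its labeled members, and
    predict every point from its cluster's label.
    y is assumed to hold non-negative integer class labels; entries at
    unlabeled positions are ignored."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[:, feat_mask])
    majority = lambda vals: np.bincount(vals).argmax()
    default = majority(y[labeled])            # fallback for clusters with no labeled members
    cluster_label = np.full(k, default)
    for c in range(k):
        members = (km.labels_ == c) & labeled
        if members.any():
            cluster_label[c] = majority(y[members])
    return cluster_label[km.labels_]


def fitness(chrom, X, y, labeled):
    """Score a cluster configuration by accuracy on the labeled points only."""
    feat_mask, k = chrom
    if feat_mask.sum() == 0:
        return 0.0
    pred = ctl_predict(X, y, labeled, feat_mask, k)
    return float((pred[labeled] == y[labeled]).mean())


def evolve_ctl(X, y, labeled, pop_size=12, gens=15, k_max=10, seed=0):
    """Toy GA over chromosomes (binary feature mask, number of clusters)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = [(rng.random(d) < 0.7, int(rng.integers(2, k_max + 1))) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda c: fitness(c, X, y, labeled), reverse=True)
        parents = pop[: pop_size // 2]            # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            (m1, k1), (m2, k2) = parents[i], parents[j]
            mask = np.where(rng.random(d) < 0.5, m1, m2)   # uniform crossover on the mask
            mask ^= rng.random(d) < 0.05                   # bit-flip mutation
            k = int(np.clip((k1 if rng.random() < 0.5 else k2) + rng.integers(-1, 2), 2, k_max))
            children.append((mask, k))
        pop = parents + children
    best = max(pop, key=lambda c: fitness(c, X, y, labeled))
    return best, ctl_predict(X, y, labeled, best[0], best[1])


# Example usage on a toy dataset where only 15% of the points are labeled:
# rng = np.random.default_rng(1)
# X = rng.normal(size=(300, 6)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
# labeled = np.zeros(300, dtype=bool)
# labeled[rng.choice(300, 45, replace=False)] = True
# (best_mask, best_k), preds = evolve_ctl(X, y, labeled)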

Original language: English (US)
Pages (from-to): 201-232
Number of pages: 32
Journal: Computational Intelligence
Volume: 31
Issue number: 2
DOI: 10.1111/coin.12018
State: Published - May 1, 2015

Keywords

  • cluster-then-label
  • genetic algorithm
  • semi-supervised learning
  • unsupervised clustering

ASJC Scopus subject areas

  • Computational Mathematics
  • Artificial Intelligence

Cite this

Dee Miller, L.; Soh, Leen-Kiat; Scott, Stephen. Genetic algorithm classifier system for semi-supervised learning. In: Computational Intelligence, Vol. 31, No. 2, 01.05.2015, pp. 201-232. https://doi.org/10.1111/coin.12018

@article{f6040609b1054084875a50f524034c3d,
title = "Genetic algorithm classifier system for semi-supervised learning",
keywords = "cluster-then-label, genetic algorithm, semi-supervised learning, unsupervised clustering",
author = "{Dee Miller}, L. and Leen-Kiat Soh and Stephen Scott",
year = "2015",
month = "5",
day = "1",
doi = "10.1111/coin.12018",
language = "English (US)",
volume = "31",
pages = "201--232",
journal = "Computational Intelligence",
issn = "0824-7935",
publisher = "Wiley-Blackwell",
number = "2",

}
