Early Classification: A New Heuristic to Improve the Classification Step of K-Means

Authors

  • Joaquín Pérez CENIDET
  • Carlos Eduardo Pires UFCG
  • Leandro Balby UFCG
  • Adriana Mexicano CENIDET
  • Miguel Ángel Hidalgo CENIDET

Keywords:

Clustering, K-Means, Performance Optimization, Unsupervised Learning

Abstract

Cluster analysis is the study of algorithms and techniques for grouping objects according to their intrinsic characteristics and similarity. A widely studied and popular clustering algorithm is K-Means, which is characterized by its ease of implementation and high computational cost. Although various performance improvements have been proposed for K-Means, the algorithm is still considered an expensive alternative for clustering large scale datasets. This work proposes a new heuristic for reducing the number of calculations needed in the classification step of K-Means, without high loss of quality reduction, by using statistical information about the displacement of centroids at each iteration. Our heuristic, called Early Classification (EC for short), identifies and excludes from future calculations those objects that, according to an equidistance threshold, have low likelihood of cluster change in subsequent iterations. To validate our proposal, a set of experiments is performed on synthetic and real-world datasets from the UCI Machine Learning repository. In addition, we compared our heuristic against two other state-of-the-art variations of K-Means. The Wilcoxon signed-rank test was used for calculating the statistical significance. The results are promising since the execution time of K-Means was reduced up to 82.49%, with a quality reduction of only 3.31%. Moreover, as the experiments will show, the superiority of our method is even more evident on large datasets.

Downloads

Published

2013-08-21