FPCluster: An Efficient Out-of-core Clustering Strategy without a Similarity Metric

Authors

  • Douglas E.V. Pires, Universidade Federal de Minas Gerais
  • Luam C. Totti, Universidade Federal de Minas Gerais
  • Rubens E.A. Moreira, Universidade Federal de Minas Gerais
  • Elverton C. Fazzion, Universidade Federal de Minas Gerais
  • Osvaldo L.H.M. Fonseca, Universidade Federal de Minas Gerais
  • Wagner Meira Jr, Universidade Federal de Minas Gerais
  • Raquel C. de Melo-Minardi, Universidade Federal de Minas Gerais
  • Dorgival Guedes Neto, Universidade Federal de Minas Gerais

Keywords

Clustering, out-of-core, protein families, spam detection

Abstract

Clustering is one of the most popular and relevant data mining tasks. Two challenges for determining clusters are the volume of data to be grouped and the difficulty in defining a similarity metric applicable to the entire data set. In this work we present FPCluster, a new clustering algorithm that addresses both problems. The algorithm is based on building out-of-core frequent pattern trees, a data structure originally proposed for mining patterns. Additionally, the algorithm transparently handles missing features, a common constraint in real-world scenarios. We applied FPCluster to two real scenarios: characterization of spam campaigns and clustering of protein families. We evaluated both the quality of the obtained groups and the computational efficiency of the proposed strategy. In particular, we achieved precision above 90% while the storage demand increased sub-linearly.
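
For readers unfamiliar with the frequent pattern tree mentioned in the abstract, the following is a minimal in-memory sketch of that data structure only. It is not the FPCluster algorithm and it is not out-of-core; the names Node, build_fp_tree and leaf_groups and the feature strings in the usage note are hypothetical, chosen for illustration. Records are assumed to be sets of feature=value items, and a record with a missing feature is assumed to simply contain fewer items.

```python
from collections import Counter


class Node:
    """One tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(records):
    """Build an in-memory prefix tree over records given as sets of items."""
    # Global item frequencies define one ordering shared by all records.
    freq = Counter(item for record in records for item in set(record))
    root = Node()
    for record in records:
        # Most frequent items first; ties broken by item value for determinism.
        ordered = sorted(set(record), key=lambda it: (-freq[it], it))
        node = root
        for item in ordered:
            # Records sharing a prefix of frequent items share a path.
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def leaf_groups(node, prefix=()):
    """Yield (path, count) for every leaf, visiting children in sorted order."""
    if not node.children:
        if prefix:
            yield prefix, node.count
        return
    for item in sorted(node.children):
        yield from leaf_groups(node.children[item], prefix + (item,))
```

As a usage example under the same assumptions, build_fp_tree([{"lang=pt", "url=a"}, {"lang=pt", "url=a"}, {"lang=pt", "url=b"}, {"lang=en"}]) places the three records containing lang=pt under one shared node, so records are grouped by shared frequent items without computing any pairwise similarity.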

Published

2012-09-20