FPCluster: An Efficient Out-of-core Clustering Strategy without a Similarity Metric
Keywords:
Clustering, out-of-core, protein families, spam detection
Abstract
Clustering is one of the most popular and relevant data mining tasks. Two challenges in determining clusters are the volume of data to be grouped and the difficulty of defining a similarity metric applicable to the entire data set. In this work we present FPCluster, a new clustering algorithm that addresses both problems. The algorithm is based on building out-of-core frequent pattern trees, a data structure originally proposed for mining patterns. Additionally, the algorithm transparently handles missing features, a common constraint in real-world scenarios. We applied FPCluster to two real scenarios: characterization of spam campaigns and clustering of protein families. We evaluated both the quality of the obtained groups and the computational efficiency of the proposed strategy. In particular, we achieved precision above 90% while the storage demand increased sub-linearly.
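As a rough illustration of the data structure the abstract refers to, the sketch below builds a small in-memory frequent pattern tree in Python. It is not the FPCluster algorithm and does not attempt the out-of-core storage; the names `FPNode` and `build_fp_tree`, the `feature=value` item encoding, and the sample records are all assumptions made for this example.

```python
from collections import Counter


class FPNode:
    """One node of a frequent pattern tree: an item plus a visit count."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(records, min_support=1):
    """Build an in-memory FP-tree from records given as sets of items."""
    # First pass: count how many records contain each item.
    support = Counter(item for record in records for item in set(record))
    root = FPNode()
    # Second pass: insert each record with its items in one global order
    # (most frequent first, ties broken by name) so that records sharing
    # frequent items also share a path prefix in the tree.
    for record in records:
        items = [i for i in set(record) if support[i] >= min_support]
        items.sort(key=lambda i: (-support[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root


# Hypothetical records encoded as "feature=value" items. A record that
# lacks a feature simply contributes fewer items, so no similarity
# metric or imputation step is needed to insert it.
records = [
    {"subject=offer", "lang=en", "url=a.example"},
    {"subject=offer", "lang=en"},  # url feature missing
    {"subject=offer", "lang=pt"},
    {"subject=news"},
]
tree = build_fp_tree(records)
```

In this toy tree, the three records that share `subject=offer` fall under one branch, which gives an intuition for how shared path prefixes can act as candidate groups. How FPCluster actually derives clusters from the tree is described in the paper itself, not here.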
Published
2012-09-20
Section
SBBD 2011 Short Papers