Adaptive and Flexible Blocking for Record Linkage Tasks
Abstract
In data integration tasks, records from a single dataset or from different sources must often be compared to identify records that represent the same real-world entity. The cost of this search for duplicate records grows quadratically with the number of records in the data sources, so direct approaches, such as comparing all record pairs, must be avoided. In this context, blocking methods are used to create groups of records that are likely to correspond to the same real-world entity, so that deduplication can be applied within these blocks only. In the recent literature, machine learning is used to find the best blocking function, based on a combination of low-cost rules that define how to perform the record blocking. In this paper we present a new blocking method based on machine learning.
Unlike other methods, our method is based on genetic programming, which allows more flexible rules, and a larger number of them, to be used in defining blocking functions, leading to more effective identification of duplicate records. Experimental results with real and synthetic data show that our method achieves over 95% correctness when generating blocks of potential duplicates.
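To make the idea of a blocking function concrete, the following is a minimal sketch of rule-based blocking in Python. It is not the genetic programming method presented in the paper: the rules are hand-written rather than learned, and the records, field names, and the helper names `same_zip`, `surname_prefix`, and `candidate_pairs` are all assumptions made for illustration. It only shows how a combination of low-cost rules restricts comparison to records that share a block.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; the field names and values are invented for illustration.
records = [
    {"id": 1, "name": "John Smith", "zip": "30301"},
    {"id": 2, "name": "Jon Smith", "zip": "30301"},
    {"id": 3, "name": "Mary Jones", "zip": "10001"},
    {"id": 4, "name": "Mary Jonas", "zip": "10002"},
    {"id": 5, "name": "Alan Turing", "zip": "94105"},
]


def same_zip(record):
    """Hypothetical low-cost rule: block on the exact zip code."""
    return ("zip", record["zip"])


def surname_prefix(record):
    """Hypothetical low-cost rule: block on the first 3 letters of the last name token."""
    return ("surname3", record["name"].split()[-1][:3].lower())


def candidate_pairs(records, key_functions):
    """Return the id pairs that share a block under at least one rule.

    Each rule maps a record to a blocking key. Taking the union over the
    rules corresponds to a disjunction of rules.
    """
    pairs = set()
    for key_fn in key_functions:
        blocks = defaultdict(list)
        for record in records:
            blocks[key_fn(record)].append(record["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Number of comparisons without blocking: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2
blocked = candidate_pairs(records, [same_zip, surname_prefix])

print(f"all pairs: {all_pairs}, candidate pairs after blocking: {len(blocked)}")
print(sorted(blocked))
```

On this toy data, neither rule alone suffices: the zip rule misses records 3 and 4, whose zip codes differ, while the surname rule catches them. This is why a blocking function is typically a combination of several cheap rules, and why choosing that combination is treated as a learning problem.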
Published
2010-09-14
Section
Regular Articles