Lab Lunch: Nan Tang

Interaction between Record Matching and Data Repairing

When: Aug 09, 2011, from 01:00 PM to 02:00 PM
Where: Informatics Forum MF2

Nan Tang will be the speaker at the LFCS Lab Lunch.

Title:   Interaction between Record Matching and Data Repairing


Abstract: Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing aims to make a database consistent by fixing errors in the data using constraints. Current data cleaning systems treat these as separate processes, based on heuristic solutions. We study a new problem, namely, the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we propose a uniform framework that seamlessly unifies repairing and matching operations, to clean a database based on integrity constraints, matching rules and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete to PSPACE-complete.

Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analysis, respectively, which are more accurate than possible fixes generated by heuristics. Using real-life data, we experimentally verify that our techniques are significantly more accurate than record matching and data repairing performed as separate processes.
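As a rough illustration of the interaction described in the abstract, the sketch below shows a repair enabling a match, and the match then enabling a further repair. It is not the speaker's algorithm: the attribute names, the two rules, the master record and the dirty tuple are all hypothetical, chosen only to make the idea concrete.

```python
# Illustrative sketch only: rules, attribute names, and data are hypothetical
# and are not taken from the talk.

# Master data: assumed to be clean reference records.
MASTER = [
    {"name": "Robert Brady", "phone": "3887644", "zip": "EH8 9AB", "city": "Edinburgh"},
]


def repair_city(t, master):
    """Constraint-style repair: zip determines city, so fix city from master data."""
    for m in master:
        if t["zip"] == m["zip"] and t["city"] != m["city"]:
            return {**t, "city": m["city"]}
    return t


def matches(t, m):
    """Hypothetical matching rule: same phone and same city mean the same person."""
    return t["phone"] == m["phone"] and t["city"] == m["city"]


def enrich_from_match(t, m):
    """Once matched, copy the master's name into the tuple (a repair enabled by matching)."""
    return {**t, "name": m["name"]}


dirty = {"name": "Bob Brady", "phone": "3887644", "zip": "EH8 9AB", "city": "Ldn"}

# Matching alone fails because the city value is wrong.
matched_before = any(matches(dirty, m) for m in MASTER)

# Repairing the city first makes the matching rule applicable.
repaired = repair_city(dirty, MASTER)
matched_after = any(matches(repaired, m) for m in MASTER)

# The match in turn licenses a further repair of the name attribute.
cleaned = enrich_from_match(repaired, MASTER[0]) if matched_after else repaired
```

Running matching and repairing as two separate passes in the wrong order would miss this match, which is the kind of situation a unified framework is meant to avoid.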

