# Lab Lunch talk by Wenyuan Yu

Incremental Detection of Inconsistencies in Distributed Data

This paper investigates the problem of incremental
detection of errors in distributed data. Given a distributed
database D, a set Σ of conditional functional dependencies
(CFDs), the set V of violations of the CFDs in D, and updates
∆D to D, it is to ﬁnd, with minimum data shipment, changes
∆V to V in response to ∆D. The need for the study is evident
since real-life data is often dirty, distributed and is frequently
updated. It is often prohibitively expensive to recompute the
entire set of violations when D is updated. We show that the
incremental detection problem is NP-complete for D partitioned
either vertically or horizontally, even when Σ and D are ﬁxed.
Nevertheless, we show that it is bounded and better still, actually
optimal: there exist algorithms to detect errors such that their
computational cost and data shipment are both linear in the size
of ∆D and ∆V, independent of the size of the database D. We
provide such incremental algorithms for vertically partitioned
data, and show that the algorithms are optimal. We further
propose optimization techniques for the incremental algorithm
over vertical partitions to reduce data shipment. We verify
experimentally, using real-life data on Amazon Elastic Compute
Cloud (EC2), that our algorithms substantially outperform their
batch counterparts even when ∆V is reasonably large.