Many popular data cleaning approaches are rule-based: Constraints are formulated in a logical framework, and data is considered dirty if constraints are violated.These constraints are often discovered from data, but to ascertain their validity, user verification is necessary. Since the full set of discovered constraints is typically too large for manual inspection, recent research integrates user feedback into the discovery process. We propose a different approach that employs user interaction only at the start of the algorithm: a user manually cleans a small set of dirty tuples, and we infer the constraint underlying those repairs, called an explanation. We make use of conditional functional dependencies (CFDs) as the constraint formalism. We introduce XPlode, an ondemand algorithm which discovers the best explanation for a given repair. Guided by this explanation, data can then be cleaned using state-of-the-art CFD-based cleaning algorithms. Experiments on synthetic and real-world datasets show that the best explanation can typically be inferred using a limited number of modifications. Moreover, XPlode is substantially faster than discovering all CFDs that hold on a dataset, and is robust to noise in the modifications.
Many techniques for cleaning dirty data are based on enforcing some set of integrity constraints. Conditional functional dependencies (CFDs) are a combination of traditional Functional dependencies (FDs) and association rules, and are widely used as a constraint formalism for data cleaning. However, the discovery of such CFDs has received limited attention. In this paper, we regard CFDs as an extension of association rules, and present three general methodologies for (approximate) CFD discovery, each using a different way of combining pattern mining for discovering the conditions (the "C" in CFD) with FD discovery. We discuss how existing algorithms fit into these three methodologies, and introduce new techniques to improve the discovery process. We show that the right choice of methodology improves performance over the traditional CFD discovery method CTane.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.