Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end off-the-shelf solution to (semi-)automate the detection and the repairing of violations w.r.t. a set of heterogeneous and ad-hoc quality constraints. In short, there is no commodity platform similar to general-purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules by writing code that implements predefined classes; these rules uniformly define what is wrong with the data and (possibly) how to repair it. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed to allow cleaning algorithms to cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations of core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.

Figure 1: Architecture of NADEEF

Deployment and extensibility: Although many algorithms and techniques have been proposed for data cleaning [5, 14, 29], it is difficult to download one of them and run it on the data at hand without tedious customization. The difficulty grows when users define new types of quality rules, or want to extend an existing system with their own implementation of cleaning solutions.

Metadata management and data custodians: Data is not born an orphan. Real customers have little trust in machines that modify the data without human consultation. Several attempts have tackled the problem of including humans in the loop (e.g., [15, 26, 29]). However, they only provide users with information in restrictive formats. In practice, users need to understand much more meta-information, e.g., summarization or samples of data errors, lineage of data changes, and possible data repairs, before they can effectively guide any data cleaning process.

We introduce NADEEF, a prototype for an extensible and easy-to-deploy cleaning system that leverages the separability of two main tasks: (1) isolating rule specification that uniformly defines what is wrong and (possibly) how to fix it; and (2) developing a core that holistically applies these routines to handle the detection and cleaning of data errors.

We show NADEEF's architecture i...
We present NADEEF, an extensible, generic and easy-to-deploy data cleaning system. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify data quality rules by writing code that implements predefined classes. These classes uniformly define what is wrong with the data and (possibly) how to fix it. We will demonstrate the following features provided by NADEEF. (1) Heterogeneity: The programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs and ETL rules. (2) Interdependency: The core algorithms can interleave multiple types of rules to detect and repair data errors. (3) Deployment and extensibility: Users can easily customize NADEEF by defining new types of rules, or by extending the core. (4) Metadata management and data custodians: We show a live data quality dashboard to effectively involve users in the data cleaning process.
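The split between a rule interface and a core can be illustrated with a minimal Python sketch. It is not NADEEF's actual API: the names `Rule`, `Violation`, `Fix`, `FunctionalDependency`, `NotNull`, `detect_all` and `collect_fixes` are all hypothetical, chosen only to show how rules can state what is wrong (and optionally how to fix it) while a core treats every rule as a black box.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Cell:
    row: int      # index of the tuple in the table
    column: str   # attribute name


@dataclass(frozen=True)
class Violation:
    rule: str     # name of the rule that was violated
    cells: tuple  # the cells that are jointly wrong


@dataclass(frozen=True)
class Fix:
    left: Cell    # candidate repair: make `left` equal to `right`
    right: Cell


class Rule(ABC):
    """Hypothetical rule interface: what is wrong, and possibly how to fix it."""
    name = "rule"

    @abstractmethod
    def detect(self, rows):
        """Return a list of Violation objects found in `rows`."""

    def repair(self, violation):
        """Return candidate fixes; detection-only rules return none."""
        return []


class FunctionalDependency(Rule):
    """lhs -> rhs: rows that agree on lhs must agree on rhs."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
        self.name = f"FD({lhs}->{rhs})"

    def detect(self, rows):
        found = []
        for (i, a), (j, b) in combinations(enumerate(rows), 2):
            if a[self.lhs] == b[self.lhs] and a[self.rhs] != b[self.rhs]:
                found.append(
                    Violation(self.name, (Cell(i, self.rhs), Cell(j, self.rhs)))
                )
        return found

    def repair(self, violation):
        left, right = violation.cells
        return [Fix(left, right)]


class NotNull(Rule):
    """A detection-only rule: it says what is wrong but not how to fix it."""

    def __init__(self, column):
        self.column = column
        self.name = f"NotNull({column})"

    def detect(self, rows):
        return [
            Violation(self.name, (Cell(i, self.column),))
            for i, row in enumerate(rows)
            if row[self.column] is None
        ]


def detect_all(rules, rows):
    """Core detection: runs every rule without inspecting its type."""
    violations = []
    for rule in rules:
        violations.extend(rule.detect(rows))
    return violations


def collect_fixes(rules, violations):
    """Core repair input: gathers candidate fixes from whichever rules offer them."""
    by_name = {rule.name: rule for rule in rules}
    fixes = []
    for violation in violations:
        fixes.extend(by_name[violation.rule].repair(violation))
    return fixes
```

In this sketch the core functions never branch on the kind of rule, so a new rule type only has to subclass `Rule`. How the collected fixes are reconciled into an actual repair is left out; that is the job of the core repairing algorithms the abstract mentions.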
Land change (LC) models are dedicated to a better understanding of land use and land cover dynamics. A fundamental aspect of those models lies in the calibration of spatial parameters underlying such dynamics. Although there are many studies on the calibration of LC models, current efforts share the goal of finding a single global optimum solution, even though land change dynamics may be inherently heterogeneous throughout a given space. This article presents a calibration approach for finding multiple optimal solutions. A crowding niching genetic algorithm (CNGA) is incorporated into a cellular automata LC model. The model is applied to simulate urban expansion in Wallonia (Belgium) as a case study. Our findings demonstrate the ability of the model to locate multiple solutions simultaneously. In addition, the CNGA performs better than the standard genetic algorithm and helps to better understand the properties of land change dynamics within a given landscape.
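To make the idea of crowding concrete, the following is a minimal sketch of one well-known variant, deterministic crowding, applied to a toy one-dimensional function. The abstract does not say which crowding scheme the CNGA uses, and nothing here models cellular automata or land change; the objective `fitness`, the operators in `make_children`, and all parameter values are assumptions for illustration only.

```python
import math
import random


def fitness(x):
    # Toy multimodal objective (an assumption, not the LC model):
    # five equal peaks on [0, 1], at x = 0.1, 0.3, 0.5, 0.7, 0.9.
    return math.sin(5 * math.pi * x) ** 6


def clamp(x):
    # Keep candidate solutions inside the search interval [0, 1].
    return min(1.0, max(0.0, x))


def make_children(p1, p2, rng, sigma=0.02):
    # Arithmetic crossover followed by small Gaussian mutation.
    w = rng.random()
    c1 = clamp(w * p1 + (1 - w) * p2 + rng.gauss(0, sigma))
    c2 = clamp((1 - w) * p1 + w * p2 + rng.gauss(0, sigma))
    return c1, c2


def crowding_step(pop, rng):
    # One generation of deterministic crowding.
    order = list(range(len(pop)))
    rng.shuffle(order)                      # random parent pairing
    new_pop = list(pop)
    for a, b in zip(order[0::2], order[1::2]):
        p1, p2 = pop[a], pop[b]
        c1, c2 = make_children(p1, p2, rng)
        # Match each child with the parent it resembles most.
        if abs(p1 - c1) + abs(p2 - c2) > abs(p1 - c2) + abs(p2 - c1):
            c1, c2 = c2, c1
        # A child replaces only its own matched parent, and only if it
        # is at least as fit, so distinct niches are not overwritten.
        if fitness(c1) >= fitness(p1):
            new_pop[a] = c1
        if fitness(c2) >= fitness(p2):
            new_pop[b] = c2
    return new_pop


def run(pop_size=40, generations=100, seed=0):
    # Evolve a uniformly initialised population and return it.
    rng = random.Random(seed)
    pop = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        pop = crowding_step(pop, rng)
    return pop
```

Because each child competes only against the parent closest to it, individuals near different peaks tend not to displace one another, which is what allows a crowding scheme to keep several optima in the population where a standard genetic algorithm tends to converge on one.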