Abstract. Several data quality management (DQM) tasks, such as duplicate detection and consistency checking, depend on domain-specific knowledge. Many DQM approaches have the potential to bring together domain knowledge and DQM metadata. We present an approach that uses domain knowledge modeled in ontologies instead of acquiring it through cost-intensive interviews with domain experts. These ontologies can be annotated directly with DQM-specific metadata. Our approach yields a synergy effect: a domain ontology modeled for another purpose, e.g. to define a shared vocabulary for improved interoperability, can also be used to perform DQM. We present five DQM applications that directly use knowledge provided by domain ontologies. These applications use the ontology structure itself to provide correction suggestions for invalid data, to identify duplicates, and to store data quality annotations at the schema and instance level.
Data cleaning focuses on the identification and removal of consistency constraint violations. Existing approaches perform only statistical repair operations, i.e. inserting average or default values. The resulting data are consistent but no longer resemble the original inconsistent data. An ontology-based approach allows for the detection of semantically related, context-aware correction suggestions. We define metrics for calculating the similarity of such correction suggestions and introduce measures of the semantic distance between concepts in ontologies. The ontology thus enables both the detection of context-aware correction suggestions and the calculation of their similarity to the invalid tuple. These suggestions can be presented to end users in data cleaning environments. We introduce this approach in a cancer registry that collects data about cancer cases and show how it can support the registry's domain experts in data cleaning.
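The idea of ranking correction suggestions by semantic distance can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the abstract: the ontology is reduced to a single is-a hierarchy, the concept names are hypothetical, and distance is measured as the number of edges on the path through the lowest common ancestor. The metrics actually defined in the paper may differ.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical is-a hierarchy (child -> parent); illustrative only,
# not the ontology used in the cancer registry.
PARENT: Dict[str, Optional[str]] = {
    "neoplasm": None,
    "carcinoma": "neoplasm",
    "sarcoma": "neoplasm",
    "adenocarcinoma": "carcinoma",
    "squamous_cell_carcinoma": "carcinoma",
    "osteosarcoma": "sarcoma",
}


def ancestors(concept: str) -> List[str]:
    """Return the path from a concept up to the root, starting with the concept itself."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def semantic_distance(a: str, b: str) -> int:
    """Count is-a edges between a and b via their lowest common ancestor (assumed metric)."""
    steps_from_b = {c: i for i, c in enumerate(ancestors(b))}
    for steps_from_a, c in enumerate(ancestors(a)):
        if c in steps_from_b:
            return steps_from_a + steps_from_b[c]
    raise ValueError("concepts share no common ancestor")


def rank_suggestions(invalid_value: str, candidates: List[str]) -> List[Tuple[str, int]]:
    """Order candidate corrections so that the semantically closest comes first."""
    scored = [(c, semantic_distance(invalid_value, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for concept, dist in rank_suggestions(
        "adenocarcinoma",
        ["osteosarcoma", "squamous_cell_carcinoma", "carcinoma"],
    ):
        print(concept, dist)
```

In this sketch a sibling concept scores lower (closer) than a concept in a different branch, which is the property that lets a suggestion stay similar to the invalid value instead of falling back to a default.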