Abstract. RDF dataset quality assessment is currently performed primarily after data is published. However, there is no systematic way to incorporate its results into the dataset, nor the assessment itself into the publishing workflow. Adjustments are applied manually, but rarely. Moreover, the root cause of the violations, which often derives from the mappings that specify how the RDF dataset is generated, is not identified. We suggest an incremental, iterative and uniform validation workflow for RDF datasets stemming originally from (semi-)structured data (e.g., CSV, XML, JSON). In this work, we focus on assessing and improving their mappings. We incorporate (i) a test-driven approach for assessing the mappings instead of the RDF dataset itself, as mappings reflect how the dataset will be formed when generated; and (ii) semi-automatic mapping refinements based on the results of the quality assessment. The proposed workflow is applied to diverse cases, e.g., large, crowdsourced datasets such as DBpedia, and newly generated ones, such as iLastic. Our evaluation indicates the efficiency of our workflow, as it significantly improves the overall quality of an RDF dataset in the observed cases.
Abstract. Effective, collaborative integration of software and big data engineering for Web-scale systems is now a crucial technical and economic challenge. It requires new combined data and software engineering processes and tools. Semantic metadata standards and Linked Data principles provide a technical grounding for such integrated systems, given an appropriate model of the domain. In this paper we introduce the ALIGNED suite of ontologies, specifically designed to model the information exchange needs of combined software and data engineering. These ontologies are deployed in web-scale, data-intensive system development environments in both the commercial and academic domains. We exemplify the usage of the suite on a complex collaborative software and data engineering scenario from the legal information system domain.
Abstract. Linked Data datasets use interlinks to connect semantically similar resources across datasets. As datasets evolve, a resource's locator can change, which can cause interlinks that contain old resource locators to no longer dereference and become invalid. Validating interlinks, by validating the resource locators within them, when a dataset has changed is important to ensure interlinks work as intended. In this paper we introduce the SPARQL Usage for Mapping Maintenance and Reuse (SUMMR) methodology. SUMMR is an approach for Mapping Maintenance and Reuse (MMR) that provides query templates, based on standard SPARQL queries, for MMR activities. This paper describes SUMMR and two experiments: a lab-based evaluation of SUMMR's mapping maintenance query templates and a deployment of SUMMR in the DBpedia v.2015-10 release to detect invalid interlinks. The lab-based evaluation involved detecting interlinks that had become invalid due to changes in resource locators, and repairing the invalid interlinks. The results show that the SUMMR templates and approach can be used to effectively detect and repair invalid interlinks. SUMMR's query template for discovering invalid interlinks was applied to the DBpedia v.2015-10 release, which discovered 53,418 invalid interlinks in that release.
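The detect-and-repair cycle the abstract describes can be sketched as follows (a hedged illustration, not SUMMR's actual query templates; all URIs, the in-memory interlink list, and the old-to-new locator map are hypothetical): an interlink is treated as invalid when its target locator no longer exists in the evolved target dataset, and repair rewrites it using known locator changes.

```python
# Illustrative sketch of interlink maintenance: detect interlinks whose
# target locator no longer resolves in the evolved target dataset, then
# repair them from a map of old to new locators.

OLD_TO_NEW = {  # known locator changes in the target dataset (hypothetical)
    "http://target.org/resource/Old_Name": "http://target.org/resource/New_Name",
}

interlinks = [  # (source resource, owl:sameAs target) pairs (hypothetical)
    ("http://source.org/r/1", "http://target.org/resource/Old_Name"),
    ("http://source.org/r/2", "http://target.org/resource/Stable"),
]

live_targets = {  # locators that currently dereference in the target dataset
    "http://target.org/resource/New_Name",
    "http://target.org/resource/Stable",
}

def detect_invalid(links, live):
    """An interlink is invalid if its target locator no longer exists."""
    return [(s, t) for s, t in links if t not in live]

def repair(links, old_to_new):
    """Rewrite invalid targets using the known locator changes."""
    return [(s, old_to_new.get(t, t)) for s, t in links]

invalid = detect_invalid(interlinks, live_targets)
repaired = repair(interlinks, OLD_TO_NEW)
```

In a real deployment these two steps would be expressed as SPARQL queries over the datasets' endpoints rather than over in-memory lists.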
The constantly growing number of Linked Open Data (LOD) datasets creates a need for rich metadata descriptions that enable users to discover, understand and process the available data. This metadata is often created, maintained and stored in diverse data repositories featuring disparate data models that are often unable to provide the metadata necessary to automatically process the datasets described. This paper proposes DataID, a best practice for LOD dataset descriptions that utilizes RDF files hosted together with the datasets, under the same domain. We describe the data model, which is based on the widely used DCAT and VoID vocabularies, as well as supporting tools to create and publish DataIDs, and use cases that show the benefits of providing semantically rich metadata for complex datasets. As a proof of concept, we generated a DataID for the DBpedia dataset, which we present in this paper.
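To make the DCAT/VoID grounding concrete, here is a minimal sketch of a dataset description of the kind DataID builds on (the specific property choices and the example dataset URI are assumptions for illustration, not the exact DataID model): a small Turtle document that could be hosted next to the dataset it describes.

```python
# Minimal illustrative DCAT/VoID dataset description, emitted as Turtle
# so it can be hosted under the same domain as the dataset itself.

PREFIXES = """\
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix void: <http://rdfs.org/ns/void#> .
"""

def describe_dataset(uri, title, triple_count, download_url):
    """Return a tiny DCAT/VoID description of one dataset."""
    return PREFIXES + f"""
<{uri}> a dcat:Dataset, void:Dataset ;
    dct:title "{title}" ;
    void:triples {triple_count} ;
    dcat:distribution [ a dcat:Distribution ;
        dcat:downloadURL <{download_url}> ] .
"""

doc = describe_dataset(
    "http://example.org/dataset/demo",   # hypothetical dataset URI
    "Demo Dataset",
    1200,
    "http://example.org/dataset/demo.ttl",
)
```

Richer descriptions would add agents, licenses and versioning information on top of this core, which is where DataID extends the plain DCAT/VoID vocabulary.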