Before reaping the benefits of open data to add value to an organization's internal data, such new, external datasets must be analyzed and understood at the basic level of data types, constraints, value patterns, etc. Such data profiling, already difficult for large relational data sources, is even more challenging for RDF datasets, the preferred data model for linked open data.

We present ProLOD++, a novel tool for various profiling and mining tasks to understand and ultimately improve open RDF data. ProLOD++ comprises various traditional data profiling tasks, adapted to the RDF data model. In addition, it features many specific profiling results for open data, such as schema discovery for user-generated attributes, association rule discovery to uncover synonymous predicates, and uniqueness discovery along ontology hierarchies. ProLOD++ is highly efficient, allowing interactive profiling for users interested in exploring the properties and structure of yet unknown datasets.

I. PROFILING LINKED OPEN DATA

At the time of writing, Linked Open Data (LOD) as compiled at http://linkeddata.org already comprised more than 300 data sources, including prominent examples such as DBpedia, YAGO, and Freebase. A LOD dataset is usually represented in the Resource Description Framework (RDF), embodying an entity-relationship graph or a set of triplified facts consisting of subjects, predicates, and objects. Most of the datasets are openly available and connected to each other via sameAs links between representations of the same real-world entities. Hundreds more open RDF datasets are listed, for instance, at http://datahub.io.

However, consuming LOD is not easy, because the sources are heterogeneous, often inconsistent, and often lack even basic metadata. One of the main reasons for this problem is that many of the data sources, such as DBpedia [7] or YAGO [13], have been extracted from unstructured data.
Furthermore, a knowledge base usually evolves over time as more facts and entities are added, and rigid schema and ontology definitions, hand-crafted at some point in time, lose validity over all entities of the dataset. Hence it is vital to thoroughly examine and understand each dataset, its structure, and its properties before usage.

Manually inspecting datasets can achieve this goal only to a limited extent: algorithms and tools are needed that profile the dataset to retrieve relevant and interesting metadata by analyzing the entire dataset [14]. Indeed, there are many commercial tools, such as IBM's Information Analyzer, Microsoft's SQL Server Integration Services (SSIS), or Informatica's Data Explorer, and some research prototypes, such as [12], for profiling relational datasets. However, all of these tools were designed to profile relational data. LOD, which is represented in RDF, has a very different nature and calls for specific profiling and mining techniques. Current tools for working with RDF data are limited to graph visualization and editing: LODlive is a browser-based tool to browse and search in RDF datasets. RDF Pro...