Abstract: Linked open data (LOD), as provided by a quickly growing number of sources, constitutes a wealth of easily accessible information. However, this data is not easy to understand. It is usually provided as a set of (RDF) triples, often in the form of enormous files covering many domains. What is more, the data usually has a loose structure when it is derived from end-user generated sources, such as Wikipedia. Finally, the quality of the actual data is also a concern, because it may be incomplete, poorly formatted, inconsistent, etc. To understand and profile such linked open data, traditional data profiling methods do not suffice. With ProLOD, we propose a suite of methods ranging from the domain level (clustering, labeling), via the schema level (matching, disambiguation), to the data level (data type detection, pattern detection, value distribution). Packaged into an interactive, web-based tool, they allow iterative exploration and discovery of new LOD sources. Thus, users can quickly gauge the relevance of a source for the problem at hand (e.g., some integration task), and then focus on and explore the relevant subset.
I. PROFILING LINKED OPEN DATA

Data profiling comprises a well-established set of basic operations, which analyze a (relational) dataset and create metadata that is useful to understand the data and to detect irregularities. Profiling is mostly performed in a column-by-column manner, for instance to detect frequent value patterns or the uniqueness of column values. Common profiling methods and tools assume that each column has well-defined semantics and that the data is mostly regular.

These assumptions do not hold for linked open data (LOD) published on the web. Such data emerge from different sources, such as open source communities (e.g., Wikipedia) or projects dedicated to a specific topic (e.g., DrugBank [1]). These diverse origins lead to diversity in how information is expressed as data values and how these values are structured. Nevertheless, these datasets link to each other. The overall LOD vision is to enable the generation of new knowledge based on a wealth of widely available interlinked data. However, leveraging the variety of such open data requires (i) an initial understanding of each single dataset and (ii) an overview of the available data as a whole. Only then can data analysts focus on the subset of LOD required for the problem at hand. Classical profiling techniques are, to the best of our knowledge, not appropriate to deal with these new massive sets of open (and thus heterogeneous) data.

We propose a new iterative and interactive methodology for profiling LOD. We envision a process that allows a user to divide data into groups, review simple statistics or sophisticated mining results at the group level, and then revisit and revise the grouping decisions to refine the profiling result. In this paper, we report on ProLOD, an initial prototype we developed as a step towards this vision. As a proof-of-concept, we concentrate on the infobox (without ontol...