DistLODStats: Distributed Computation of RDF Dataset Statistics

Sejdiu, Gëzim; Ermilov, Ivan; Lehmann, Jens; Mami, Mohamed Nadjib

doi:10.1007/978-3-030-00668-6_13

Cited by 8 publications

(9 citation statements)

References 14 publications


“…The experimental setup of ABSTAT-HD is in line with the setup used in the only other approach proposed in the state-of-the-art to distribute the computation of knowledge graph profiling, namely, DistLODStat [43], where the scalability of the distributed and centralized version of the same systems are compared.…”

Section: Methods (mentioning)

confidence: 99%

ABSTAT-HD: a scalable tool for profiling very large knowledge graphs

et al. 2021


Processing large-scale and highly interconnected Knowledge Graphs (KG) is becoming crucial for many applications such as recommender systems, question answering, etc. Profiling approaches have been proposed to summarize large KGs with the aim to produce concise and meaningful representation so that they can be easily managed. However, constructing profiles and calculating several statistics such as cardinality descriptors or inferences are resource expensive. In this paper, we present ABSTAT-HD, a highly distributed profiling tool that supports users in profiling and understanding big and complex knowledge graphs. We demonstrate the impact of the new architecture of ABSTAT-HD by presenting a set of experiments that show its scalability with respect to three dimensions of the data to be processed: size, complexity and workload. The experimentation shows that our profiling framework provides informative and concise profiles, and can process and manage very large KGs.


“…Schmachtenberg et al [32] present the status of RDF datasets in the LOD Cloud in terms of size, linking, vocabulary usage, and metadata. LODStats [13] and the large-scale approach DistLODStats [33] report on descriptive statistics about RDF datasets on the web, including the number of triples, RDF terms, properties per entity, and usage of vocabularies across datasets. ExpLOD [25] generates summaries and aggregated statistics about the structure of RDF graphs, e.g., sets of used properties or the number of instances per class.…”

Section: RDF-specific Analyses (mentioning)

confidence: 99%

Characterizing RDF graphs through graph-based measures – framework and assessment

Zloch, Acosta, Hienert, et al. 2021


The topological structure of RDF graphs inherently differs from other types of graphs, like social graphs, due to the pervasive existence of hierarchical relations (TBox), which complement transversal relations (ABox). Graph measures capture such particularities through descriptive statistics. Besides the classical set of measures established in the field of network analysis, such as size and volume of the graph or the type of degree distribution of its vertices, there has been some effort to define measures that capture some of the aforementioned particularities RDF graphs adhere to. However, some of them are redundant, computationally expensive, and not meaningful enough to describe RDF graphs. In particular, it is not clear which of them are efficient metrics to capture specific distinguishing characteristics of datasets in different knowledge domains (e.g., Cross Domain vs. Linguistics). In this work, we address the problem of identifying a minimal set of measures that is efficient, essential (non-redundant), and meaningful. Based on 54 measures and a sample of 280 graphs of nine knowledge domains from the Linked Open Data Cloud, we identify an essential set of 13 measures, having the capacity to describe graphs concisely. These measures have the capacity to present the topological structures and differences of datasets in established knowledge domains.


“…It parallelizes streaming and sorting techniques to efficiently process RDF data. More recent methods either use HDFS (LODOP [14]) or store the data in memory (DistLODStats [33] via Spark). Exact rewriting rules have also been proposed to optimize the execution of such queries with groupings and aggregates in RDF data [11].…”

Section: Related Work (mentioning)

confidence: 99%

“…However, the increase in volume that makes these indicators more necessary also makes them harder to compute. The most recent methods adopt distributed architectures [14,33] that centralize the data, and then execute the indicator queries on that centralized data repository. To compute the exact query result, these approaches thus require the materialization of the entire LOD cloud.…”

Section: Introduction (mentioning)

confidence: 99%

Anytime Large-Scale Analytics of Linked Open Data

Soulet, Suchanek 2019

Lecture Notes in Computer Science


Analytical queries are queries with numerical aggregators: computing the average number of objects per property, identifying the most frequent subjects, etc. Such queries are essential to monitor the quality and the content of the Linked Open Data (LOD) cloud. Many analytical queries cannot be executed directly on the SPARQL endpoints, because the fair use policy cuts off expensive queries. In this paper, we show how to rewrite such queries into a set of queries that each satisfy the fair use policy. We then show how to execute these queries in such a way that the result provably converges to the exact query answer. Our algorithm is an anytime algorithm, meaning that it can give intermediate approximate results at any time point. Our experiments show that the approach converges rapidly towards the exact solution, and that it can compute even complex indicators at the scale of the LOD cloud.
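To make the two example indicators in the abstract above concrete, here is a minimal sketch that computes them over a small in-memory list of triples. It is not the anytime algorithm of the paper and does not touch a SPARQL endpoint. The toy triples and the function names `avg_objects_per_property` and `most_frequent_subjects` are hypothetical, and counting distinct objects per property is an assumption about how the average is defined.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples (subject, property, object); not taken from any real dataset.
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:knows", "ex:alice"),
    ("ex:bob", "foaf:name", '"Bob"'),
]


def avg_objects_per_property(triples):
    """Average number of distinct objects per property (assumed definition)."""
    objects = defaultdict(set)
    for _, prop, obj in triples:
        objects[prop].add(obj)
    if not objects:
        return 0.0
    return sum(len(objs) for objs in objects.values()) / len(objects)


def most_frequent_subjects(triples, k=1):
    """The k subjects that appear in the most triples, with their counts."""
    counts = Counter(subj for subj, _, _ in triples)
    return counts.most_common(k)


if __name__ == "__main__":
    print(avg_objects_per_property(TRIPLES))
    print(most_frequent_subjects(TRIPLES))
```

Both functions make a single pass over the data, which is feasible only when all triples are locally available; the abstract's point is that this is exactly what fair use policies on public endpoints prevent.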
