Abstract. We address the problem of scalable distributed reasoning, proposing a technique for materialising the closure of an RDF graph based on MapReduce. We have implemented our approach on top of Hadoop and deployed it on a compute cluster of up to 64 commodity machines. We show that a naive implementation on top of MapReduce is straightforward but performs badly, and we present several non-trivial optimisations. Our algorithm is scalable and allows us to compute the RDFS closure of 865M triples from the Web (producing 30B triples) in less than two hours, faster than any other published approach.
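To make the naive MapReduce encoding concrete, the sketch below shows how a single RDFS rule (rdfs9: an instance of a class is also an instance of its superclasses) can be expressed as one Hadoop map/reduce pair. This is a minimal illustration under our own assumptions (tab-separated subject/predicate/object input lines, illustrative class names), not the paper's implementation.

```java
// Naive MapReduce encoding of RDFS rule rdfs9:
// (s rdf:type C) + (C rdfs:subClassOf D)  =>  (s rdf:type D).
// Input triples are assumed to be tab-separated "s p o" lines; all class
// and tag names here are illustrative, not taken from the paper's code.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Rdfs9Naive {
  static final String TYPE = "rdf:type";
  static final String SUBCLASS = "rdfs:subClassOf";

  // Map: key every relevant triple on the term shared by the two rule
  // antecedents (the class C), tagging where it occurred.
  public static class JoinMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] t = line.toString().split("\t");
      if (t.length != 3) return;
      if (TYPE.equals(t[1])) {
        ctx.write(new Text(t[2]), new Text("I\t" + t[0])); // instance of C
      } else if (SUBCLASS.equals(t[1])) {
        ctx.write(new Text(t[0]), new Text("S\t" + t[2])); // C subclass of D
      }
    }
  }

  // Reduce: join the instances of C with the superclasses of C and emit
  // the derived rdf:type triples.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text clazz, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> instances = new ArrayList<>();
      List<String> superclasses = new ArrayList<>();
      for (Text v : values) {
        String[] p = v.toString().split("\t");
        if ("I".equals(p[0])) instances.add(p[1]);
        else superclasses.add(p[1]);
      }
      for (String s : instances)
        for (String d : superclasses)
          ctx.write(new Text(s), new Text(TYPE + "\t" + d));
    }
  }
}
```

A job like this must be re-run until no new triples appear, and each pass re-derives the same conclusions; such fixpoint iteration and duplicate derivation are the kind of inefficiencies that make the naive approach perform badly and that the paper's optimisations target.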
Abstract. In previous work we have shown that the MapReduce framework for distributed computation can be deployed for highly scalable inference over RDF graphs under the RDF Schema semantics. Unfortunately, several key optimizations that enabled scalable RDFS inference do not generalize to the richer OWL semantics. In this paper we analyze these problems and propose solutions to overcome them. Our solutions allow distributed computation of the closure of an RDF graph under the OWL Horst semantics. We demonstrate the WebPIE inference engine, built on top of the Hadoop platform and deployed on a compute cluster of 64 machines. We have evaluated our approach using real-world datasets (UniProt and LDSR, about 0.9-1.5 billion triples) and a synthetic benchmark (LUBM, up to 100 billion triples). The results show that our implementation is scalable and vastly outperforms current systems in terms of supported language expressivity, maximum data size, and inference speed.
Abstract. The Semantic Web contains many billions of statements, which are released using the Resource Description Framework (RDF) data model. To better handle these large amounts of data, high performance RDF applications must apply a compression technique. Unfortunately, because of the large input size, even this compression is challenging. In this paper, we propose a set of distributed MapReduce algorithms to efficiently compress and decompress a large amount of RDF data. Our approach uses a dictionary encoding technique that maintains the structure of the data. We highlight the problems of distributed data compression and describe the solutions that we propose. We have implemented a prototype using the Hadoop framework, and evaluate its performance. We show that our approach is able to efficiently compress a large amount of data and scales linearly on both input size and number of nodes.

To make dictionary encoding a feasible technique on a very large input, a distributed implementation is required. To the best of our knowledge, no distributed approach exists to solve this problem. In this paper, we propose a technique to compress and decompress RDF statements using the MapReduce programming model [6]. Our approach uses a dictionary encoding technique that maintains the original structure of the data. This technique can be used by all RDF applications that need to efficiently process a large amount of data, such as RDF storage engines, network analysis tools, and reasoners. Our compression technique was essential in our recent work on Semantic Web inference engines, as it allowed us to reason directly on the compressed statements with a consequent increase of performance. As a result, we were able to reason over tens of billions of statements [7,8], significantly advancing the current state of the art in the field. The compression technique we present in this paper has the following properties: (i) performance that scales linearly; (ii) the ability to build a very large dictionary of hundreds of millions of entries; and (iii) the ability to handle load balancing issues with sampling and caching.

This paper is structured as follows. In Section 2, we discuss the conventional approach to dictionary encoding and highlight the problems that arise. Sections 3 and 4 describe how we have implemented the data compression and decompression in MapReduce. Section 5 evaluates our approach, and Section 6 describes related work. Finally, we conclude and discuss future work in Section 7.

2. Dictionary Encoding

Dictionary encoding is often used because of its simplicity. In our case, dictionary encoding also has the additional advantage that the compressed data can still be manipulated by the application. Traditional techniques such as gzip or bzip2 hide the original data so that reading without decompression is impossible. Algorithm 1 shows a sequential algorithm to compress and decompress RDF statements. The compression algorithm starts by initializing the dictionary table. The table has two columns, one that…
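The excerpt above breaks off while describing the sequential baseline of Algorithm 1. As a point of reference, here is a minimal sketch of what such a two-column, in-memory dictionary encoder could look like; the class and method names are ours, and this is a reconstruction of the general technique, not the paper's code.

```java
// A minimal sketch of a sequential dictionary encoder for RDF terms:
// a two-column table mapping each term to a numeric ID and back.
// All names here are illustrative assumptions.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SequentialDictionaryEncoder {
  private final Map<String, Long> termToId = new HashMap<>(); // term -> ID
  private final List<String> idToTerm = new ArrayList<>();    // ID -> term

  // Compress one triple: replace each term by its dictionary ID,
  // assigning a fresh ID on first occurrence.
  public long[] encode(String subject, String predicate, String object) {
    return new long[] { idFor(subject), idFor(predicate), idFor(object) };
  }

  private long idFor(String term) {
    Long id = termToId.get(term);
    if (id == null) {
      id = (long) idToTerm.size();
      termToId.put(term, id);
      idToTerm.add(term);
    }
    return id;
  }

  // Decompress: look the IDs back up in the dictionary.
  public String[] decode(long[] triple) {
    return new String[] {
      idToTerm.get((int) triple[0]),
      idToTerm.get((int) triple[1]),
      idToTerm.get((int) triple[2])
    };
  }
}
```

The limitation of such a baseline is exactly what the paper addresses: the whole dictionary must fit in one machine's memory and is updated serially, so building a dictionary of hundreds of millions of entries calls for the distributed MapReduce construction.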
Abstract. Materialization and backward-chaining, as different modes of performing inference, have complementary advantages and disadvantages. Materialization enables very efficient responses at query time, but at the cost of an expensive up-front closure computation, which needs to be redone every time the knowledge base changes. Backward-chaining does not need such an expensive and change-sensitive precomputation, and is therefore suitable for more frequently changing knowledge bases, but has to perform more computation at query time. Materialization has been studied extensively in the recent Semantic Web literature and is now available in industrial-strength systems. In this work, we focus instead on backward-chaining, and we present a hybrid algorithm to perform efficient backward-chaining reasoning on very large datasets expressed in the OWL Horst (pD*) fragment. As a proof of concept, we have implemented a prototype called QueryPIE (Query Parallel Inference Engine), and we have tested its performance on different datasets of up to 1 billion triples. Our parallel implementation greatly reduces the reasoning complexity of a naive backward-chaining approach and returns results for single query patterns in the order of milliseconds when running on a modest 8-machine cluster. To the best of our knowledge, QueryPIE is the first reported backward-chaining reasoner for OWL Horst that efficiently scales to a billion triples.
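For contrast with materialization, the sketch below shows naive backward-chaining for one RDFS rule: nothing is precomputed, and superclass membership is derived recursively at query time. It is our own minimal example illustrating the per-query recursion whose cost a hybrid approach like QueryPIE reduces, not QueryPIE's algorithm.

```java
// Naive backward-chaining for the rule
// (s rdf:type B) + (B rdfs:subClassOf C)  =>  (s rdf:type C):
// superclass memberships are derived recursively at query time rather
// than materialized up front. All names here are illustrative.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BackwardChainer {
  private final List<String[]> triples; // asserted (s, p, o) triples

  public BackwardChainer(List<String[]> triples) { this.triples = triples; }

  // Prove the pattern (s rdf:type c): either it is asserted directly, or
  // there is some class b with (b rdfs:subClassOf c) and (s rdf:type b).
  public boolean hasType(String s, String c, Set<String> visited) {
    if (!visited.add(c)) return false; // guard against subclass cycles
    if (asserted(s, "rdf:type", c)) return true;
    for (String[] t : triples)
      if ("rdfs:subClassOf".equals(t[1]) && c.equals(t[2])
          && hasType(s, t[0], visited)) return true;
    return false;
  }

  private boolean asserted(String s, String p, String o) {
    for (String[] t : triples)
      if (t[0].equals(s) && t[1].equals(p) && t[2].equals(o)) return true;
    return false;
  }
}
```

A call such as new BackwardChainer(triples).hasType("ex:alice", "ex:Agent", new HashSet<>()) walks the subclass hierarchy on demand, which is cheap for a single pattern but, repeated naively for every pattern in every query, produces the kind of redundant work the paper's optimizations are designed to avoid.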
Abstract. In the last few years a new research area, called stream reasoning, has emerged to bridge the gap between reasoning and stream processing. While current reasoning approaches are designed to work mainly on static data, the Web is, on the other hand, extremely dynamic: information is frequently changed and updated, and new data is continuously generated from a huge number of sources, often at a high rate. In other words, fresh information is constantly made available in the form of streams of new data and updates. Despite some promising investigations in the area, stream reasoning is still in its infancy, both in the development of models and theories and in the design and implementation of systems and tools. The aim of this paper is threefold: (i) we identify the requirements coming from different application scenarios and isolate the problems they pose; (ii) we survey existing approaches and proposals in the area of stream reasoning, highlighting their strengths and limitations; and (iii) we draw a research agenda to guide future research and development in stream reasoning. In doing so, we also analyze related research fields to extract algorithms, models, techniques, and solutions that could be useful in the area of stream reasoning.