Abstract: This paper reports on the results of an independent evaluation of the techniques presented in the VLDB 2007 paper "Scalable Semantic Web Data Management Using Vertical Partitioning", authored by D. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach [1]. We revisit the proposed benchmark and examine both the data and query space coverage. The benchmark is extended to cover a larger portion of the query space in a canonical way. Repeatability of the experiments is assessed using the code base obtained from the au…
“…The number of self-joins in the plan corresponds to the number of properties co-located in a table. The phenomenon is reminiscent of the debate concerning the use of row-stores vs. column-stores [12,21,44,47]. Consideration of row-stores vs. column-stores is outside the scope of this paper.…”
Section: Self-join Elimination (mentioning)
confidence: 99%
“…Consideration of row-stores vs. column-stores is outside the scope of this paper. Nevertheless, we note that there is debate within the community on the use of row-stores or column-stores for native RDF data and our measurements may help ground that debate [12,21,44,47].…”
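The self-join elimination discussed in the excerpts above can be sketched concretely. The following is a minimal illustration using SQLite with a hypothetical clustered property table (the `props` table and its columns are illustrative, not the paper's schema): a pattern-per-join translation touches the table once per property, producing a self-join, while the optimized plan reads each row once.

```python
import sqlite3

# Hypothetical clustered property table co-locating two properties
# (title, year) of each subject in one wide row.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE props (subject TEXT PRIMARY KEY, title TEXT, year INTEGER);
INSERT INTO props VALUES ('b1', 'SPARQL Basics', 2007), ('b2', 'RDF Stores', 2009);
""")

# A triple-pattern-per-join translation produces one self-join per extra
# property touched, even though both properties share the same row:
naive = con.execute("""
    SELECT p1.subject, p1.title, p2.year
    FROM props p1 JOIN props p2 ON p1.subject = p2.subject
    ORDER BY p1.subject
""").fetchall()

# Self-join elimination rewrites the plan into a single scan of the table:
optimized = con.execute(
    "SELECT subject, title, year FROM props ORDER BY subject"
).fetchall()

assert naive == optimized  # identical answers, one fewer join
```

The number of eliminated self-joins grows with the number of properties co-located in the table, which is why the optimization matters for wide property tables.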
Abstract: The Semantic Web's promise to achieve web-wide data integration requires the inclusion of legacy relational data as RDF, which, in turn, requires the execution of SPARQL queries on the legacy relational database. In this paper we explore a hypothesis: existing commercial relational databases already subsume the algorithms and optimizations needed to support effective SPARQL execution on existing relationally stored data. The experiment, embodied in a system called Ultrawrap, comprises encoding a logical representation of the database as a graph using SQL views and a simple syntactic translation of SPARQL queries to SQL queries on those views. Thus, in the course of executing a SPARQL query, the SQL optimizer both instantiates a mapping of relational data to RDF and optimizes its execution. Other approaches typically implement aspects of query optimization and execution outside the SQL environment. Ultrawrap is evaluated using two benchmarks across the three major relational database management systems. We identify two important optimizations: detection of unsatisfiable conditions and self-join elimination, such that, when applied, SPARQL queries execute at nearly the same speed as semantically equivalent native SQL queries, providing strong evidence of the validity of the hypothesis.
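The view-based encoding described in the Ultrawrap abstract can be sketched in a few lines of SQLite. The schema and predicate names below are illustrative assumptions, not Ultrawrap's actual mapping: a legacy table is exposed as (subject, predicate, object) triples through a `UNION ALL` view, and a translated SPARQL pattern becomes a filter on that view, which detection of unsatisfiable conditions can prune back to a plain column scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
INSERT INTO employee VALUES (1, 'Ada', 'Eng'), (2, 'Grace', 'Ops');

-- A triple view in the spirit of Ultrawrap: the legacy table exposed
-- as (subject, predicate, object) rows via UNION ALL.
CREATE VIEW triples AS
    SELECT 'emp:' || id AS s, 'name' AS p, name AS o FROM employee
    UNION ALL
    SELECT 'emp:' || id, 'dept', dept FROM employee;
""")

# Syntactic translation of the SPARQL pattern { ?s :name ?o } onto the view.
# Detection of unsatisfiable conditions lets the optimizer discard the
# p = 'dept' branch of the union, leaving an ordinary scan of employee.name.
rows = con.execute(
    "SELECT s, o FROM triples WHERE p = 'name' ORDER BY s"
).fetchall()
```

On the pruned branch, the query plan degenerates to the same scan a hand-written SQL query would use, which is the mechanism behind the abstract's claim of near-native speed.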
“…Our attention is also drawn to column-oriented databases [7,95], which can be customized for provenance management, and to provenance reduction techniques [28], which can decrease storage requirements via duplicate elimination and provenance inheritance. Finally, we would like to consider querying and managing scientific workflow provenance in distributed environments with multiple computing nodes, enabling the processing of huge datasets with billions of triples.…”
Provenance metadata has become increasingly important to support the reproducibility of scientific discoveries, result interpretation, and problem diagnosis in scientific workflow environments. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata. Our approach to provenance management seamlessly integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. In this paper, we elaborate on the design of a relational RDF store, called RDFPROV, which is optimized for scientific workflow provenance querying and management. Specifically, we propose: i) two schema mapping algorithms to map an OWL provenance ontology to a relational database schema that is optimized for common provenance queries; ii) three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema; and iii) a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. Experimental results are presented to show that our algorithms are efficient and scalable. The comparison with two popular relational RDF stores, Jena and Sesame, and two commercial native RDF stores, AllegroGraph and BigOWLIM, showed that our optimizations result in improved performance and scalability for provenance metadata management. Finally, our case study for provenance management in a real-life biological simulation workflow showed the production quality and capability of the RDFPROV system.
Although presented in the context of scientific workflow provenance management, many of our proposed techniques apply to general RDF data management as well.
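The schema- and data-mapping steps the RDFPROV abstract enumerates can be illustrated with a toy sketch. Everything below is an assumption for illustration (the triples, the `Workflow` class, and the `startedAt` property are invented, and the mapping is a drastic simplification of RDFPROV's algorithms): instances are grouped by subject, each ontology class becomes a relation, and each property becomes a column.

```python
import sqlite3
from collections import defaultdict

# Toy provenance triples (subject, predicate, object); the vocabulary is
# hypothetical, not RDFPROV's.
triples = [
    ("run1", "rdf:type", "Workflow"),
    ("run1", "startedAt", "2024-01-01"),
    ("run2", "rdf:type", "Workflow"),
    ("run2", "startedAt", "2024-01-02"),
]

# Data mapping: collect each subject's properties into one record.
by_subject = defaultdict(dict)
for s, p, o in triples:
    by_subject[s][p] = o

# Schema mapping: one relation per class, one column per property seen on
# instances of that class.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Workflow (subject TEXT PRIMARY KEY, startedAt TEXT)")
for s, props in by_subject.items():
    if props.get("rdf:type") == "Workflow":
        con.execute("INSERT INTO Workflow VALUES (?, ?)",
                    (s, props.get("startedAt")))

rows = con.execute(
    "SELECT subject, startedAt FROM Workflow ORDER BY subject"
).fetchall()
```

Once provenance is laid out this way, the common queries the abstract mentions become single-table SQL scans rather than joins over a triple table, which is the intuition behind optimizing the schema for the query workload.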
“…In addition, other researchers can exploit these available software artifacts as a valuable starting point for evaluating and assessing the significance of their own proposed contributions. One interesting example of the value of such independent evaluation studies is the study of Sidirourgos et al. [39], which reports an independent assessment of the results published by Abadi et al. in [4] describing an approach for implementing a vertically partitioned DBMS for Semantic Web data management. The outcomes of this independent assessment revealed many interesting aspects.…”
Section: Benchmarking Challenges in Computer Science (mentioning)
confidence: 99%
“…For instance, in [4] Abadi et al. reported that the performance of binary tables is superior to that of the clustered property table for processing RDF queries, while Sidirourgos et al. [39] reported that, even in a column-store database, the performance of binary tables is not always better than that of the clustered property table and depends on the characteristics of the data set. In addition, the experiments of [4] reported that storing RDF data in a column-store database is better than in a row-store database, while the experiments of [39] showed that the performance gain in a column-store database depends on the number of predicates in a data set. A main lesson from this example is that we cannot really be sure that published research results are accurate and comprehensive even if they were reported by the best scientists and went through the most rigorous peer review process.…”
Section: Benchmarking Challenges in Computer Science (mentioning)
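The binary-table versus clustered-property-table trade-off contrasted in the excerpt above can be made concrete. The following sketch uses SQLite with invented tables: in the vertically partitioned ("binary") layout a query touching k predicates needs k-1 subject-subject joins, while the clustered property table answers it with no join at all; both layouts return the same answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Vertically partitioned ("binary") layout: one two-column table
-- per predicate.
CREATE TABLE title (s TEXT, o TEXT);
CREATE TABLE year  (s TEXT, o INTEGER);
INSERT INTO title VALUES ('b1', 'SPARQL Basics');
INSERT INTO year  VALUES ('b1', 2007);

-- Clustered property table: co-accessed predicates stored in one row.
CREATE TABLE book (s TEXT, title TEXT, year INTEGER);
INSERT INTO book VALUES ('b1', 'SPARQL Basics', 2007);
""")

# Over binary tables, fetching two predicates of a subject costs one join...
binary = con.execute("""
    SELECT title.s, title.o, year.o
    FROM title JOIN year ON title.s = year.s
""").fetchall()

# ...over the clustered property table it is a single-row scan.
clustered = con.execute("SELECT s, title, year FROM book").fetchall()

assert binary == clustered
```

Which layout wins therefore depends on the workload and data set, for example on how many predicates a query touches and how sparsely the properties are populated, consistent with the contrasting findings of [4] and [39].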
Abstract: Performance evaluation, benchmarking, and reproducibility are significant aspects of assessing the practical impact of scientific research outcomes in the Computer Science field. In spite of all the benefits (e.g., increased visibility, boosted impact, improved research quality) that can be obtained from conducting comprehensive and extensive experimental evaluations or providing reproducible software artifacts and detailed descriptions of experimental setups, the effort required to achieve these goals remains prohibitive. In this article, we present the design and implementation details of the Liquid Benchmarking platform, a social and cloud-based platform for democratizing and socializing software benchmarking processes. In particular, the platform facilitates the sharing of experimental artifacts (computing resources, datasets, software implementations, benchmarking tasks) as services, so that end users can easily design, mash up, and execute experiments and visualize experimental results with zero installation or configuration effort. Moreover, the social features of the platform enable users to share and provide feedback on the results of executed experiments in a form that guarantees a transparent scientific crediting process. Finally, we present four benchmarking case studies realized via the Liquid Benchmarking platform in the following domains: XML compression techniques, graph indexing and querying techniques, string similarity join algorithms, and reverse k-nearest-neighbor algorithms.