Provenance metadata has become increasingly important for supporting the reproducibility of scientific discoveries, result interpretation, and problem diagnosis in scientific workflow environments. The provenance management problem concerns the efficiency and effectiveness of the modeling, recording, representation, integration, storage, and querying of provenance metadata. Our approach to provenance management integrates the interoperability, extensibility, and inference advantages of Semantic Web technologies with the storage and querying power of an RDBMS to meet the emerging requirements of scientific workflow provenance management. In this paper, we elaborate on the design of a relational RDF store, called RDFPROV, which is optimized for scientific workflow provenance querying and management. Specifically, we propose: i) two schema mapping algorithms to map an OWL provenance ontology to a relational database schema that is optimized for common provenance queries; ii) three efficient data mapping algorithms to map provenance RDF metadata to relational data according to the generated relational database schema; and iii) a schema-independent SPARQL-to-SQL translation algorithm that is optimized on-the-fly by using the type information of an instance available from the input provenance ontology and the statistics of the sizes of the tables in the database. Experimental results show that our algorithms are efficient and scalable. A comparison with two popular relational RDF stores, Jena and Sesame, and two commercial native RDF stores, AllegroGraph and BigOWLIM, shows that our optimizations result in improved performance and scalability for provenance metadata management. Finally, our case study of provenance management in a real-life biological simulation workflow demonstrates the production quality and capability of the RDFPROV system. Although presented in the context of scientific workflow provenance management, many of our proposed techniques apply to general RDF data management as well.

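To make the idea of SPARQL-to-SQL translation concrete, the following is a minimal sketch of translating a basic graph pattern into SQL over a relational store. It is not the RDFPROV algorithm: it assumes a single generic `triples(s, p, o)` table rather than an ontology-derived schema, and it performs none of the type-based or statistics-based optimizations mentioned above. The table name, the function `bgp_to_sql`, and the sample provenance triples are all hypothetical.

```python
import sqlite3


def bgp_to_sql(patterns):
    """Translate (s, p, o) triple patterns into one SQL self-join query.

    Terms starting with '?' are variables; everything else is a constant.
    Assumption: all data lives in a hypothetical `triples(s, p, o)` table.
    """
    select, where, params = [], [], []
    first_seen = {}  # variable name -> first column reference that binds it
    for i, pattern in enumerate(patterns):
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"t{i}.{col}"
            if term.startswith("?"):
                if term in first_seen:
                    # Repeated variable: emit a join condition.
                    where.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
                    select.append(f'{ref} AS "{term[1:]}"')
            else:
                # Constant: bind it as a query parameter.
                where.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    sql = f"SELECT {', '.join(select or ['1'])} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


def demo():
    """Load a few made-up provenance triples and run a two-pattern query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [
            ("run1", "rdf:type", "ex:Run"),
            ("run1", "ex:produced", "fileA"),
            ("run2", "rdf:type", "ex:Run"),
            ("run2", "ex:produced", "fileB"),
            ("fileA", "rdf:type", "ex:File"),
        ],
    )
    # Roughly: SELECT ?run ?out WHERE { ?run rdf:type ex:Run . ?run ex:produced ?out }
    sql, params = bgp_to_sql(
        [("?run", "rdf:type", "ex:Run"), ("?run", "ex:produced", "?out")]
    )
    return sorted(conn.execute(sql, params).fetchall())
```

Each triple pattern becomes one alias of the table, shared variables become join conditions, and constants become bound parameters. A schema derived from an ontology, as the abstract describes, would presumably replace the single table with several narrower ones, which is where table-size statistics and type information could guide the choice of tables to join.
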
Various computing and data resources on the Web are being enhanced with
machine-interpretable semantic descriptions to facilitate better search,
discovery and integration. This interconnected metadata constitutes the
Semantic Web, whose volume can potentially grow to the scale of the Web. Efficient
management of Semantic Web data, expressed using the W3C's Resource Description
Framework (RDF), is crucial for supporting new data-intensive,
semantics-enabled applications. In this work, we study and compare two
approaches to distributed RDF data management based on emerging cloud computing
technologies and traditional relational database clustering technologies. In
particular, we design distributed RDF data storage and querying schemes for
HBase and MySQL Cluster and conduct an empirical comparison of these approaches
on a cluster of commodity machines using datasets and queries from the Third
Provenance Challenge and Lehigh University Benchmark. Our study reveals
interesting patterns in query evaluation, shows that our algorithms are
promising, and suggests that cloud computing has great potential for scalable
Semantic Web data management.

Comment: In Proc. of the 4th IEEE International Conference on Cloud Computing
(CLOUD'11
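
As a rough illustration of what a distributed RDF storage scheme can look like, the sketch below simulates a wide-column layout in plain Python: one index keyed by subject and one keyed by object, with rows spread over several simulated nodes by hashing the row key. This is an assumption-laden toy, not the scheme designed in the paper and not the HBase or MySQL Cluster API; the class `ShardedTripleStore`, its methods, and the hash-based placement are all hypothetical.

```python
import zlib
from collections import defaultdict


class ShardedTripleStore:
    """Toy in-memory stand-in for a distributed, row-keyed triple store."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # Per simulated node: row key -> predicate (column) -> set of values.
        self.by_subject = [
            defaultdict(lambda: defaultdict(set)) for _ in range(num_nodes)
        ]
        self.by_object = [
            defaultdict(lambda: defaultdict(set)) for _ in range(num_nodes)
        ]

    def _node(self, key):
        # Deterministic hash so a row key always maps to the same node.
        return zlib.crc32(key.encode("utf-8")) % self.num_nodes

    def add(self, s, p, o):
        # Store the triple twice so both lookup directions hit a single row.
        self.by_subject[self._node(s)][s][p].add(o)
        self.by_object[self._node(o)][o][p].add(s)

    def objects(self, s, p):
        """Answer the pattern (s, p, ?o) from the subject-keyed index."""
        row = self.by_subject[self._node(s)].get(s)
        return set(row[p]) if row and p in row else set()

    def subjects(self, p, o):
        """Answer the pattern (?s, p, o) from the object-keyed index."""
        row = self.by_object[self._node(o)].get(o)
        return set(row[p]) if row and p in row else set()
```

Storing each triple under both its subject and its object trades extra space for single-row lookups in either direction, so a pattern with a known subject or a known object touches only one node in this toy model.
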