The Open Provenance Model is a model of provenance that is designed to meet the following requirements: (1) To allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model. (2) To allow developers to build and share tools that operate on such a provenance model. (3) To define provenance in a precise, technology-agnostic manner. (4) To support a digital representation of provenance for any "thing", whether produced by computer systems or not. (5) To allow multiple levels of description to coexist. (6) To define a core set of rules that identify the valid inferences that can be made on provenance representations. This document contains the specification of the Open Provenance Model (v1.1) resulting from a community effort to achieve interoperability in the Third Provenance Challenge.
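As an informal illustration only, the sketch below shows the kind of graph structure and inference such a shared model suggests: artifacts and processes connected by dependency edges, with a transitive derivation query as one example of a rule-based inference. The class and method names are our own assumptions, not part of the OPM specification.

```python
# A minimal, illustrative sketch of an OPM-style provenance graph.
# Edge names loosely follow OPM vocabulary ("used", "wasGeneratedBy",
# "wasDerivedFrom"); the API itself is hypothetical.
from collections import defaultdict

class ProvenanceGraph:
    def __init__(self):
        # relation name -> set of (effect, cause) pairs
        self.edges = defaultdict(set)

    def used(self, process, artifact):
        self.edges["used"].add((process, artifact))

    def was_generated_by(self, artifact, process):
        self.edges["wasGeneratedBy"].add((artifact, process))

    def was_derived_from(self, artifact, source_artifact):
        self.edges["wasDerivedFrom"].add((artifact, source_artifact))

    def derivations(self, artifact):
        """Transitive closure of wasDerivedFrom -- one example of a
        'valid inference' a tool could make over the representation."""
        seen, stack = set(), [artifact]
        while stack:
            current = stack.pop()
            for effect, cause in self.edges["wasDerivedFrom"]:
                if effect == current and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen

# Example: raw scan -> aligned image -> atlas slice
g = ProvenanceGraph()
g.used("align", "raw_scan")
g.was_generated_by("aligned_image", "align")
g.was_derived_from("aligned_image", "raw_scan")
g.was_derived_from("atlas_slice", "aligned_image")
print(g.derivations("atlas_slice"))   # {'aligned_image', 'raw_scan'}
```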
The first Provenance Challenge was set up in order to provide a forum for the community to understand the capabilities of different provenance systems and the expressiveness of their provenance representations. To this end, a functional magnetic resonance imaging workflow was defined, which participants had to either simulate or run in order to produce some provenance representation, from which a set of identified queries had to be implemented and executed. Sixteen teams responded to the challenge, and submitted their inputs. In this paper, we present the challenge workflow and queries, and summarize the participants' contributions.
In the study of fine art, provenance refers to the documented history of some art object. Given that documented history, the object attains an authority that allows scholars to appreciate its importance with respect to other works, whereas, in the absence of such history, the object may be treated with some skepticism. Our IT landscape is evolving, as illustrated by applications that are open, dynamically composed, and that discover results and services on the fly. Against this challenging background, it is crucial for users to be able to have confidence in the results produced by such applications. If the provenance of data produced by computer systems could be determined as it can for some works of art, then users, in their daily applications, would be able to interpret and judge the quality of data better. We introduce a provenance lifecycle and advocate an open approach based on two key principles to support a notion of provenance in computer systems: documentation of execution and user-tailored provenance queries.
The transition from laboratory science to in silico e-science has facilitated a paradigmatic shift in the way we conduct modern science. We can use computationally based analytical models to simulate and investigate scientific questions such as those posed by high-energy physics and bioinformatics, yielding high-quality results and discoveries at an unprecedented rate. However, while experimental media have changed, the scientific methodologies and processes we choose for conducting experiments are still relevant. As in the lab environment, experimental methodology requires samples (or in this case, data) to undergo several processing stages. The staging of operations is what constitutes the in silico experimental process.

Initial bioinformatics experiments typically required passing data through several programs in sequence. We'd format the data to conform to application-dependent file formats and then pass it through selected scientific applications or services, which would yield a handful of results or generate new data. This new data would in turn require reformatting and passing through other services. Often, a bioinformatician would have to manually transfer results between services by noting these values and rekeying them into a new interface or by cutting and pasting. Although problematic and error prone, this approach facilitated scientific exploration through experimentation with different hypotheses using different services. This service-oriented approach underpins emerging technologies such as Web Services and the Grid.

The use of workflows formalizes earlier ad hoc approaches for representing experimental methodology. We can represent the stages of in silico experiments formally as a set of services to invoke. Although this formalization can simplify the representation of experimental methodology, referring to specific services limits the utility, portability, and scalability of such workflows. They're prone to the removal or modification of any of the services on which they depend. We can't readily share workflows with colleagues or execute them on other computer infrastructures unless the same services exist on the new infrastructure. Even in an open, shared-services environment, several scientists invoking the same workflow would result in service contention, because each workflow would require the same instances. Additionally, social and human factors add further constraints: to preserve their intellectual property, scientists prefer to publish their experiments' structure while keeping the invoked service instances' details private.

By abstracting the workflows, we can construct workflow templates representing the type or class of service to invoke at each experimental stage, without specifying which instance of the service should be used. To use a template, we instantiate the abstracted service representations according to the available services and then manage the data flow appropriately to ensure interoperation between the services. In this article, we address how to use workflow resolution to p...
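The template idea can be made concrete with a small sketch. Everything below is a hypothetical illustration (the registry, stage names, and resolution policy are assumptions, not the article's implementation): an abstract template lists service types, a registry maps each type to the instances currently available, and resolution picks an instance per stage before the data is threaded through.

```python
# Hypothetical sketch: resolve an abstract workflow template to concrete
# service instances at run time, then thread data through the stages.
from typing import Callable, Dict, List

# Abstract template: an ordered list of service *types*, not instances.
TEMPLATE: List[str] = ["sequence_formatter", "blast_search", "result_parser"]

# A registry of available instances per service type; in practice this
# would be populated by service discovery rather than hard-coded.
REGISTRY: Dict[str, List[Callable[[str], str]]] = {
    "sequence_formatter": [lambda data: data.strip().upper()],
    "blast_search":       [lambda data: f"hits-for({data})"],
    "result_parser":      [lambda data: data.replace("hits-for", "parsed")],
}

def resolve(template: List[str]) -> List[Callable[[str], str]]:
    """Pick one available instance for each abstract stage."""
    return [REGISTRY[stage][0] for stage in template]

def run(template: List[str], data: str) -> str:
    for service in resolve(template):
        data = service(data)   # manage the data flow between stages
    return data

print(run(TEMPLATE, "  acgt  "))   # parsed(ACGT)
```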
Service-based architectures enable the development of new classes of Grid and distributed applications. One of the main capabilities provided by such systems is the dynamic and flexible integration of services, according to which services are allowed to be a part of more than one distributed system and simultaneously serve different applications. This increased flexibility in system composition makes it difficult to address classical distributed system issues such as fault-tolerance. While it is relatively easy to make an individual service fault-tolerant, improving fault-tolerance of services collaborating in multiple application scenarios is a challenging task. In this paper, we look at the issue of developing fault-tolerant service-based distributed systems, and propose an infrastructure to implement fault tolerance capabilities transparent to services.
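As a rough sketch of what "transparent to services" could mean in practice, the example below wraps unmodified service invocations in a client-side retry-and-failover proxy. This is our own simplification, not the infrastructure proposed in the paper; the names and the failure model are assumptions.

```python
# Illustrative only: a wrapper that adds retry and failover around
# otherwise unmodified services, so the calling application never sees
# the failure handling.

class ServiceUnavailable(Exception):
    pass

attempts = {"n": 0}

def flaky_service(payload):
    # Stand-in for a remote service invocation; fails on its first attempt.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ServiceUnavailable("transient failure")
    return f"processed:{payload}"

def fault_tolerant_call(replicas, payload, retries_per_replica=2):
    """Try each replica in turn, retrying transient failures."""
    for service in replicas:
        for _ in range(retries_per_replica):
            try:
                return service(payload)
            except ServiceUnavailable:
                continue   # retry, then fail over to the next replica
    raise ServiceUnavailable("all replicas exhausted")

print(fault_tolerant_call([flaky_service, flaky_service], "job-42"))
# -> processed:job-42 (first attempt fails, retry succeeds)
```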
A content-centric network is one which supports host-to-content routing, rather than the host-to-host routing of the existing Internet. This paper investigates the potential of caching data at the router-level in content-centric networks. To achieve this, two measurement sets are combined to gain an understanding of the potential caching benefits of deploying content-centric protocols over the current Internet topology. The first set of measurements is a study of the BitTorrent network, which provides detailed traces of content request patterns. This is then combined with CAIDA's ITDK Internet traces to replay the content requests over a real-world topology. Using this data, simulations are performed to measure how effective content-centric networking would have been if it were available to these consumers/providers. We find that larger cache sizes (10,000 packets) can create significant reductions in packet path lengths. On average, 2.02 hops are saved through caching (a 20% reduction), whilst also allowing 11% of data requests to be maintained within the requester's AS. Importantly, we also show that these benefits extend significantly beyond that of edge caching by allowing transit ASes to also reduce traffic.
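A toy version of this style of trace-driven replay is sketched below, assuming a single linear path of routers between a requester and the origin, each with a small LRU cache. The topology, cache policy, and workload here are illustrative assumptions, not the measurement setup used in the paper.

```python
# Toy replay of content requests over a path of caching routers.
# Each router keeps a small LRU cache; a hit ends the lookup early,
# shortening the path the request must travel.
from collections import OrderedDict

class Router:
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity

    def lookup(self, content_id):
        if content_id in self.cache:
            self.cache.move_to_end(content_id)
            return True
        return False

    def store(self, content_id):
        self.cache[content_id] = True
        self.cache.move_to_end(content_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

def replay(requests, path):
    """Average hops per request for a requester whose traffic traverses
    `path` (ordered from the requester toward the origin server)."""
    total_hops = 0
    for content_id in requests:
        hops = len(path) + 1          # default: all the way to the origin
        for i, router in enumerate(path):
            if router.lookup(content_id):
                hops = i + 1          # served from this router's cache
                break
        for router in path[:min(hops - 1, len(path))]:
            router.store(content_id)  # caches fill on the way back
        total_hops += hops
    return total_hops / len(requests)

path = [Router(capacity=100) for _ in range(5)]
requests = [f"chunk-{i % 50}" for i in range(1000)]   # repeating workload
print(replay(requests, path))
```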
Very large scale computations are now becoming routinely used as a methodology to undertake scientific research. In this context, 'provenance systems' are regarded as the equivalent of the scientist's logbook for in silico experimentation: provenance captures the documentation of the process that led to some result. Using a protein compressibility analysis application, we derive a set of generic use cases for a provenance system. In order to support these, we address the following fundamental questions: what is provenance? how to record it? what is the performance impact for grid execution? what is the performance of reasoning? In doing so, we define a technology-independent notion of provenance that captures interactions between components, internal component information and grouping of interactions, so as to allow us to analyse and reason about the execution of scientific processes. In order to support persistent provenance in heterogeneous applications, we introduce a separate provenance store, in which provenance documentation can be stored, archived and queried independently of the technology used to run the application. Through a series of practical tests, we evaluate the performance impact of such a provenance system. In summary, we demonstrate that provenance recording overhead of our prototype system remains under 10% of execution time, and we show that the recorded information successfully supports our use cases in a performant manner.
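The separation between application execution and a queryable provenance store can be sketched as follows. The record fields and API below are our own simplification of the interaction documentation described in the abstract, not the prototype's actual interface.

```python
# Illustrative sketch: components record their interactions in a
# provenance store that is independent of the execution technology.
import json, time

class ProvenanceStore:
    """A technology-independent store: here just an append-only log,
    but it could equally be a database or a remote service."""
    def __init__(self):
        self.records = []

    def record_interaction(self, sender, receiver, message, data_ids):
        self.records.append({
            "time": time.time(),
            "sender": sender,
            "receiver": receiver,
            "message": message,
            "data": list(data_ids),
        })

    def query(self, data_id):
        """Return every recorded interaction that mentions a data item."""
        return [r for r in self.records if data_id in r["data"]]

store = ProvenanceStore()
store.record_interaction("splitter", "compressor", "invoke", ["sample-7"])
store.record_interaction("compressor", "analyser", "result", ["sample-7", "ratio-7"])
print(json.dumps(store.query("sample-7"), indent=2))
```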