Abstract. SPADE is an open source software infrastructure for data provenance collection and management. The underlying data model used throughout the system is graph-based, consisting of vertices and directed edges that are modeled after the node and relationship types described in the Open Provenance Model. The system has been designed to decouple the collection, storage, and querying of provenance metadata. At its core is a novel provenance kernel that mediates between the producers and consumers of provenance information, and handles the persistent storage of records. It operates as a service, peering with remote instances to enable distributed provenance queries. The provenance kernel on each host handles the buffering, filtering, and multiplexing of incoming metadata from multiple sources, including the operating system, applications, and manual curation. Provenance elements can be located locally with queries that use wildcard, fuzzy, proximity, range, and Boolean operators. Ancestor and descendant queries are transparently propagated across hosts until a terminating expression is satisfied, while distributed path queries are accelerated with provenance sketches.
Abstract-Large data processing tasks can be effected using workflow management systems. When either the input data or the programs in the pipeline are modified, the workflow must be re-executed to ensure that the final output data is updated to reflect the changes. Since such re-computation can consume substantial resources, optimizing the system to avoid redundant computation is desirable. In the case of a workflow, the dependency relationships between files are specified at the outset and can be leveraged to track which programs need to be re-executed when particular files change. Current distributed systems cannot provide such functionality when no predefined workflows exist. In this paper, we present an architecture that provides functionality to produce both correct output as well as fast re-execution by leveraging the provenance of data to propagate changes along an implicit dependency graph. We explore the tradeoff between storage and availability by presenting a performance analysis of our rollback and re-execution scheme.
Abstract-Reproducibility has been a cornerstone of the scientific method for hundreds of years. The range of sources from which data now originates, the diversity of the individual manipulations performed, and the complexity of the orchestrations of these operations all limit the reproducibility that a scientist can ensure solely by manually recording their actions. We use an architecture where aggregation, fusion, and composition policies define how provenance records can be automatically merged to facilitate the analysis and reproducibility of experiments. We show that the overhead of collecting and storing provenance metadata can vary dramatically depending on the policy used to integrate it.
Abstract-Previous distributed anomaly detection efforts have operated on summary statistics gathered from each node. This has the advantage that the audit trail is limited in size since event sets can be succinctly represented. While this minimizes the bandwidth consumed and helps scale the detection to a large number of nodes, it limits the infrastructure's ability to identify the source of anomalies. We describe three optimizations that together allow fine-grained tracking of the sources of anomalous activity in a Grid, thereby facilitating precise responses. We demonstrate the scheme's scalability in terms of storage and network bandwidth overhead with an implementation on nodes running BOINC. The results generalize to other types of Grids as well.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.