Abstract-Large data processing tasks can be effected using workflow management systems. When either the input data or the programs in the pipeline are modified, the workflow must be re-executed to ensure that the final output data is updated to reflect the changes. Since such re-computation can consume substantial resources, optimizing the system to avoid redundant computation is desirable. In the case of a workflow, the dependency relationships between files are specified at the outset and can be leveraged to track which programs need to be re-executed when particular files change. Current distributed systems cannot provide such functionality when no predefined workflows exist. In this paper, we present an architecture that provides functionality to produce both correct output as well as fast re-execution by leveraging the provenance of data to propagate changes along an implicit dependency graph. We explore the tradeoff between storage and availability by presenting a performance analysis of our rollback and re-execution scheme.
Identifying when anomalous activity is correlated in a distributed system is useful for a range of applications from intrusion detection to tracking quality of service. The more specific the logs, the more precise the analysis they allow. However, collecting detailed logs from across a distributed system can deluge the network fabric. We present an architecture that allows fine-grained auditing on individual hosts, space-efficient representation of anomalous activity that can be centrally correlated, and tracing anomalies back to individual files and processes in the system. A key contribution is the design of an anomaly-provenance bridge that allows opaque digests of anomalies to be mapped back to their associated provenance.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.