Abstract. SPADE is an open source software infrastructure for data provenance collection and management. The underlying data model used throughout the system is graph-based, consisting of vertices and directed edges that are modeled after the node and relationship types described in the Open Provenance Model. The system has been designed to decouple the collection, storage, and querying of provenance metadata. At its core is a novel provenance kernel that mediates between the producers and consumers of provenance information, and handles the persistent storage of records. It operates as a service, peering with remote instances to enable distributed provenance queries. The provenance kernel on each host handles the buffering, filtering, and multiplexing of incoming metadata from multiple sources, including the operating system, applications, and manual curation. Provenance elements can be located locally with queries that use wildcard, fuzzy, proximity, range, and Boolean operators. Ancestor and descendant queries are transparently propagated across hosts until a terminating expression is satisfied, while distributed path queries are accelerated with provenance sketches.
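The graph data model the abstract describes can be made concrete with a small sketch. This is not SPADE's actual API, only a minimal illustration (class and method names are invented) of typed vertices joined by directed dependency edges, with an ancestor query computed as the transitive closure of those edges:

```python
from collections import defaultdict

class ProvenanceGraph:
    """Minimal sketch of a SPADE-style graph model: typed vertices
    (Process, Artifact, Agent, as in the Open Provenance Model)
    connected by directed, labeled dependency edges."""

    def __init__(self):
        self.vertices = {}               # id -> {"type": ..., "annotations": {...}}
        self.parents = defaultdict(set)  # child id -> {(parent id, relation), ...}

    def add_vertex(self, vid, vtype, **annotations):
        self.vertices[vid] = {"type": vtype, "annotations": annotations}

    def add_edge(self, child, parent, relation):
        # The edge points from the dependent element toward what it was
        # derived from, e.g. "Used", "WasGeneratedBy", "WasTriggeredBy".
        self.parents[child].add((parent, relation))

    def ancestors(self, vid):
        """Ancestor query: everything vid transitively depends on."""
        seen, stack = set(), [vid]
        while stack:
            for parent, _ in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

g = ProvenanceGraph()
g.add_vertex("raw.csv", "Artifact")
g.add_vertex("clean.py", "Process")
g.add_vertex("clean.csv", "Artifact")
g.add_edge("clean.py", "raw.csv", "Used")
g.add_edge("clean.csv", "clean.py", "WasGeneratedBy")
print(g.ancestors("clean.csv"))  # {'clean.py', 'raw.csv'}
```

In the real system, the same traversal is propagated across peered kernels on remote hosts until a terminating expression is satisfied; the sketch shows only the single-host case.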
Abstract-Reproducibility has been a cornerstone of the scientific method for hundreds of years. The range of sources from which data now originates, the diversity of the individual manipulations performed, and the complexity of the orchestrations of these operations all limit the reproducibility that a scientist can ensure solely by manually recording their actions. We use an architecture where aggregation, fusion, and composition policies define how provenance records can be automatically merged to facilitate the analysis and reproducibility of experiments. We show that the overhead of collecting and storing provenance metadata can vary dramatically depending on the policy used to integrate it.
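One way to picture a fusion policy merging provenance records is sketched below. The function and field names are hypothetical, not taken from the paper; the point is only that records from different reporters agreeing on a key can be combined into one record, with the policy deciding what happens to the remaining annotations:

```python
def fuse(records, key_fields, policy="union"):
    """Hypothetical fusion policy: records (dicts) that agree on
    key_fields are merged into a single record. 'union' keeps the
    first value seen for each annotation; a stricter policy raises
    on conflicting values instead of silently keeping one."""
    merged = {}
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        slot = merged.setdefault(key, {})
        for k, v in rec.items():
            if k in slot and slot[k] != v and policy != "union":
                raise ValueError(f"conflict on field {k!r}")
            slot.setdefault(k, v)
    return list(merged.values())

# Two reporters observed the same process with complementary detail.
a = {"pid": 42, "name": "curl", "source": "syscall"}
b = {"pid": 42, "name": "curl", "cmdline": "curl -O http://x"}
print(fuse([a, b], ["pid"]))
```

The choice of policy is exactly where the storage overhead mentioned in the abstract varies: aggressive merging stores one record per entity, while keeping every reporter's record intact multiplies the metadata volume.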
Abstract-Large data processing tasks can be effected using workflow management systems. When either the input data or the programs in the pipeline are modified, the workflow must be re-executed to ensure that the final output data is updated to reflect the changes. Since such re-computation can consume substantial resources, optimizing the system to avoid redundant computation is desirable. In the case of a workflow, the dependency relationships between files are specified at the outset and can be leveraged to track which programs need to be re-executed when particular files change. Current distributed systems cannot provide such functionality when no predefined workflows exist. In this paper, we present an architecture that provides functionality to produce both correct output as well as fast re-execution by leveraging the provenance of data to propagate changes along an implicit dependency graph. We explore the tradeoff between storage and availability by presenting a performance analysis of our rollback and re-execution scheme.
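The change-propagation idea can be sketched in a few lines. This is an illustration under assumed data structures (a map from each output file to the inputs its provenance recorded), not the paper's implementation: reversing the recorded dependencies and walking them from the changed files yields the set of derived files that must be re-executed.

```python
from collections import defaultdict, deque

def affected(deps, changed):
    """deps maps an output file to the inputs it was derived from
    (captured provenance). Returns every derived file that transitively
    depends on a changed file, discovered by BFS over reversed edges."""
    consumers = defaultdict(set)  # input -> outputs that read it
    for out, inputs in deps.items():
        for i in inputs:
            consumers[i].add(out)
    stale, queue, seen = [], deque(changed), set(changed)
    while queue:
        f = queue.popleft()
        for out in consumers[f]:
            if out not in seen:
                seen.add(out)
                stale.append(out)
                queue.append(out)
    return stale

deps = {"model.bin": ["train.csv", "train.py"],
        "report.pdf": ["model.bin", "plot.py"]}
print(affected(deps, ["train.csv"]))  # ['model.bin', 'report.pdf']
```

Anything not in the returned set is provably unaffected and can be served from storage, which is the source of the storage-versus-availability tradeoff the abstract analyzes.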
Abstract-Previous distributed anomaly detection efforts have operated on summary statistics gathered from each node. This has the advantage that the audit trail is limited in size since event sets can be succinctly represented. While this minimizes the bandwidth consumed and helps scale the detection to a large number of nodes, it limits the infrastructure's ability to identify the source of anomalies. We describe three optimizations that together allow fine-grained tracking of the sources of anomalous activity in a Grid, thereby facilitating precise responses. We demonstrate the scheme's scalability in terms of storage and network bandwidth overhead with an implementation on nodes running BOINC. The results generalize to other types of Grids as well.
No abstract

Identifying when anomalous activity is correlated in a distributed system is useful for a range of applications from intrusion detection to tracking quality of service. The more specific the logs, the more precise the analysis they allow. However, collecting detailed logs from across a distributed system can deluge the network fabric. We present an architecture that allows fine-grained auditing on individual hosts, space-efficient representation of anomalous activity that can be centrally correlated, and tracing anomalies back to individual files and processes in the system. A key contribution is the design of an anomaly-provenance bridge that allows opaque digests of anomalies to be mapped back to their associated provenance.
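The bridge concept can be illustrated with a toy sketch (class and method names are invented for illustration): each host ships only a fixed-size digest of an anomalous event set upstream, keeping a local table that maps the digest back to the files and processes named in its provenance records.

```python
import hashlib
import json

class AnomalyProvenanceBridge:
    """Toy sketch of an anomaly-provenance bridge: opaque digests go to
    the central correlator; the mapping back to local provenance stays
    on the host, so network traffic is small but attribution is exact."""

    def __init__(self):
        self._table = {}

    def report(self, events):
        """Summarize an anomalous event set; only the digest leaves the host."""
        blob = json.dumps(sorted(events, key=str), sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()
        self._table[digest] = events
        return digest

    def resolve(self, digest):
        """When central correlation flags a digest, map it back locally."""
        return self._table.get(digest)

bridge = AnomalyProvenanceBridge()
d = bridge.report([{"process": "/usr/bin/ssh", "file": "/etc/passwd"}])
print(len(d), bridge.resolve(d))  # 64-character digest; original event set
```

The digest is what crosses the network, so the correlator never sees raw audit detail; resolution back to concrete files and processes happens only on the host that observed them.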
We propose Dagger, a lightweight system to dynamically vet sensitive behaviors in Android apps. Dagger avoids costly instrumentation of virtual machines or modifications to the Android kernel. Instead, Dagger reconstructs the program semantics by tracking provenance relationships and observing apps' runtime interactions with the phone platform. More specifically, Dagger uses three types of low-level execution information at runtime: system calls, Android Binder transactions, and app process details. System call collection is performed via Strace [7], a low-latency utility for Linux and other Unix-like systems. Binder transactions are recorded by accessing Binder module logs via sysfs [8]. App process details are extracted from the Android /proc file system [6]. A data provenance graph is then built to record the interactions between the app and the phone system based on these three types of information. Dagger identifies behaviors by matching the provenance graph with the behavior graph patterns that are previously extracted from the internal working logic of the Android framework. We evaluate Dagger on both a set of over 1200 known malicious Android apps, and a second set of 1000 apps randomly selected from a corpus of over 18,000 Google Play apps. Our evaluation shows that Dagger can effectively vet sensitive behaviors in apps, especially for those using complex obfuscation techniques. We measured the overhead based on a representative benchmark app, and found that both the memory and CPU overhead are less than 10%. The runtime overhead is less than 63%, which is significantly lower than that of existing approaches.
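A heavily simplified picture of matching a provenance graph against behavior patterns is given below. This is not Dagger's algorithm; it is a toy where a pattern is a set of typed (subject, action, object) steps and matching only checks that each step occurs in the observed edges, whereas real subgraph matching must also bind identities consistently across steps:

```python
def matches(provenance_edges, pattern):
    """Toy behavior matching: each observed edge is
    ((subject, subject_type), action, (object, object_type));
    the pattern lists (subject_type, action, object_type) steps
    that must all appear somewhere in the graph."""
    observed = {(s_type, action, o_type)
                for (_, s_type), action, (_, o_type) in provenance_edges}
    return all(step in observed for step in pattern)

# Edges observed at runtime from syscalls and Binder transactions.
edges = [(("com.example.app", "process"), "read", ("contacts.db", "file")),
         (("com.example.app", "process"), "binder_call", ("SmsManager", "service"))]

# A hypothetical "read contacts, then reach an SMS service" pattern.
pattern = [("process", "read", "file"),
           ("process", "binder_call", "service")]
print(matches(edges, pattern))  # True
```

Because the pattern is stated over graph structure rather than code, it survives obfuscation of the app's bytecode, which is consistent with the abstract's claim about obfuscated apps.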
Emerging applications such as search-and-rescue operations, CNS (communication, navigation, surveillance), smart spaces, vehicular networks, mission-critical infrastructure, and disaster control require reliable content distribution under harsh network conditions and all kinds of component failures. In such scenarios, potentially heterogeneous networked components (where the networks lack reliable connections) need to be managed to improve scalability, performance, and availability of the overall system. Inspired by delay- and disruption-tolerant networking, this paper presents a distributed cross-layer monitoring and optimization method for secure content delivery as a first step toward decentralized content-based mobile ad hoc networking. In particular, we address the availability maximization problem by embedding monitoring and optimization within an existing content-distribution framework. The implications of policies at the security, caching, and hardware layers that control in-network storage and hop-by-hop dissemination of content are then analyzed to maximize content availability in disruptive environments. Additional benefits can be obtained by optimizing the control based on continuously observing the response to anomalies caused by cyber-attacks. For example, if excessive (potentially fraudulent) content is injected, the content distribution system can adapt without significantly compromising availability.