Manolis Stamatogiannakis scite author profile

Abstract. Knowing the provenance of a data item helps in ascertaining its trustworthiness. Various approaches have been proposed to track or infer data provenance. However, these approaches either treat an executing program as a black-box, limiting the fidelity of the captured provenance, or require developers to modify the program to make it provenance-aware. In this paper, we introduce DataTracker, a new approach to capturing data provenance based on taint tracking, a technique widely used in the security and reverse engineering fields. Our system is able to identify data provenance relations through dynamic instrumentation of unmodified binaries, without requiring access to, or knowledge of, their source code. Hence, we can track provenance for a variety of well-known applications. Because DataTracker looks inside the executing program, it captures high-fidelity and accurate data provenance.

show abstract

Trade-Offs in Automatic Provenance Capture

Stamatogiannakis

Kazmi

Sharif

et al. 2016

View full text Add to dashboard Cite

Abstract-Automatic provenance capture from arbitrary applications is a challenging problem. Different approaches to tackle this problem have evolved, most notably a. system-event trace analysis, b. compile-time static instrumentation, and c. taint flow analysis using dynamic binary instrumentation. Each of these approaches offers different trade-offs in terms of the granularity of captured provenance, integration requirements, and runtime overhead. While these aspects have been discussed separately, a systematic and detailed study, quantifying and elucidating them, is still lacking. To fill this gap, we begin to explore these trade-offs for representative examples of these approaches for automatic provenance capture by means of evaluation and measurement. We base our evaluation on UnixBench-a widely used benchmark suite within systems research. We believe this approach will make our results easier to compare with future studies.

show abstract

PROV _2R

Stamatogiannakis

Αθανασόπουλος

Bos

et al. 2017

ACM Trans. Internet Technol.

View full text Add to dashboard Cite

Information produced by Internet applications is inherently a result of processes that are executed locally. Think of a web server that makes use of a CGI script, or a content management system where a post was first edited using a word processor. Given the impact of these processes to the content published online, a consumer of that information may want to understand what those impacts were. For example, understanding from where text was copied and pasted to make a post, or if the CGI script was updated with the latest security patches, may all influence the confidence on the published content. Capturing and exposing this information provenance is thus important to ascertaining trust to online content. Furthermore, providers of internet applications may wish to have access to the same information for debugging or audit purposes. For processes following a rigid structure (such as databases or workflows), disclosed provenance systems have been developed that efficiently and accurately capture the provenance of the produced data. However, accurately capturing provenance from unstructured processes, for example, user-interactive computing used to produce web content, remains a problem to be tackled. In this article, we address the problem of capturing and exposing provenance from unstructured processes. Our approach, called PROV 2R ( PROV enance R ecord and R eplay) is composed of two parts: (a) the decoupling of provenance analysis from its capture; and (b) the capture of high-fidelity provenance from unmodified programs. We use techniques originating in the security and reverse engineering communities, namely, record and replay and taint tracking . Taint tracking fundamentally addresses the data provenance problem but is impractical to apply at runtime due to extremely high overhead. With a number of case studies, we demonstrate that PROV 2R enables the use of taint analysis for high-fidelity provenance capture, while keeping the runtime overhead at manageable levels. In addition, we show how captured information can be represented using the W3C PROV provenance model for exposure on the Web.

show abstract

PANDAcap

Stamatogiannakis

Bos

Groth

2020

View full text Add to dashboard Cite

and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal ? Take down policyIf you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.