Houjun Tang scite author profile

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze three representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO, which includes an I/O-centric provenance model for describing scientific data and the associated I/O operations and environments precisely. Moreover, we build a prototype of PROV-IO to enable end-to-end provenance support on real HPC systems with little manual effort. The PROV-IO framework provides flexibility in selecting various classes of provenance. Our experiments with realistic workflows show that PROV-IO can address the provenance needs of the domain scientists effectively with reasonable performance (e.g., less than 3.5% tracking overhead for most experiments). Moreover, PROV-IO outperforms a state-of-the-art system (i.e., ProvLake) in our experiments.

CCS CONCEPTS • Computer systems organization → Parallel architectures; • Information systems → Data management systems.

H5bench: a benchmark suite for parallel HDF5 (H5bench) v0.1

Byna, Tang, et al. 2021

Asynchronous I/O VOL Connector (AsyncVOL) v0.1

Tang, Koziol, Byna 2020


Houjun Tang

Improving nonnegative matrix factorization with advanced graph regularization

PROV-IO
