2018
DOI: 10.1101/268755
Preprint

Reproducible big data science: A case study in continuous FAIRness

Abstract: Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I…
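The abstract describes tooling that captures data and code and assigns identifiers throughout the data lifecycle. As an illustration of the general idea only (the excerpt does not name the paper's actual toolchain), the minimal Python sketch below derives a content-based identifier for a data file and records it in a local manifest; the manifest layout, the register_dataset helper, and the peaks.bed input are assumptions made for demonstration.

    # Minimal sketch: derive a content-based identifier for a data file and
    # record it in a local JSON manifest. Illustrative only; the field names,
    # helper names, and input file are assumptions, not the paper's tooling.
    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        """Stream the file so multi-gigabyte inputs do not exhaust memory."""
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def register_dataset(path: Path, manifest: Path = Path("manifest.json")) -> dict:
        """Append a checksum-based record for `path` to the manifest."""
        record = {
            "identifier": f"sha256:{sha256_of(path)}",  # content-derived, so any recipient can verify it
            "filename": path.name,
            "size_bytes": path.stat().st_size,
            "registered": datetime.now(timezone.utc).isoformat(),
        }
        entries = json.loads(manifest.read_text()) if manifest.exists() else []
        entries.append(record)
        manifest.write_text(json.dumps(entries, indent=2))
        return record

    if __name__ == "__main__":
        print(register_dataset(Path("peaks.bed")))  # hypothetical output of an analysis step

A content-derived identifier of this kind lets anyone who later obtains the file confirm it is byte-for-byte the object the analysis produced, which supports the verification and reuse aspects emphasized by FAIR.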


Cited by 7 publications (6 citation statements)
References 47 publications
“…Addressing the FAIR principles, Madduri et al [10] presented a set of tools that can be used to support the implementation of computational experiments, and ensure the aspects related to each of the principles. Madduri et al [10] used a case study in biomedicine in which data and computation are often distributed. As in this paper, the need to apply FAIR principles in the experiment as a whole is discussed.…”
Section: Related Work (mentioning)
confidence: 99%
“…A review of big data concerning health research was presented in [15]. Yet another case study for big data science to replicate complex analysis was designed in [16]. A framework for parallelization on big data using Apache Hadoop with its Map Reduce function was investigated in [17].…”
Section: Related Work (mentioning)
confidence: 99%
“…The demand for large volumes of multimodal biomedical data has grown drastically, partially due to active research in personalized medicine, and further understanding diseases [1][2][3] . This shift has made reproducing research findings much more challenging because of the need to ensure the use of adequate data handling methods, resulting in the validity and relevance of studies to be questioned 4,5 .…”
Section: Introduction (mentioning)
confidence: 99%
“…Even though sharing of data immensely helps in reproducing study results 6 , current sharing practices are inadequate with respect to the size of data and corresponding infrastructure requirements for transfer and storage 2,7 . As computational processing required to process biomedical data is becoming increasingly complex 3 , expertise is now needed for building the tools and workflows for this large-scale handling 1,2 . There have been multiple community efforts in creating standardized workflow languages, such as the Common Workflow Language (CWL) and the Workflow Definition Language (WDL), along with associated workflow management systems such as Snakemake 8 and Nextflow 9 , in order to promote reproducibility 10,11 .…”
Section: Introduction (mentioning)
confidence: 99%
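The excerpt above credits workflow languages (CWL, WDL) and workflow managers (Snakemake, Nextflow) with promoting reproducibility. As a language-neutral sketch of the core idea those systems automate (recording exactly which command ran, on which inputs, producing which outputs), the short Python example below executes one pipeline step and writes a provenance record with checksums; the step command, file names, and record fields are illustrative assumptions and are not part of any of the cited systems.

    # Sketch: run one pipeline step and capture a provenance record
    # (command plus input/output checksums) so the step can be audited and re-run.
    # The command, file names, and record fields are illustrative assumptions.
    import hashlib
    import json
    import subprocess
    from pathlib import Path

    def checksum(path: Path) -> str:
        """SHA-256 of a file, streamed in 1 MiB chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def run_step(cmd: list[str], inputs: list[Path], outputs: list[Path]) -> dict:
        """Execute `cmd`, then record checksums of its declared inputs and outputs."""
        subprocess.run(cmd, check=True)  # raise if the step fails, so bad outputs are never recorded
        record = {
            "command": cmd,
            "inputs": {str(p): checksum(p) for p in inputs},
            "outputs": {str(p): checksum(p) for p in outputs},
        }
        Path("provenance.json").write_text(json.dumps(record, indent=2))
        return record

    if __name__ == "__main__":
        # Hypothetical step: coordinate-sort a BED file of candidate binding sites.
        run_step(
            cmd=["sort", "-k1,1", "-k2,2n", "-o", "sites.sorted.bed", "sites.bed"],
            inputs=[Path("sites.bed")],
            outputs=[Path("sites.sorted.bed")],
        )

Workflow managers generalize this pattern: each rule or process declares its inputs and outputs, and the engine decides what needs to be recomputed and logs how it was produced.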