2019
DOI: 10.1371/journal.pone.0213013

Reproducible big data science: A case study in continuous FAIRness

Abstract: Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I…

Cited by 33 publications (30 citation statements)
References 47 publications
“…One solution is archiving the big datasets in online repositories or data stores and including the existing persistent identifiers and checksums in the RO instead of the actual data files, as previously demonstrated with BDBags [91,154]. While CWL executors like toil-cwl-runner can be configured to deposit data in a shared repository, the cwltool reference implementation explored in this study can only write to the local file system.…”
Section: Discussion and Future Directions
Citation type: mentioning
confidence: 99%
“…BibTeX format suitable for importing into citation managers such as EndNote, Mendeley or JabRef), cross-referencing FaceBase data with publications and other knowledge resources in the field, and socializing the craniofacial research community to the practice and importance of data citation, thus promoting data as a key contribution to science. The FaceBase platform has adopted best practices on research resource identifiers (Madduri et al, 2019), the BDBag semantic information exchange format with ability to describe data and its provenance (Chard et al, 2016), widely used vocabularies for clear description of data, and FAIR research principles (Wilkinson et al, 2016).…”
Section: Transforming Scholarly Communication
Citation type: mentioning
confidence: 99%
“…Our primary goal as Microbiology Resource Announcements (MRA) editors is to ensure that a manuscript’s techniques and protocols are thoroughly documented so that readers can understand the strengths and weaknesses not only of a particular genome assembly but also the underlying raw data. Given the importance of clarity of workflows and reproducibility of data in validating scientific results (13), we want to ensure that all of the relevant data contributing to an assembly are available for other researchers so that they can (i) reproduce the study’s results, (ii) elaborate and incorporate the available data into other genome assemblies, or (iii) repurpose public data for use in alternative analyses. While many of these current best practices have been incorporated into the Instructions to Authors, in this opinion piece, we aim to provide a set of thematic ideas and examples behind certain instructions for authors to increase reproducibility across groups and utility for future users.…”
Section: Editorial
Citation type: mentioning
confidence: 99%