2016 IEEE International Conference on Big Data (Big Data) 2016
DOI: 10.1109/bigdata.2016.7840618
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets

Abstract: Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and…

Cited by 44 publications (39 citation statements)
References 22 publications
“…This includes raw data and search results from each of the included experiments, listed with links at http://www.peptideatlas.org/hupo/hppp/repository/. The data are available packaged in BDBags that are uniquely identified with Minids (37). BDBags are compressed archives that contain embedded manifests and checksums that enable automated validation of completeness against the checksums.…”
Section: The Human Plasma PeptideAtlas (2017)
Mentioning, confidence: 99%
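The statement above turns on the fact that a BagIt-style bag carries its own manifests. The following is a minimal sketch of that completeness-and-fixity check in Python, using the standard bagit library that BDBag builds on; the bag directory name is a hypothetical example, not part of the cited dataset.

import bagit

# Open a previously downloaded bag; "hppp_2017_bag" is a hypothetical path.
bag = bagit.Bag("hppp_2017_bag")

# fast=True only checks that every listed payload file is present with the
# expected total size (Payload-Oxum); fast=False also recomputes checksums
# and compares them against the embedded manifests.
try:
    bag.validate(fast=False)
    print("bag is complete and every checksum matches")
except bagit.BagValidationError as err:
    print("validation failed:", err)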
“…Agents may be co-located with ERMrest or distributed to remote servers or the cloud. And 2) BDBag [9] asset export allows the user to bundle data collections for use in other analysis tools, such as Python, R, or platforms such as Galaxy.…”
Section: The Deriva Platform
Mentioning, confidence: 99%
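As a rough illustration of the export path mentioned above, the sketch below unpacks an exported bag archive and reads its tabular payload using only the Python standard library; the archive name and CSV layout are hypothetical, and other tools such as R or Galaxy would consume the same directory structure.

import csv
import zipfile
from pathlib import Path

archive = "exported_collection.zip"   # hypothetical export produced by the platform
workdir = Path("bag_workdir")

# An archived bag is an ordinary zip (or tar) file whose single top-level
# directory is a BagIt bag; the actual data sit under <bag>/data/.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(workdir)

bag_root = next(p for p in workdir.iterdir() if p.is_dir())
payload = bag_root / "data"

# Read whatever tabular metadata the export placed in the payload.
for table in sorted(payload.glob("*.csv")):
    with open(table, newline="") as f:
        rows = list(csv.DictReader(f))
    print(table.name, "->", len(rows), "rows")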
“…All assets and data are accessible, subject to the community access control policy (‘A’). The metadata can be exported in standard CSV and BagIt [9] formats, while the digital assets are submitted and made available in standard formats defined by the community (‘I’, ‘R’).…”
Section: Deriva and “FAIR” Guidelines
Mentioning, confidence: 99%
“…The input data from ENCODE consisted of all available DNAse Hypersensitivity (DHS) datasets from 27 tissue types. ENCODE provides metadata for each tissue type which was exported and included in the exported BDBag (Chard et al, 2016). BDBag is a format for defining a dataset and its contents by enumerating the data elements, regardless of their location (enumeration, fixity and distribution) and metadata.…”
Section: Methods
Mentioning, confidence: 99%
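The "enumeration, fixity and distribution" phrasing above maps onto two small text files inside every bag. The sketch below simply reads them; the bag directory is a hypothetical stand-in for the exported ENCODE DHS bag.

from pathlib import Path

bag = Path("encode_dhs_bag")   # hypothetical bag directory

# Distribution/enumeration: fetch.txt lists elements the bag references but
# does not physically contain, one "URL length bag-relative-path" per line.
for line in (bag / "fetch.txt").read_text().splitlines():
    url, length, filename = line.split(None, 2)
    print(f"{filename}: {length} bytes at {url}")

# Fixity: the manifest pins every payload file, local or remote, to a checksum.
for line in (bag / "manifest-sha256.txt").read_text().splitlines():
    digest, filename = line.split(None, 1)
    print(f"{filename} -> sha256 {digest[:12]}...")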
“…For each tissue type, we started with the fastq files (851 files) available at https://www.encodeproject.org. These files were encapsulated within a BDBag that captured, in an unambiguous manner, references to the raw data alongside complete metadata for processing (Chard et al, 2016). Some ENCODE experiments contain multiple biological samples, while others may contain only a single sample.…”
Section: Methods
Mentioning, confidence: 99%
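To make the "references to the raw data" concrete: a consumer of such a bag still has to download the referenced fastq files before analysis. A minimal sketch of that step using the bdbag Python package follows; the function names are taken from its API module, but the exact signatures and the bag path are assumptions rather than a verbatim account of the authors' pipeline.

from bdbag import bdbag_api

bag_path = "encode_fastq_bag"   # hypothetical local copy of the exported bag

# Download everything listed in fetch.txt into the bag's data/ directory.
bdbag_api.resolve_fetch(bag_path)

# A full (non-fast) validation recomputes checksums, so success means the
# fetched raw data matches exactly what the bag's creator enumerated.
bdbag_api.validate_bag(bag_path, fast=False)
print("all referenced fastq files fetched and verified")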