2016 IEEE International Conference on Big Data (Big Data) 2016
DOI: 10.1109/bigdata.2016.7840618
I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets

Abstract: Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and…

Cited by 44 publications (39 citation statements)
References 22 publications
“…This includes raw data and search results from each of the included experiments, listed with links at http://www.peptideatlas.org/hupo/hppp/repository/. The data are available packaged in BDBags that are uniquely identified with Minids (37). BDBags are compressed archives that contain embedded manifests and checksums that enable automated validation of completeness against the checksums.…”
Section: The Human Plasma PeptideAtlas (2017)
Mentioning, confidence: 99%
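The statement above turns on the fact that a BagIt-style bag carries its own manifests. The following is a minimal sketch of that completeness-and-fixity check in Python, using the standard bagit library that BDBag builds on; the bag directory name is a hypothetical example, not part of the cited dataset.

import bagit

# Open a previously downloaded bag; "hppp_2017_bag" is a hypothetical path.
bag = bagit.Bag("hppp_2017_bag")

# fast=True only checks that every listed payload file is present with the
# expected total size (Payload-Oxum); fast=False also recomputes checksums
# and compares them against the embedded manifests.
try:
    bag.validate(fast=False)
    print("bag is complete and every checksum matches")
except bagit.BagValidationError as err:
    print("validation failed:", err)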
“…Agents may be co-located with ERMrest or distributed to remote servers or the cloud. And 2) BDBag [9] asset export allows the user to bundle data collections for use in other analysis tools, such as Python, R, or platforms such as Galaxy.…”
Section: The Deriva Platform
Mentioning, confidence: 99%
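As a rough illustration of the export path mentioned above, the sketch below unpacks an exported bag archive and reads its tabular payload using only the Python standard library; the archive name and CSV layout are hypothetical, and other tools such as R or Galaxy would consume the same directory structure.

import csv
import zipfile
from pathlib import Path

archive = "exported_collection.zip"   # hypothetical export produced by the platform
workdir = Path("bag_workdir")

# An archived bag is an ordinary zip (or tar) file whose single top-level
# directory is a BagIt bag; the actual data sit under <bag>/data/.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(workdir)

bag_root = next(p for p in workdir.iterdir() if p.is_dir())
payload = bag_root / "data"

# Read whatever tabular metadata the export placed in the payload.
for table in sorted(payload.glob("*.csv")):
    with open(table, newline="") as f:
        rows = list(csv.DictReader(f))
    print(table.name, "->", len(rows), "rows")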
“…All assets and data are accessible, subject to the community access control policy (‘A’). The metadata can be exported in standard CSV and BagIt [9] formats, while the digital assets are submitted and made available in standard formats defined by the community (‘I’, ‘R’).…”
Section: Deriva and “FAIR” Guidelines
Mentioning, confidence: 99%
“…The input data from ENCODE consisted of all available DNAse Hypersensitivity (DHS) datasets from 27 tissue types. ENCODE provides metadata for each tissue type which was exported and included in the exported BDBag (Chard et al, 2016). BDBag is a format for defining a dataset and its contents by enumerating the data elements, regardless of their location (enumeration, fixity and distribution) and metadata.…”
Section: Methods
Mentioning, confidence: 99%
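The "enumeration, fixity and distribution" phrasing above maps onto two small text files inside every bag. The sketch below simply reads them; the bag directory is a hypothetical stand-in for the exported ENCODE DHS bag.

from pathlib import Path

bag = Path("encode_dhs_bag")   # hypothetical bag directory

# Distribution/enumeration: fetch.txt lists elements the bag references but
# does not physically contain, one "URL length bag-relative-path" per line.
for line in (bag / "fetch.txt").read_text().splitlines():
    url, length, filename = line.split(None, 2)
    print(f"{filename}: {length} bytes at {url}")

# Fixity: the manifest pins every payload file, local or remote, to a checksum.
for line in (bag / "manifest-sha256.txt").read_text().splitlines():
    digest, filename = line.split(None, 1)
    print(f"{filename} -> sha256 {digest[:12]}...")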
“…For each tissue type, we started with the fastq files (851 files) available at https://www.encodeproject.org. These files were encapsulated within a BDBag that captured, in an unambiguous manner, references to the raw data alongside complete metadata for processing (Chard et al, 2016). Some ENCODE experiments contain multiple biological samples, while others may contain only a single sample.…”
Section: Methods
Mentioning, confidence: 99%
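To make the "references to the raw data" concrete: a consumer of such a bag still has to download the referenced fastq files before analysis. A minimal sketch of that step using the bdbag Python package follows; the function names are taken from its API module, but the exact signatures and the bag path are assumptions rather than a verbatim account of the authors' pipeline.

from bdbag import bdbag_api

bag_path = "encode_fastq_bag"   # hypothetical local copy of the exported bag

# Download everything listed in fetch.txt into the bag's data/ directory.
bdbag_api.resolve_fetch(bag_path)

# A full (non-fast) validation recomputes checksums, so success means the
# fetched raw data matches exactly what the bag's creator enumerated.
bdbag_api.validate_bag(bag_path, fast=False)
print("all referenced fastq files fetched and verified")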