2020
DOI: 10.1101/2020.09.18.303842
Preprint

Orchestrating and sharing large multimodal data for transparent and reproducible research

Abstract: Reproducibility is essential to Open Science, as there is limited relevance for findings that cannot be reproduced by independent research groups, regardless of their validity. It is therefore crucial for scientists to describe their experiments in sufficient detail so they can be reproduced, challenged, and built upon. However, due to recent advances in the biological and computational sciences, it has become difficult to process, analyze, and share data with the community in a manner that is transparent. This h…

Cited by 6 publications (6 citation statements)
References 69 publications
“…This includes adhering to the FAIR data principles, along with ensuring that there is a standardized manner in which the pipelines are executed and the data is hosted. To address this important issue, we used ORCESTRA (https://orcestra.ca/), a platform that allows researchers to process biomedical data into unified data objects in a reproducible and transparent manner, where data provenance is tracked (26). At the heart of ORCESTRA is Pachyderm (https://www.pachyderm.com/), an open-source data versioning tool used to execute pipelines processing the molecular and compound screening data, and packaging the datasets into R objects called PharmacoSets (PSets), which are implemented by the PharmacoGx package (27).…”
Section: Implementation Of Reproducible Pipelines (mentioning)
confidence: 99%
“…We obtained all cell line datasets from the ORCESTRA platform (Mammoliti et al 2020) which stores pharmacogenomics datasets in PharmacoSet (PSet) R objects. Samples with missing values were removed from both the gene expression and drug response data.…”
Section: Data Preprocessing (mentioning)
confidence: 99%
“…
• The Cancer Therapeutics Response Portal (CTRPv2) [1,2]
• The Genentech Cell Line Screening Initiative (gCSI) [10,11]
• The Genomics of Drug Sensitivity in Cancer (GDSCv1 and GDSCv2) [3,4]
We obtained these datasets in the format of PharmacoSet (PSet) which is an R-based data structure that aids in reproducible research for drug sensitivity prediction. PSets are obtained via the ORCESTRA platform (orcestra.ca) [23]. The molecular profiles (RNA-seq) were preprocessed via Kallisto 0.46.1 [50] using GENCODE v33 transcriptome as the reference and the pharmacological profiles (AAC and IC50) were preprocessed and recomputed via PharmacoGx package [49].…”
Section: Datasets (mentioning)
confidence: 99%
“…We picked GDSCv1 as the competitor because it is the most common training dataset (Figure 2). GDSCv1 utilizes a different drug screening assay compared to the other datasets and for the majority of the drugs it has a smaller sample size [23]. Our hypothesis is that models trained on CTRPv2 are more generalizable because it utilizes the same assay as other datasets and also has a relatively larger sample size.…”
Section: Experimental Design (mentioning)
confidence: 99%