2017
DOI: 10.1186/s12859-016-1419-5

TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas

Abstract: Background: Data extraction and integration methods are becoming essential to effectively access and take advantage of the huge amounts of heterogeneous genomics and clinical data increasingly available. In this work, we focus on The Cancer Genome Atlas, a comprehensive archive of tumoral data containing the results of high-throughput experiments, mainly Next Generation Sequencing, for more than 30 cancer types. Results: We propose TCGA2BED, a software tool to search and retrieve TCGA data, and convert them in the s…

Cited by 37 publications (26 citation statements)
References 27 publications
“…Our GMQL system already provides access to datasets from TCGA, ENCODE, and Roadmap Epigenomics, that were identified as the most relevant in the course of collaborative projects with many biologists; we already developed some tools for automatically importing such datasets and for converting them to an integrated format, e.g., TCGA2BED [7]. Thanks to GCM, we can also provide a coherent semantics to the metadata of integrated sources; throughout the GeCo project we plan to add more sources, according to needs of biologists, and to continuously integrate their metadata within GCM.…”
Section: Discussion (mentioning)
confidence: 99%
“…-We enclose fixed human curated values in inverted commas and use the functions notation tr, comb, and curated to describe a transformation of a source field, a combination of multiple source fields, and curated fields, respectively. Note that the Gene Expression Omnibus (GEO) source is at the same time a very rich public repository of genomic data (as most research publications include links to experimental data uploaded to GEO), but is also a very poor source of metadata, which are not well structured and often lack information; hence our mapping effort is harder and less precise for GEO than for the more organized TCGA and ENCODE sources 7 . The mapping to GEO captures as well the mapping to Roadmap Epigenomics, another relevant source of public data.…”
Section: Source-specific Views of GCM (mentioning)
confidence: 99%
“…Data files available at the sources are transformed to a same representation, called the Genomic Data Model, GDM [12], which essentially forces every data type used by the data files to become a mapping from regions to a data type-specific feature vector. Format transformations come as the result of significant efforts: for instance, the transformation of TCGA-supported data types to GDM is a long process, with several syntactic and semantic transformations (see TCGA2BED [13]). Metadata, i.e.…”
Section: GeCo Resources (mentioning)
confidence: 99%
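The statement above describes GDM as a mapping from regions to a data type-specific feature vector. A minimal sketch of that idea follows. It is not the GDM or TCGA2BED implementation: the class names `Region` and `Sample`, the method `to_bed_lines`, the column order, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """A genomic region (hypothetical structure for illustration)."""
    chrom: str
    start: int         # 0-based start, as in BED
    end: int           # end position, exclusive, as in BED
    strand: str = "."  # "+", "-", or "." when unstranded


@dataclass
class Sample:
    """Regions paired with feature vectors that share one schema."""
    schema: Tuple[str, ...]  # names of the data type-specific features
    rows: List[Tuple[Region, tuple]] = field(default_factory=list)

    def add(self, region: Region, *features) -> None:
        # Every region must carry exactly one value per schema field.
        if len(features) != len(self.schema):
            raise ValueError("feature vector does not match schema")
        self.rows.append((region, features))

    def to_bed_lines(self) -> List[str]:
        # Assumed layout: coordinates and strand first, then the features,
        # all tab-separated (a BED-like line, not a validated BED file).
        lines = []
        for region, features in self.rows:
            cols = [region.chrom, region.start, region.end, region.strand]
            cols.extend(features)
            lines.append("\t".join(str(c) for c in cols))
        return lines


# Hypothetical gene-expression sample with two features per region.
sample = Sample(schema=("gene_symbol", "expression"))
sample.add(Region("chr1", 100, 200, "+"), "GENE_A", 12.5)
for line in sample.to_bed_lines():
    print(line)
```

Keeping the schema on the sample rather than on each region reflects the "data type-specific" wording: every region in one sample carries the same fields, while a different data type would use a different schema.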
“…The use of a high-level model and language, such as GDM and GMQL, is the ideal setting for provisioning next generation services over data collected and integrated from these and other repositories, improving over the current state-of-the-art. We already started to work towards an integrated repository: in [7] we discussed a conceptual representation of metadata, where we presented a minimal conceptual schema that includes data typically found in all platforms, albeit with different names and formats; in [12] we discussed the transformation of TCGA datasets into BED format, which is quite similar to GDM.…”
Section: Repository (mentioning)
confidence: 99%