TCGA2BED: extracting, extending, integrating, and querying The Cancer Genome Atlas

Lecture Notes in Computer Science

Campi

et al. 2017

Self Cite

Many repositories of open data for genomics, collected by worldwide consortia, are important enablers of biological research; moreover, all experimental datasets leading to publications in genomics must be deposited in public repositories and made available to the research community. These datasets are typically used by biologists for validating or enriching their experiments; their content is documented by metadata. However, the emphasis on data sharing is not matched by accuracy in data documentation; metadata are not standardized across the sources and are often unstructured and incomplete. In this paper, we propose a conceptual model of genomic metadata, whose purpose is to query the underlying data sources for locating relevant experimental datasets. First, we analyze the most typical metadata attributes of genomic sources and define their semantic properties. Then, we use a top-down method for building a global-as-view integrated schema, by abstracting the most important conceptual properties of genomic sources. Finally, we describe the validation of the conceptual model by mapping it to three well-known data sources: TCGA, ENCODE, and Gene Expression Omnibus.

Section: Discussion (mentioning)

confidence: 99%

Section: Source-specific Views of GCM (mentioning)

confidence: 99%

Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data

Bernasconi

Lecture Notes in Computer Science

Campi

et al. 2017

Self Cite

“…Data files available at the sources are transformed to a same representation, called the Genomic Data Model, GDM [12], which essentially forces every data type used by the data files to become a mapping from regions to a data type-specific feature vector. Format transformations come as the result of significant efforts: for instance, the transformation of TCGA-supported data types to GDM is a long process, with several syntactic and semantic transformations (see TCGA2BED [13]). Metadata, i.e.…”

Section: GeCo Resources (mentioning)

confidence: 99%

Data Science for Genomic Data Management: Challenges, Resources, Experiences

Pinoli

2019

SN COMPUT. SCI.

Self Cite

We highlight several challenges faced by data scientists who use public datasets to solve biological and clinical problems. Despite the large efforts in building such public datasets, they are dispersed over many sources and heterogeneous in their formats and sequencing/calling techniques, often meeting highly variable quality standards. Moreover, for most research questions, scientists rarely find datasets with enough samples for building and training machine learning models. Data scarcity stems from the complexity of the genomic domain, with its many facets, as well as from the lack of organic initiatives to provide standardization across communities. In this paper, we discuss our approach to genomic data management, which can substantially mitigate the problems of data dispersion and format heterogeneity through high-level abstractions for genomics. We briefly present the computational resources that were recently developed by the GeCo project (ERC Advanced Grant); they include GDM, a Genomic Data Model providing interoperability across data formats; GMQL, a genometric query language for answering data science queries over genomic datasets; and an in-house integrated repository providing attribute-based and keyword-based search over normalized metadata from several open data repositories. We describe these resources at work on a specific research question, and we highlight how we managed to produce a model for addressing that research question by overcoming the lack of sufficient samples and labelled datasets.

“…The use of a high-level model and language, such as GDM and GMQL, is the ideal setting for provisioning next generation services over data collected and integrated from these and other repositories, improving over the current state-of-the-art. We already started to work towards an integrated repository: in [7] we discussed a conceptual representation of metadata, where we presented a minimal conceptual schema that includes data typically found in all platforms, albeit with different names and formats; in [12] we discussed the transformation of TCGA datasets into BED format, which is quite similar to GDM.…”

Section: Repository (mentioning)

confidence: 99%

Overview of GeCo: A Project for Exploring and Integrating Signals from the Genome

Communications in Computer and Information Science

Bernasconi

Canakoglu

et al. 2018

Self Cite

Next Generation Sequencing is a 10-year-old technology for reading DNA, capable of producing massive amounts of genomic data and, in turn, reshaping genomic computing. In particular, tertiary data analysis is concerned with the integration of heterogeneous regions of the genome; this is an emerging and increasingly important problem of genomic computing, because regions carry important signals and the creation of new biological or clinical knowledge requires the integration of these signals into meaningful messages. We specifically focus on how the GeCo project is contributing to tertiary data analysis, by overviewing the main results of the project so far and by describing its future scenarios.