The first human genome was sequenced around the turn of the millennium. Since then, modern biology has made great progress, thanks in part to the introduction of next-generation sequencing in the mid-2000s. The growing availability of genomic data led to the birth of tertiary analysis, which concerns sense-making and the extraction of useful biological information. To deal with data heterogeneity, many tools for genomic data integration have been introduced in the last decade, among them the Genomic Conceptual Model (GCM) and the META-BASE architecture. The latter maps data from many projects into the GCM through an integration pipeline. In this work, we propose an extension of the GCM to integrate two additional sources into the META-BASE architecture: the GWAS Catalog (curated by the NHGRI and EBI institutes) and FinnGen (curated by the University of Helsinki). Both sources host Genome-Wide Association Studies (GWAS), which help explain the connection between single-nucleotide genome variations and particular traits. The two sources are organized according to different data models but share the same data semantics. As a result of our integration efforts, we enable the interoperable use and querying of GWAS datasets together with several other genomic datasets (including TCGA, ENCODE, Roadmap Epigenomics, 1000 Genomes Project, and GENCODE).
Background: Genome-Wide Association Studies (GWAS) are based on the observation of genome-wide sets of genetic variants, typically single-nucleotide polymorphisms (SNPs), across different individuals, in order to identify variants that are associated with phenotypic traits. Research efforts have so far been directed to improving GWAS techniques rather than to making the results of GWAS interoperable with other genomic signals; such interoperability is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions.
Results: To practically facilitate integrative use, we propose to include GWAS datasets within the META-BASE repository, which holds several heterogeneous data types in the same format, queryable from the same systems. To do so, we exploit an integration pipeline previously studied for other genomic datasets. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort finally allows us to use these datasets within multi-sample processing queries that respond to important biological questions. The datasets are thus made usable for multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, and epigenetic signals.
Conclusions: As a result of our work on GWAS datasets, we enable 1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository; 2) their big data processing by means of the GenoMetric Query Language and associated system. Future large-scale tertiary data analysis may benefit extensively from the addition of GWAS results to inform several different downstream analysis workflows.
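As a purely illustrative sketch of what "different data models, same data semantics" can mean in practice, the Python snippet below maps two differently shaped GWAS association records onto one common region-plus-attributes record. It is not the META-BASE pipeline and it does not reproduce the actual GWAS Catalog or FinnGen schemas. All field names (`chromosome`, `position`, `trait`, `p_value`, `chrom`, `pos`, `phenotype`, `pval`), the `Region` class, and the 0-based half-open coordinate convention are assumptions made only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical common record: a genomic interval plus free-form attributes."""
    chrom: str
    start: int  # assumed 0-based, inclusive
    stop: int   # assumed exclusive
    attrs: dict = field(default_factory=dict)


def from_layout_a(row: dict) -> Region:
    """Map a record in hypothetical layout A (1-based position, bare chromosome name)."""
    pos = int(row["position"])
    return Region(
        chrom=f"chr{row['chromosome']}",
        start=pos - 1,
        stop=pos,
        attrs={"trait": row["trait"], "p_value": float(row["p_value"])},
    )


def from_layout_b(row: dict) -> Region:
    """Map a record in hypothetical layout B (same semantics, different field names)."""
    pos = int(row["pos"])
    return Region(
        chrom=row["chrom"],
        start=pos - 1,
        stop=pos,
        attrs={"trait": row["phenotype"], "p_value": float(row["pval"])},
    )


# Two made-up records describing the same association in the two layouts.
record_a = {"chromosome": "7", "position": "1000", "trait": "example trait", "p_value": "1e-9"}
record_b = {"chrom": "chr7", "pos": 1000, "phenotype": "example trait", "pval": 1e-9}

region_a = from_layout_a(record_a)
region_b = from_layout_b(record_b)
```

Once both sources are expressed in the same record shape, downstream queries no longer need to know which source a region came from; in this toy case the two mapped records compare equal.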