Abstract: Mutations in genomes can indicate a predisposition for diseases or affect the efficacy of drugs. A variant calling algorithm determines possible mutations in sample genomes. Afterwards, scientists have to decide on the impact of these mutations. Many different variant calling algorithms exist, and they generate different outputs depending on the sequence alignments used as input and on their parameterization. Thus, combining variant calling results is necessary to provide a more complete set of mutations than a single algorithm run can provide. Therefore, a system is required that facilitates the integration and parameterization of different variant calling algorithms and the processing of different sequence alignments. Moreover, against the backdrop of ever-increasing amounts of available genome sequencing data, such a system must provide mature database management capabilities to enable flexible and efficient analyses while keeping data consistent. In this paper, we present a first approach to integrating variant calling into a main-memory database management system that allows for calling variants via SQL.
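To make the idea of calling variants via SQL concrete, the following is a minimal sketch and not the system described in the abstract. It assumes a hypothetical two-table schema (`reference` and `aligned_base`), uses an in-memory SQLite database as a stand-in for a main-memory DBMS, and applies a deliberately naive rule: a position is reported when a single non-reference base makes up at least `min_fraction` of the reads covering it and the coverage is at least `min_depth`. The function name, table names, and thresholds are all assumptions chosen for illustration.

```python
import sqlite3


def call_variants(aligned_bases, reference, min_depth=3, min_fraction=0.8):
    """Return (pos, ref_base, alt_base, depth) rows for a naive variant call.

    aligned_bases: iterable of (read_id, pos, base) tuples, positions 1-based.
    reference:     reference sequence as a string.
    Schema and thresholds are illustrative assumptions, not taken from the paper.
    """
    con = sqlite3.connect(":memory:")  # in-memory stand-in for a main-memory DBMS
    con.execute("CREATE TABLE reference (pos INTEGER PRIMARY KEY, base TEXT)")
    con.execute("CREATE TABLE aligned_base (read_id INTEGER, pos INTEGER, base TEXT)")
    con.executemany("INSERT INTO reference VALUES (?, ?)",
                    enumerate(reference, start=1))
    con.executemany("INSERT INTO aligned_base VALUES (?, ?, ?)", aligned_bases)

    # Count each base per position, compute coverage, and keep positions
    # where one non-reference base dominates the reads.
    query = """
        WITH counts AS (
            SELECT pos, base, COUNT(*) AS n
            FROM aligned_base
            GROUP BY pos, base
        ),
        depth AS (
            SELECT pos, SUM(n) AS total
            FROM counts
            GROUP BY pos
        )
        SELECT c.pos, r.base, c.base, d.total
        FROM counts AS c
        JOIN depth AS d ON d.pos = c.pos
        JOIN reference AS r ON r.pos = c.pos
        WHERE c.base <> r.base
          AND d.total >= :min_depth
          AND c.n >= :min_fraction * d.total
        ORDER BY c.pos
    """
    rows = con.execute(
        query, {"min_depth": min_depth, "min_fraction": min_fraction}
    ).fetchall()
    con.close()
    return rows
```

Because the calling rule lives in a declarative query, changing the parameterization (for example, lowering `min_fraction`) or running the same query over a different alignment table needs no change to the surrounding code, which is the kind of flexibility the abstract argues for. Real variant callers use base and mapping qualities and statistical models that this sketch omits.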
Improvements in DNA sequencing technologies make it possible to sequence complete human genomes in a short time and at acceptable cost. Hence, the vision of genome analysis as a standard procedure to support and improve medical treatment is coming within reach. In this vision paper, we describe important data-management challenges that have to be met to make this vision come true. Besides genome-analysis performance, data-management capabilities such as data provenance and data integrity are becoming increasingly important to enable comprehensible and reliable genome analysis. We argue that these challenges can be met by using main-memory database technologies, which combine fast processing capabilities with extensive data-management capabilities. Finally, we discuss possibilities for integrating genome-analysis tasks into DBMSs and derive new research questions.
Mutations in genomes can indicate a predisposition for diseases such as cancer or cardiovascular disorders. Genome analysis is an established procedure to determine mutations and deduce their impact on living organisms. The first step in genome analysis is DNA sequencing, which makes the biochemically stored hereditary information in DNA digitally readable. The cost and time to sequence a whole genome are decreasing rapidly, which leads to an increase in available raw genome data that must be stored and integrated to be analyzed. Damming this flood of genome data requires efficient and effective analysis as well as data management solutions. The state of the art in genome analysis consists of flat-file-based storage and analysis solutions. Consequently, every analysis application is responsible for managing data on its own, which leads to implementation and process overhead. Database systems have already shown their ability to reduce data management overhead for analysis applications in various domains. However, current approaches using relational database systems for genome-data management lack scalable performance on increasing amounts of genome data. In this thesis, we investigate the capabilities of relational main-memory database systems to store and query genome data efficiently, while enabling flexible data access.