With recent advances in DNA sequencing technologies, researchers are able to acquire increasingly larger volumes of genomic datasets, enabling the training of powerful models for downstream genomic tasks. However, genome scale dataset often contain many missing values, decreasing the accuracy and power in drawing robust conclusions drawn in genomic analysis. Consequently, imputation of missing information by statistical and machine learning methods has become important. We show that the current state-of-the-art can be advanced significantly by applying a novel variation of the Transformer architecture, called Split-Transformer Impute (STI), coupled with improved preprocessing of data input into deep learning models. We performed extensive experiments to benchmark STI against existing methods using resequencing datasets from human 1000 Genomes Project and yeast genomes. Results establish superior performance of our new methods compared to competing genotype imputation methods in terms of accuracy and imputation quality score in the benchmark datasets.
(1) Background: Phenotype prediction is a pivotal task in genetics in order to identify how genetic factors contribute to phenotypic differences. This field has seen extensive research, with numerous methods proposed for predicting phenotypes. Nevertheless, the intricate relationship between genotypes and complex phenotypes, including common diseases, has resulted in an ongoing challenge to accurately decipher the genetic contribution. (2) Results: In this study, we propose a novel feature selection framework for phenotype prediction utilizing a genetic algorithm (FSF-GA) that effectively reduces the feature space to identify genotypes contributing to phenotype prediction. We provide a comprehensive vignette of our method and conduct extensive experiments using a widely used yeast dataset. (3) Conclusions: Our experimental results show that our proposed FSF-GA method delivers comparable phenotype prediction performance as compared to baseline methods, while providing features selected for predicting phenotypes. These selected feature sets can be used to interpret the underlying genetic architecture that contributes to phenotypic variation.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.