2023
DOI: 10.1101/2023.10.10.561776
Preprint

GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction

Gonzalo Benegas,
Carlos Albors,
Alan J. Aw
et al.

Abstract: Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on…

Cited by 6 publications (8 citation statements)
References 63 publications (109 reference statements)
“…Beyond evaluating perplexity, we investigated the model’s zero-shot performance on biologically relevant downstream tasks. For example, language models specifically trained on large corpuses of protein sequences or nucleotide coding sequences have demonstrated an impressive ability to predict mutational effects on protein function (Meier et al, 2021; Notin et al, 2022; Benegas et al, 2023) without any task-specific finetuning or supervision. Because Evo is trained on long genomic sequences that contain protein coding sequences, we tested whether the model would also learn the protein language well enough to perform zero-shot protein function prediction.…”
Section: Results (mentioning)
confidence: 99%
“…Though the importance of evolutionary conservation in triaging functional variants in the human genome has long been appreciated, it is becoming increasingly important as we collect larger and larger samples of human variants, the vast majority of which are extremely rare [173, 174]. In fact, recent work has shown that evolutionary conservation accounts for the vast majority of the predictive power of a state-of-the-art deep learning approach to variant annotation [169, 175]. A limitation of current approaches for utilizing evolutionary conservation is that typically there is no way to include information about the phylogenetic structure of the data (i.e., only multiple sequence alignments are used).…”
Section: Results (mentioning)
confidence: 99%
“…Indeed, the importance of evolutionary conservation in triaging functional variants in the human genome has long been appreciated and is becoming increasingly important as we collect larger samples of people; the same is true for the use of genomics in agriculture [57] and conservation genetics [55]. Recent work showed that evolutionary conservation accounts for the vast majority of the predictive power of a state-of-the-art deep learning approach to variant annotation [149, 150]. But most of the cutting-edge phylogenomic approaches for triaging variants typically do not use the phylogeny at all (i.e., only multiple sequence alignments [MSAs] are used), or include the phylogeny without an explicit evolutionary model [151].…”
Section: Discussion (mentioning)
confidence: 99%
“…For example, a common operation is computing statistical summaries on per-basepair numeric scores, e.g. from CADD [15] or language model effect predictions [2]. The Sequences trait provides a common programmatic interface to do such operations, whether computing the variance of pathogenic scores in a window or the GC content of a nucleotide sequence.…”
Section: Sequence Types (mentioning)
confidence: 99%
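
The two per-basepair summaries named in the last citation statement (the variance of scores in a window and the GC content of a nucleotide sequence) can be illustrated with a minimal Python sketch. This is not the citing paper's Sequences trait or any published API: the function names `gc_content` and `windowed_variance`, the non-overlapping window scheme, and the choice of population variance are all assumptions made for illustration.

```python
from statistics import pvariance


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T).

    Hypothetical helper: ambiguous symbols such as N are ignored,
    and a sequence with no unambiguous bases yields 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def windowed_variance(scores, window):
    """Population variance of per-basepair scores in each full,
    non-overlapping window (assumed scheme; a trailing partial
    window is dropped)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        pvariance(scores[start:start + window])
        for start in range(0, len(scores) - window + 1, window)
    ]


if __name__ == "__main__":
    # Toy inputs only; these are not real CADD or language-model scores.
    print(gc_content("ACGTNNGG"))
    print(windowed_variance([1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 0.0, 2.0], 4))
```

A sliding (overlapping) window or sample variance would be equally reasonable choices; the sketch only shows that both summaries reduce to simple operations over a sequence of bases or scores.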