2023
DOI: 10.1101/2023.01.11.523679
Preprint

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Abstract: Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between prediction tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, integrating information from 3,202 diverse human genomes, as w…

Cited by 73 publications (182 citation statements)
References 71 publications
“…Parallel to our work, another study was released which also trains on genomes of multiple species and focuses on human for downstream tasks (Dalla-Torre et al., 2023). The proposed model is species-agnostic but more than one thousand times larger than ours.…”
Section: Discussion
confidence: 99%
“…This is where the integration of the protein language model ProtTrans has significantly enhanced the utility and effectiveness of the reelGene framework and opened up new avenues for protein functionality research. DNA language models are rapidly advancing, but currently, they are mostly pre-trained on human [54] and vertebrate genomes [55]. Models pre-trained on plant genomes may improve our mRNA and junction boundary models.…”
Section: Discussion
confidence: 99%
“…Although effective, these architectures have been outperformed by transformer architectures, as evidenced by their adoption in both the computer vision and natural language processing fields [15,16,17]. To address these limitations and exploit the transformer architecture, large language models (LLMs) have gained considerable popularity in the field of biology [18,19,20,21,22,23,24,25,26,27], offering the ability to be trained on unlabeled data and generate general-purpose representations capable of solving specific tasks. Furthermore, LLMs overcome a current limitation of other deep learning-based models, as they are not reliant on single reference genomes, which often provide an incomplete and biased depiction of genomic diversity from a limited number of individuals.…”
Section: Introduction
confidence: 99%
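To make the pattern in this excerpt concrete — pre-training on unlabeled DNA and reusing the learned representations for downstream tasks — the following is a minimal sketch of extracting sequence embeddings from a publicly released Nucleotide Transformer checkpoint via the Hugging Face transformers library. The checkpoint ID, the example sequence, and the mean-pooling step are illustrative assumptions, not the procedure used by any of the citing papers.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint ID: InstaDeep releases Nucleotide Transformer
# weights on the Hugging Face Hub; verify the exact name before use.
MODEL_ID = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# Toy DNA fragment; the tokenizer splits it into k-mer tokens.
sequence = "ATGCGTACCTGAACGTTAGCATGCGTACCTGA"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the final hidden layer over tokens to obtain one fixed-size
# embedding per sequence, usable as features for a downstream probe.
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, hidden_size])
```

Frozen, pooled embeddings of this kind are what the excerpt calls "general-purpose representations": the same features can be fed to lightweight task-specific classifiers without retraining the language model.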
“…Furthermore, LLMs overcome a current limitation of other deep learning-based models, as they are not reliant on single reference genomes, which often provide an incomplete and biased depiction of genomic diversity from a limited number of individuals. LLMs can leverage multiple reference genomes, including those from genetically distant species, thereby increasing overall diversity, which has been shown to significantly enhance prediction performance [24]. This diversity is particularly relevant in plant species due to the structural complexities of their genomes, which hinder accurate mapping of polymorphisms across whole-genome alignments.…”
Section: Introduction
confidence: 99%