2022
DOI: 10.1038/s41467-022-33397-4

Deciphering microbial gene function using natural language processing

Abstract: Revealing the function of uncharacterized genes is a fundamental challenge in an era of ever-increasing volumes of sequencing data. Here, we present a concept for tackling this challenge using deep learning methodologies adopted from natural language processing (NLP). We repurpose NLP algorithms to model “gene semantics” based on a biological corpus of more than 360 million microbial genes within their genomic context. We use the language models to predict functional categories for 56,617 genes and find that o…

Citations: cited by 22 publications (24 citation statements)
References: 67 publications
“…Fourth, adding non-protein modalities (e.g. noncoding regulatory elements 13) as input to gLM may also greatly improve gLM's representation of biological sequence data, and can learn protein function and regulation conditioned upon other modalities 47. One of the most powerful aspects of the transformer-based language models is their potential for transfer learning and fine-tuning.…”
Section: Discussion (mentioning)
confidence: 99%
“…Thus, there exists an inherent evolutionary linkage between genes, their genomic context, and gene function [13][14][15], which can be explored by characterizing patterns that emerge from large metagenomic datasets. Recent efforts to model genomic information have shown predictive power of genomic context in gene function 16 and metabolic trait evolution 17 in bacterial and archaeal genomes. However, these methods represent genes as categorical entities, despite these genes existing in continuous space where multidimensional properties such as phylogeny, structure, and function are abstracted in their sequences.…”
Section: Introduction (mentioning)
confidence: 99%
“…AI methods implemented range from deep learning 10 to natural language processing (NLP) tools that extract rules from the language of protein sequences in the form of vectorized embeddings (protein language models 11,12). In addition, some methods have been created to predict the function of specific proteins, such as enzymes 13 or prokaryotic viral proteins 14, while others have been successfully used to predict non-GO term-based functions in a wider context 15,16.…”
Section: Introduction (mentioning)
confidence: 99%
“…Inspired by the parallel and efficient processing of information in the biological brain, artificial neural networks (ANNs) have received a lot of attention and research and have already achieved tremendous results in fields such as autonomous vehicles, biomedicine, natural language processing, and intelligent terminals. Most ANNs encode information as real-valued vectors for computation rather than as electrical spikes like the human brain, which leads to energy inefficiencies in ANNs.…”
Section: Introduction (mentioning)
confidence: 99%