2021
DOI: 10.1101/2021.02.26.433129
Preprint
Representation learning applications in biological sequence analysis

Abstract: Remarkable advances in high-throughput sequencing have resulted in rapid data accumulation, and analyzing biological (DNA/RNA/protein) sequences to discover new insights in biology has become more critical and challenging. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention, because biological sequences are regarded as sentences and k-mers in these sequences as words. Embedding is an essential step in NLP, which converts wo…
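The sentence/word analogy in the abstract can be made concrete with a minimal sketch of overlapping k-mer tokenization (the function name and the choice of k = 3 are illustrative, not taken from the paper):

```python
def kmer_tokenize(sequence, k=3):
    """Split a biological sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A DNA sequence becomes a "sentence" of 3-mer "words":
print(kmer_tokenize("ATGCGA"))  # ['ATG', 'TGC', 'GCG', 'CGA']
```

The resulting token list can then be fed to any NLP embedding method in place of natural-language words.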


Cited by 13 publications (13 citation statements)
References 84 publications
“…Analyzing and deciphering biological sequences plays a critical role in gaining a deeper understanding of biological systems. Recent major advances in artificial intelligence have sparked ample interest in adopting natural language processing (NLP) models to extract hidden insights from biological sequences [1]. By reinterpreting protein sequences as sentences and k-mers in these sequences as words, researchers have succeeded in establishing computational methods to represent the language of life.…”
Section: Introduction
confidence: 99%
“…Additionally, BERT, which essentially consists of stacked Transformer encoder layers, shows enhanced performance in downstream task-specific predictions after pre-training on a massive dataset (Devlin et al., 2019). In the field of bioinformatics, several BERT architectures pre-trained on a massive corpus of protein sequences have recently been proposed, demonstrating their capability to decode the context of biological sequences (Rao et al., 2019; Rives et al., 2021; Elnaggar et al., 2021; Iuchi et al., 2021). In comparison to these protein language models, Ji et al. (2021) pre-trained a BERT model, named DNABERT, on the whole human reference genome and demonstrated its broad applicability for predicting promoter regions, splicing sites, and transcription factor binding sites upon fine-tuning.…”
Section: Introduction
confidence: 99%
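The masked-token pre-training objective behind BERT-style models such as DNABERT can be sketched for k-mer tokens as follows (a minimal illustration, assuming the 15% masking rate from Devlin et al.; the helper name and `[MASK]` placeholder are conventions, not the paper's own code):

```python
import random

def mask_kmers(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """BERT-style masking: hide a fraction of k-mer tokens; during
    pre-training the model learns to recover the hidden tokens from
    their surrounding sequence context."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)  # hidden from the model
            targets.append(tok)        # prediction target
        else:
            masked.append(tok)
            targets.append(None)       # no loss computed here
    return masked, targets
```

After pre-training with this objective on a large sequence corpus, the encoder is fine-tuned on labeled data for tasks such as promoter or binding-site prediction.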