2021
DOI: 10.1093/bioinformatics/btab083

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Abstract: Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT…

Citations: cited by 434 publications (536 citation statements)
References: 53 publications
Citation types: 1 supporting, 429 mentioning, 0 contrasting
“…1). Transformers are a class of deep learning models that have achieved substantial breakthroughs in natural language processing (NLP) 6,7 and were also recently applied to model short DNA sequences 8 . They consist of attention layers that transform each position in the input sequence by computing a weighted sum across the representations of all other positions in the sequence.…”
Section: Results (mentioning)
confidence: 99%
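The weighted sum across positions described in that citation statement can be sketched compactly. Below is a minimal single-head scaled dot-product self-attention example in NumPy; the function name, the projection matrices Wq/Wk/Wv, and the toy input are illustrative assumptions, not taken from DNABERT or any cited model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (sketch).

    Each output position is a weighted sum over the value vectors of all
    positions; the weights come from query-key similarity.
    X:          (seq_len, d_model) input embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project inputs
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted sum across all positions

# Toy usage: a "sequence" of 6 positions with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 4): one transformed representation per position
```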
“…Supervised deep learning methods for the prediction of TF occupancy data and chromatin accessibility are numerous, ranging from early deep convolutional neural network-based pipelines such as DeepSEA 10 and Basset 9 to more recent approaches that usually mirror advances in deep learning methods for natural language processing, such as the LSTM-based DanQ 23, Basenji using dilated CNNs 11, DeepSite 24, and DNABERT 25. While these models have produced highly accurate predictions of TF occupancy, the interpretation of models requires detailed feature attribution (e.g.…”
Section: Discussion (mentioning)
confidence: 99%
“…DNABERT, in contrast, is currently the only model to pre-train BERT-based models using a whole human reference genome [68]. During preprocessing, the genome, with gaps and unannotated regions excluded, was split into non-overlapping sequences of 5 to 510 consecutive nucleotides and subsequently converted to 3- to 6-mer representations.…”
Section: Survey of Representation Learning Applications in Sequence A… (mentioning)
confidence: 99%
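The k-mer conversion step mentioned in that citation statement can be illustrated with a short sketch. The function name and the stride parameter below are hypothetical; only the idea of turning a DNA sequence into 3- to 6-mer tokens via a sliding window is taken from the description above, and the overlapping (stride-1) default reflects DNABERT's published tokenization.

```python
def seq_to_kmers(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Convert a DNA sequence into a list of k-mer tokens (sketch).

    A window of length k slides across the sequence; with stride=1 the
    resulting k-mers overlap. k is fixed per model, between 3 and 6.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Toy usage on a short fragment
print(seq_to_kmers("ATGCATGC", k=3))
# ['ATG', 'TGC', 'GCA', 'CAT', 'ATG', 'TGC']
```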