Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1285

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Abstract: Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Tr…
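The segment-level recurrence mentioned in the abstract can be sketched in a few lines of PyTorch. This is a hedged illustration, not the authors' released code: standard multi-head attention stands in for the paper's relative-position attention, and names such as SegmentRecurrentAttention and mem_len are illustrative.

```python
# Minimal sketch of segment-level recurrence: hidden states of the previous
# segment are cached, detached from the graph, and prepended to the current
# segment so attention can reach across the segment boundary.
from typing import Optional

import torch
import torch.nn as nn


class SegmentRecurrentAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int, mem_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mem_len = mem_len

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor] = None):
        # x: (batch, seg_len, d_model); memory: (batch, <=mem_len, d_model) or None
        context = x if memory is None else torch.cat([memory, x], dim=1)
        # Queries come from the current segment only; keys and values also
        # cover the cached memory, so the usable context grows beyond seg_len.
        out, _ = self.attn(query=x, key=context, value=context, need_weights=False)
        # Cache the most recent mem_len states, detached so gradients never
        # flow across segment boundaries (as described in the paper).
        new_memory = context[:, -self.mem_len:].detach()
        return out, new_memory


# Usage: process a long sequence segment by segment, carrying the memory along.
layer = SegmentRecurrentAttention(d_model=64, n_head=4, mem_len=16)
memory = None
for segment in torch.randn(8, 2, 16, 64):  # 8 segments, batch 2, length 16
    out, memory = layer(segment, memory)
```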

Cited by 2,337 publications (1,636 citation statements); References: 22 publications.

Citation statements (ordered by relevance):
“…Most noticeable is the distributed encodings of hidden states using deep neural networks, which can learn to compress, understand, and produce sentences in fluent English (Mikolov et al., 2010; Merity et al., 2017). Current state-of-the-art within language modelling is based on attention architectures (Bahdanau et al., 2014; Vaswani et al., 2017; Dai et al., 2019) and the access to immense computing resources and large datasets (Radford et al., 2018, 2019). It has been found that these large language models have a profound impact on generating contextual embeddings for NLP tasks (Peters et al., 2018; Radford et al., 2018).…”
Section: Related Work (mentioning)
confidence: 99%
“…Historically, perplexity has been the preferred method to evaluate model performance in the language modelling literature (Jurafsky and Martin, 2009; Dai et al., 2019; Merity et al., 2017). Perplexity is the exponential of the average negative log-likelihood and measures how well a language model predicts a sequence of amino acids.…”
Section: Evaluation Metric (mentioning)
confidence: 99%
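As a concrete, hedged illustration of the metric described in the excerpt above: perplexity is the exponential of the average negative log-likelihood the model assigns to a sequence. The function below is a minimal sketch with illustrative names, not code from any of the cited papers.

```python
# Illustrative perplexity computation over a token (here, amino-acid) sequence.
import math

def perplexity(token_probs: list) -> float:
    """token_probs: model probability assigned to each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A uniform model over a 20-letter amino-acid alphabet assigns p = 1/20
# to every residue, giving a perplexity of approximately 20.
print(perplexity([1 / 20] * 100))  # ~20.0
```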
“…The original structure is constructed on the hidden states generated by an RNN in order to better capture long-term dependence and align the output for the RNN decoder. The Transformer model [42] and several follow-up works [8, 10, 32] showed that for many NLP tasks, a sequence-to-sequence network structure based on attention alone, a.k.a. the self-attention mechanism, is able to outperform existing RNN structures in both accuracy and computational complexity on long sequences.…”
Section: Sequential Recommendation (mentioning)
confidence: 99%
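The self-attention mechanism the excerpt contrasts with RNNs reduces to scaled dot-product attention (Vaswani et al., 2017). The sketch below is a deliberately stripped-down illustration in which queries, keys, and values are the inputs themselves (real models use learned projections); the function name is illustrative.

```python
# Minimal scaled dot-product self-attention: every position attends to
# every other position in one step, regardless of distance.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, d_model); queries, keys, and values are all x."""
    d_model = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_model ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # attention distribution per query
    return weights @ x                                 # weighted sum of values

out = self_attention(torch.randn(2, 16, 64))  # one pass covers all pairwise interactions
```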
“…More recently, we have introduced a transformer-based model for genome annotation tasks, and have achieved state-of-the-art results for the annotation of TSSs, translation initiation sites and methylation sites [6]. The introduced architecture is adapted from the transformer-XL [7], first introduced in the field of natural language processing, and is well suited to process the long nucleotide sequence data. As the transformer architecture does not imply the relative positions of the inputs w.r.t.…”
Section: Introduction (mentioning)
confidence: 99%
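The relative-position issue raised in the last excerpt is what Transformer-XL's positional encoding scheme addresses: attention scores receive a term that depends on the distance between query and key rather than on absolute positions. The sketch below simplifies the paper's sinusoid-plus-bias formulation to a single learned bias per relative distance; class and parameter names are illustrative.

```python
# Hedged sketch of relative-position-aware attention in the spirit of
# Transformer-XL: scores between two positions get a term that depends
# only on their distance, so cached states can be reused across segments.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeBiasAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.scale = d_model ** -0.5
        self.max_len = max_len
        # One learnable bias for each relative distance in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = q.size(1)
        idx = torch.arange(seq_len)
        rel = idx[None, :] - idx[:, None] + self.max_len - 1  # distance -> bias index
        scores = (q @ k.transpose(-2, -1)) * self.scale + self.rel_bias[rel]
        return F.softmax(scores, dim=-1) @ v
```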