Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1285

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Abstract: Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Tr…
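The segment-level recurrence mentioned in the abstract can be sketched in a few lines of PyTorch. This is a hedged illustration, not the authors' released code: standard multi-head attention stands in for the paper's relative-position attention, and names such as SegmentRecurrentAttention and mem_len are illustrative.

```python
# Minimal sketch of segment-level recurrence: hidden states of the previous
# segment are cached, detached from the graph, and prepended to the current
# segment so attention can reach across the segment boundary.
from typing import Optional

import torch
import torch.nn as nn


class SegmentRecurrentAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int, mem_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mem_len = mem_len

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor] = None):
        # x: (batch, seg_len, d_model); memory: (batch, <=mem_len, d_model) or None
        context = x if memory is None else torch.cat([memory, x], dim=1)
        # Queries come from the current segment only; keys and values also
        # cover the cached memory, so the usable context grows beyond seg_len.
        out, _ = self.attn(query=x, key=context, value=context, need_weights=False)
        # Cache the most recent mem_len states, detached so gradients never
        # flow across segment boundaries (as described in the paper).
        new_memory = context[:, -self.mem_len:].detach()
        return out, new_memory


# Usage: process a long sequence segment by segment, carrying the memory along.
layer = SegmentRecurrentAttention(d_model=64, n_head=4, mem_len=16)
memory = None
for segment in torch.randn(8, 2, 16, 64):  # 8 segments, batch 2, length 16
    out, memory = layer(segment, memory)
```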

Cited by 2,337 publications (1,636 citation statements); References: 22 publications.

Citation statements (ordered by relevance):
“…Most noticeable is the distributed encodings of hidden states using deep neural networks, which can learn to compress, understand, and produce sentences in fluent English (Mikolov et al., 2010; Merity et al., 2017). Current state-of-the-art within language modelling is based on attention architectures (Bahdanau et al., 2014; Vaswani et al., 2017; Dai et al., 2019) and the access to immense computing resources and large datasets (Radford et al., 2018, 2019). It has been found that these large language models have a profound impact on generating contextual embeddings for NLP tasks (Peters et al., 2018; Radford et al., 2018).…”
Section: Related Work (mentioning)
confidence: 99%
“…Historically, perplexity has been the preferred method to evaluate model performance in the language modelling literature (Jurafsky and Martin, 2009; Dai et al., 2019; Merity et al., 2017). Perplexity is the exponential of the average negative log-likelihood and measures how well a language model predicts a sequence of amino acids.…”
Section: Evaluation Metric (mentioning)
confidence: 99%
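As a concrete, hedged illustration of the metric described in the excerpt above: perplexity is the exponential of the average negative log-likelihood the model assigns to a sequence. The function below is a minimal sketch with illustrative names, not code from any of the cited papers.

```python
# Illustrative perplexity computation over a token (here, amino-acid) sequence.
import math

def perplexity(token_probs: list) -> float:
    """token_probs: model probability assigned to each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A uniform model over a 20-letter amino-acid alphabet assigns p = 1/20
# to every residue, giving a perplexity of approximately 20.
print(perplexity([1 / 20] * 100))  # ~20.0
```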
“…The original structure is constructed on the hidden states generated by an RNN in order to better capture long-term dependence and align the output for the RNN decoder. The Transformer model [42] and several follow-up works [8, 10, 32] showed that for many NLP tasks, a sequence-to-sequence network structure based on attention alone, a.k.a. the self-attention mechanism, is able to outperform existing RNN structures in both accuracy and computational complexity on long sequences.…”
Section: Sequential Recommendation (mentioning)
confidence: 99%
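The self-attention mechanism the excerpt contrasts with RNNs reduces to scaled dot-product attention (Vaswani et al., 2017). The sketch below is a deliberately stripped-down illustration in which queries, keys, and values are the inputs themselves (real models use learned projections); the function name is illustrative.

```python
# Minimal scaled dot-product self-attention: every position attends to
# every other position in one step, regardless of distance.
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, seq_len, d_model); queries, keys, and values are all x."""
    d_model = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_model ** 0.5  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # attention distribution per query
    return weights @ x                                 # weighted sum of values

out = self_attention(torch.randn(2, 16, 64))  # one pass covers all pairwise interactions
```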
“…More recently, we have introduced a transformer-based model for genome annotation tasks, and have achieved state-of-the-art results for the annotation of TSSs, translation initiation sites and methylation sites [6]. The introduced architecture is adapted from the transformer-XL [7], first introduced in the field of natural language processing, and is well suited to process the long nucleotide sequence data. As the transformer architecture does not imply the relative positions of the inputs w.r.t.…”
Section: Introduction (mentioning)
confidence: 99%
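The relative-position issue raised in the last excerpt is what Transformer-XL's positional encoding scheme addresses: attention scores receive a term that depends on the distance between query and key rather than on absolute positions. The sketch below simplifies the paper's sinusoid-plus-bias formulation to a single learned bias per relative distance; class and parameter names are illustrative.

```python
# Hedged sketch of relative-position-aware attention in the spirit of
# Transformer-XL: scores between two positions get a term that depends
# only on their distance, so cached states can be reused across segments.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeBiasAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.scale = d_model ** -0.5
        self.max_len = max_len
        # One learnable bias for each relative distance in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = q.size(1)
        idx = torch.arange(seq_len)
        rel = idx[None, :] - idx[:, None] + self.max_len - 1  # distance -> bias index
        scores = (q @ k.transpose(-2, -1)) * self.scale + self.rel_bias[rel]
        return F.softmax(scores, dim=-1) @ v
```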