2022
DOI: 10.1145/3530811

Efficient Transformers: A Survey

Abstract: Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example, Transformers have become an indispensable staple in the modern deep learning stack. Recently, a dizzying number of “X-former” models have been proposed - Reformer, Linformer, Performer, Longformer, to name a few - which improve upon the original Transformer architecture…

Cited by 498 publications (340 citation statements)
References 20 publications

“…Vanilla transformer relies on the multi-head self-attention mechanism, which scales poorly with the length of the input sequence, requiring quadratic computation time and memory to store all scores that are used to compute the gradients during back-propagation (Qiu et al., 2020). Several Transformer-based models (Kitaev et al., 2020; Tay et al., 2020; Choromanski et al., 2021) have been proposed exploring efficient alternatives that can be used to process long sequences.…”
Section: Sparse-attention Transformers
confidence: 99%
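
The excerpt above describes the quadratic cost of dense self-attention. As a minimal illustration (not code from the survey or from the citing paper), the NumPy sketch below materializes the full n × n score matrix and prints how its float32 footprint grows as the sequence length doubles; the function name and sizes are chosen for the example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Dense attention: materializes an (n, n) score matrix, so time and
    memory grow quadratically with the sequence length n."""
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5                        # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # (n, d)

# Doubling the sequence length quadruples the score matrix.
for n in (1024, 2048, 4096):
    q = k = v = np.random.randn(n, 64).astype(np.float32)
    out = scaled_dot_product_attention(q, k, v)
    score_mib = n * n * 4 / 2**20                      # float32 bytes for one head's scores
    print(f"{n:5d} tokens -> score matrix takes {score_mib:7.1f} MiB per head")
```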
“…Our work depends heavily on recent advances in efficient Transformers (Tay et al., 2020) that process long sequences (Rae et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020; Roy et al., 2021). Sparse attention, relative position encoding (Shaw et al., 2018; Raffel et al., 2020; Guo et al., 2021), recurrence mechanism and memory (Dai et al., 2019; Weston et al., 2015), and other tricks (Shen et al., 2020; Katharopoulos et al., 2020; Gupta and Berant, 2020; Stock et al., 2021; Yogatama et al., 2021; Borgeaud et al., 2021; Hawthorne et al., 2022) are commonly adopted by recent Transformer variants to make the operation on long sequences more time/memory efficient.…”
Section: Related Work
confidence: 99%
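
Sparse attention is one of the efficiency tricks listed in the excerpt above. Below is a minimal sketch, assuming a sliding-window (local) pattern in the spirit of models such as Longformer: each position attends only to neighbours within a fixed window, so the number of attended entries grows linearly in the sequence length rather than quadratically. The function name and sizes are illustrative.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean mask in which position i may attend only to positions j
    with |i - j| <= window, keeping O(window) entries per row instead of n."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n, window = 8, 2
mask = sliding_window_mask(n, window)
print(mask.astype(int))                                   # banded attention pattern
print("dense entries:", n * n, "| windowed entries:", int(mask.sum()))
```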
“…Here, the input is a (fixed-length) sequence of tokens, which is then fed into multiple layers of self-attention. Lightweight versions such as DistilBERT and others (Tay et al., 2020; Fournier et al., 2021) use fewer parameters but operate on the same type of input. Recently a new family of models emerged (Tolstikhin et al., 2021; Liu et al., 2021a) which also utilize sequence-based input tokens, with an MLP-based, recurrent-free architecture.…”
Section: Introduction
confidence: 99%
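
The MLP-based, recurrent-free family mentioned in the excerpt above (e.g. MLP-Mixer-style models) replaces self-attention with linear maps applied along the token and channel axes. The sketch below is a deliberately skeletal illustration of that idea, not the published architecture: real models add layer normalization, two-layer MLPs with nonlinearities, and patch embeddings. All names here are invented for the example.

```python
import numpy as np

def token_mixing_block(x, w_tokens, w_channels):
    """Attention-free, recurrence-free block: one linear map mixes information
    across token positions, a second mixes across feature channels."""
    x = x + w_tokens @ x          # token mixing: acts along the sequence axis
    x = x + x @ w_channels        # channel mixing: acts along the feature axis
    return x

n_tokens, d_model = 16, 32
x = np.random.randn(n_tokens, d_model)
w_tokens = 0.01 * np.random.randn(n_tokens, n_tokens)
w_channels = 0.01 * np.random.randn(d_model, d_model)
print(token_mixing_block(x, w_tokens, w_channels).shape)  # (16, 32), same shape as the input
```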