2020
DOI: 10.48550/arxiv.2009.14794
Preprint

Rethinking Attention with Performers

Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to …
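As a rough illustration of the idea in the abstract, the sketch below approximates softmax attention with positive orthogonal random features in the spirit of FAVOR+. It is not the authors' reference implementation; the feature count m, the block-orthogonal sampling, and all function names are assumptions made for this example.

import numpy as np

def orthogonal_gaussian(m, d, rng):
    # Rows are orthogonal within each d-sized block and rescaled so their
    # norms match those of i.i.d. Gaussian vectors (chi(d)-distributed).
    blocks = []
    for _ in range(int(np.ceil(m / d))):
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        blocks.append(q)
    w = np.vstack(blocks)[:m]
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1, keepdims=True)
    return w * norms

def positive_features(x, w):
    # phi(x) = exp(w @ x - ||x||^2 / 2) / sqrt(m); inner products of these
    # positive features estimate the softmax kernel exp(q . k).
    m = w.shape[0]
    proj = x @ w.T
    sq = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(proj - sq) / np.sqrt(m)

def favor_style_attention(Q, K, V, m=512, seed=0):
    # Linear-complexity approximation: O(L * m * d) time and O(L * m) memory
    # instead of materialising the O(L^2) attention matrix.
    L, d = Q.shape
    w = orthogonal_gaussian(m, d, np.random.default_rng(seed))
    q = positive_features(Q / d ** 0.25, w)   # fold the 1/sqrt(d) temperature into the inputs
    k = positive_features(K / d ** 0.25, w)
    kv = k.T @ V                              # (m, d_v), linear in sequence length L
    normalizer = q @ k.sum(axis=0)            # row sums of the implicit attention matrix
    return (q @ kv) / normalizer[:, None]

# Sanity check against exact softmax attention on a short sequence.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
logits = Q @ K.T / np.sqrt(16)
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
exact = (weights / weights.sum(axis=-1, keepdims=True)) @ V
approx = favor_style_attention(Q, K, V, m=1024)
print(np.abs(exact - approx).mean())  # small, and shrinks as m grows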

Cited by 183 publications (329 citation statements)
References 22 publications
“…More recently, in the context of transformer architectures, a number of approximations have been proposed to reduce the complexity of such computations to be linear in the number of kernel points O(N). A non-exhaustive list of references include Linformers [52], Performers [53], Nyströformers [54] and Fast Transformers [55].…”
Section: Discussion (mentioning)
confidence: 99%
“…Due to the quadratic computational complexity, the computation of full attention is unaffordable when dealing with long sequence tokens. Therefore, many works design efficient transformers, aiming to reduce computational complexity (Katharopoulos et al., 2020; Choromanski et al., 2020; Lee et al., 2019; Ying et al., 2018). Current efficient transformers can be categorized into three classes.…”
Section: Related Work (mentioning)
confidence: 99%
“…Current efficient transformers can be categorized into three classes. 1) Linear approximate attention (Katharopoulos et al., 2020; Choromanski et al., 2020; Beltagy et al., 2020; Zaheer et al., 2020) approximates the full attention matrix by linearizing the softmax attention and thus can accelerate the computation by first computing the product of keys and values. 2) Inducing point-based linear transformers (Lee et al., 2019; Ying et al., 2018) use learned inducing points with fixed size to compute attention with input tokens, thus can reduce the computation to linear complexity.…”
Section: Related Work (mentioning)
confidence: 99%
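The "first computing the product of keys and values" point in the excerpt above is simply associativity of matrix multiplication. A minimal sketch (using the elu(x)+1 feature map of Katharopoulos et al., 2020 rather than Performer's random features; the sizes and variable names are illustrative assumptions) shows that both orderings give the same output at very different cost:

import numpy as np

def elu_plus_one(x):
    # Positive feature map used by linear transformers (Katharopoulos et al., 2020).
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
L, d = 1024, 64
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
q, k = elu_plus_one(Q), elu_plus_one(K)

# Quadratic ordering: materialise the full (L, L) attention matrix.
A = q @ k.T                                   # O(L^2 * d) time, O(L^2) memory
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# Linear ordering: multiply keys and values first, then apply the queries.
kv = k.T @ V                                  # (d, d_v), O(L * d * d_v)
z = k.sum(axis=0)                             # (d,)
out_linear = (q @ kv) / (q @ z)[:, None]      # O(L * d * d_v)

print(np.allclose(out_quadratic, out_linear))  # True: same result, different cost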
“…Besides, efficient transformers are proposed, which may reduce the time complexity of self-attention from quadratic to linear (or log-linear). For example, Linformer and Performer (Choromanski et al., 2020) leverage low-rank self-attention; Sparse Transformers (Child et al., 2019) and Big Bird (Zaheer et al., 2020) utilize sparse self-attention; Reformer introduces learnable attention patterns, and Synthesizer (Tay et al., 2021) introduces randomized attention patterns.…”
Section: Related Work (mentioning)
confidence: 99%
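For contrast with the low-rank and kernel approaches above, the sparse self-attention mentioned in the last excerpt can be sketched as a sliding-window pattern in which each token attends only to its local neighbourhood. This is a naive loop version for clarity; the window size and names are assumptions, not any specific model's implementation.

import numpy as np

def sliding_window_attention(Q, K, V, window=64):
    # Each query attends only to keys within +/- `window` positions, so the
    # cost is O(L * window * d) rather than O(L^2 * d).
    L, d = Q.shape
    out = np.empty_like(V)
    for i in range(L):
        lo, hi = max(0, i - window), min(L, i + window + 1)
        logits = Q[i] @ K[lo:hi].T / np.sqrt(d)
        weights = np.exp(logits - logits.max())
        out[i] = (weights / weights.sum()) @ V[lo:hi]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 32)) for _ in range(3))
print(sliding_window_attention(Q, K, V).shape)  # (512, 32)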