“…Models based on Transformers (Vaswani et al., 2017), such as BERT (Devlin et al., 2018) and its variants (Yang et al., 2019; Lan et al., 2019; Raffel et al., 2019), yield state-of-the-art results on many NLP tasks, such as language modeling (Child et al., 2019; Sukhbaatar et al., 2019; Rae et al., 2019; Kitaev et al., 2020), question answering (Lan et al., 2019; Zaheer et al., 2020; Beltagy et al., 2020), and summarization (Zhang et al., 2019). However, existing studies show that they exhibit poor compositional generalization.…”