2017
DOI: 10.48550/arxiv.1706.03762
Preprint

Attention Is All You Need

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
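
The "attention mechanisms" the abstract refers to are built on scaled dot-product attention, softmax(QK^T / √d_k)V. As a rough illustration only (not the authors' code), the sketch below computes that quantity with NumPy; the array shapes, function name, and toy inputs are assumptions made for the example.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
        return weights @ V                                    # weighted sum of values

    # toy usage: 4 queries attending over 6 key/value pairs
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
    out = scaled_dot_product_attention(Q, K, V)               # shape (4, 16)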

Cited by 9,823 publications (4,913 citation statements) · References 22 publications

Citation statements (ordered by relevance):
“…Consequently, inside the language stream, the multiplication of the Query matrix (Q_L) from the language stream and the Key matrix (K_V) from the visual stream produces attention scores over the different image regions based on the question. These attention scores are then passed through a softmax operation to generate the respective attention probabilities, A_i^h = softmax(Q_L K_V^T / √d_k), where i is the co-attention layer number, h is the attention head number, and √d_k is a scaling factor [Vaswani et al., 2017]. These probabilities over the 8 attention heads capture the modulations from each text token to different image regions.…”
Section: Attention Map Generation (mentioning)
confidence: 99%
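
As a loose illustration of the cross-modal attention this statement describes (language-stream queries attending over image regions), the following sketch computes per-head attention probabilities with NumPy. The tensor shapes, the way features are split across the 8 heads, and all variable names are assumptions for the example, not the cited model's implementation.

    import numpy as np

    def co_attention_probs(Q_L, K_V, n_heads=8):
        """Q_L: (n_tokens, d_model), K_V: (n_regions, d_model) -> (n_heads, n_tokens, n_regions)."""
        d_model = Q_L.shape[-1]
        d_k = d_model // n_heads
        probs = []
        for h in range(n_heads):
            sl = slice(h * d_k, (h + 1) * d_k)                 # per-head feature slice (assumed layout)
            scores = Q_L[:, sl] @ K_V[:, sl].T / np.sqrt(d_k)  # (n_tokens, n_regions)
            e = np.exp(scores - scores.max(axis=-1, keepdims=True))
            probs.append(e / e.sum(axis=-1, keepdims=True))    # softmax over image regions
        return np.stack(probs)

    # toy usage: 12 question tokens attending over 36 image regions
    rng = np.random.default_rng(1)
    A = co_attention_probs(rng.normal(size=(12, 64)), rng.normal(size=(36, 64)))  # (8, 12, 36)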
“…Figure 1: Co-attention transformer layer. Recently, there has been an exciting trend of extending the successful transformer architecture [Vaswani et al., 2017] to solve multi-modal tasks combining modalities including text, audio, images, and videos [Chuang et al., 2019; Gabeur et al., 2020; Sun et al., 2019]. This trend has led to significant improvements in state-of-the-art models for Vision-Language tasks like visual grounding, referring expressions, and visual question answering.…”
Section: Introduction (mentioning)
confidence: 99%
“…Cov(N) = σ²I_{W×H×L}, and ⊕ denotes element-wise addition. Motivated by the concepts of the potential of attention [32] and Region of Interest (RoI) [48], this paper devises a simple yet efficient deep-network-based image denoiser with versatility (see Figure 1), in which the training process requires merely an end-to-end learned dual-self-attention region A = {A_1, A_2} ∈ ℝ^{R×C×L} within a single noisy image. Equivalently, this paper studies how to train a region-based image denoiser…”
Section: Introduction (mentioning)
confidence: 99%
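
For concreteness, here is a minimal sketch of the additive noise model implied by Cov(N) = σ²I (zero-mean Gaussian noise added element-wise), with a plain rectangular crop as a stand-in for the cited paper's learned dual-self-attention regions. Every function name, shape, and parameter below is illustrative, not the cited method.

    import numpy as np

    def noisy_observation(x, sigma, rng):
        """x: clean image of shape (W, H, L); returns x + N with Cov(N) = sigma^2 * I."""
        return x + sigma * rng.normal(size=x.shape)        # element-wise addition of i.i.d. noise

    def crop_region(img, top, left, R, C):
        """Illustrative stand-in for a region A of shape (R, C, L) taken from the image."""
        return img[top:top + R, left:left + C, :]

    rng = np.random.default_rng(2)
    clean = rng.uniform(size=(64, 64, 3))
    y = noisy_observation(clean, sigma=0.1, rng=rng)
    region = crop_region(y, 8, 8, 16, 16)                  # (16, 16, 3) region of the noisy image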
“…Recent advances in deep learning have thrived under the theme "bigger is better". Modern neural networks yield super-human performance on problems such as image classification and semantic segmentation by introducing higher model complexity [1,2]. However, the training of large networks requires large datasets.…”
mentioning
confidence: 99%
“…Cycling Flow: process flow diagram of the synchronization steps during the cycling phase, where t is the batch number and S is the number of batches to wait before global synchronization. The weighted average is calculated as shown in Eq. (2).…”
mentioning
confidence: 99%
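
The excerpt does not reproduce Eq. (2), so the following sketch only illustrates the general pattern it describes: workers take independent local steps and, every S batches, replace their parameters with a global weighted average. The size-weighted mean and all names here are stand-ins assumed for the example, not the cited paper's formula.

    import numpy as np

    def global_sync(worker_weights, worker_sizes):
        """Weighted average of per-worker parameter vectors (stand-in for Eq. (2))."""
        w = np.asarray(worker_sizes, dtype=float)
        w /= w.sum()
        return sum(wi * theta for wi, theta in zip(w, worker_weights))

    def cycling_phase(workers, sizes, local_step, n_batches, S):
        """Run n_batches local steps per worker, synchronizing every S batches."""
        for t in range(1, n_batches + 1):
            workers = [local_step(theta) for theta in workers]   # independent local updates
            if t % S == 0:                                       # wait S batches, then sync globally
                avg = global_sync(workers, sizes)
                workers = [avg.copy() for _ in workers]
        return workers

    # toy usage: 3 workers with dummy "training" steps that shrink the parameters
    rng = np.random.default_rng(3)
    workers = [rng.normal(size=10) for _ in range(3)]
    workers = cycling_phase(workers, sizes=[100, 200, 50],
                            local_step=lambda th: th - 0.01 * th, n_batches=20, S=5)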