Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.324
Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention

Abstract: Self-Attention Networks (SANs) are an integral part of successful neural architectures such as Transformer (Vaswani et al., 2017), and thus of pretrained language models such as BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020). Training SANs on a task or pretraining them on language modeling requires large amounts of data and compute resources. We are searching for modifications to SANs that enable faster learning, i.e., higher accuracies after fewer update steps. We investigate three modifications to …
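The abstract is truncated before the three modifications are described in detail. As a rough illustration of the "learnable temperature" idea named in the title, the sketch below replaces the fixed 1/sqrt(d_k) scaling of standard scaled dot-product attention with a learnable scalar; the class name, initialisation, and single-head setup are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureAttention(nn.Module):
    """Single-head self-attention with a learnable temperature.

    A minimal sketch: the fixed 1/sqrt(d_k) scaling of scaled
    dot-product attention is replaced by a learnable scalar,
    initialised to the usual value.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # learnable inverse temperature, starts at 1/sqrt(d_model)
        self.inv_temperature = nn.Parameter(torch.tensor(d_model ** -0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.q(x) @ self.k(x).transpose(-2, -1)  # (batch, seq, seq)
        attn = F.softmax(scores * self.inv_temperature, dim=-1)
        return attn @ self.v(x)

x = torch.randn(2, 5, 64)
print(TemperatureAttention(64)(x).shape)  # torch.Size([2, 5, 64])
```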

Cited by 4 publications (6 citation statements)
References: 15 publications
“…A different line of research focuses on integrating position within the attention mechanism (Shaw et al., 2018; Dai et al., 2019; Dufter et al., 2020; Huang et al., 2020; Raffel et al., 2020; Ke et al., 2020; He et al., 2021; Wu et al., 2021). They all improve over Transformer models for various tasks by modifying word and position interactions within the attention matrix and introducing relative position representations as a scalar or vector.…”
Section: Related Work
confidence: 99%
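To make "modifying word and position interactions within the attention matrix" concrete, the sketch below adds a learned scalar bias per clipped relative distance to the attention logits, in the spirit of the scalar variant mentioned in the excerpt; the clipping distance and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeBiasAttention(nn.Module):
    """Sketch: attention logits augmented with one learned scalar
    bias per clipped relative distance between query and key."""

    def __init__(self, d_model: int, max_distance: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.max_distance = max_distance
        # one learnable scalar per relative distance in [-max, +max]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_distance + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        scores = self.q(x) @ self.k(x).transpose(-2, -1) * self.scale
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_distance, self.max_distance)
        scores = scores + self.rel_bias[rel + self.max_distance]  # (n, n) bias, broadcast over batch
        attn = F.softmax(scores, dim=-1)
        return attn @ self.v(x)
```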
“…One line of such work is to add position embeddings to the input before it is fed to the actual Transformer model (Vaswani et al., 2017; Shaw et al., 2018; Devlin et al., 2019; Kitaev et al., 2020; Press et al., 2020; Wang et al., 2020). The second line of work directly modifies the attention matrix (Dai et al., 2019; Dufter et al., 2020; He et al., 2020; Wu et al., 2021a; Ke et al., 2021; Su et al., 2021). The last combines the two approaches.…”
Section: Related Work
confidence: 99%
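For contrast with the attention-matrix approach sketched above, the snippet below illustrates the first line of work: learned absolute position embeddings added to the token embeddings before the Transformer layers, as in BERT. The vocabulary size, maximum length, and model dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InputPositionEmbedding(nn.Module):
    """Sketch: learned absolute position embeddings summed with
    token embeddings before the encoder stack (BERT-style)."""

    def __init__(self, vocab_size: int = 30522, max_len: int = 512, d_model: int = 768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)[None, :, :]
```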
“…It is noticed that several studies focus on enriching the position information of BERT to improve the performance of natural language understanding (Dai et al., 2019; Dufter et al., 2020; He et al., 2020; Wu et al., 2021a; Ke et al., 2021), e.g., introducing extra learnable parameters to trace the word order. Previous analyses also indicate that the lower layers of BERT tend to capture rich surface-level language structural information such as position information (Jawahar et al., 2019).…”
Section: Introduction
confidence: 99%
“…A similar idea has been explored in (Dufter et al., 2020), where, in a more limited setting, i.e., in the context of PoS-tagging, absolute or relative position biases are learned instead of full position embeddings.…”
Section: Relative Position Encodings
confidence: 99%
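A minimal sketch of what a learned position bias, as opposed to a full position embedding, can look like: one scalar per absolute key position added to the raw attention logits. This is an illustrative reading of the cited setting, not the exact model of Dufter et al. (2020).

```python
import torch
import torch.nn as nn

class AbsolutePositionBias(nn.Module):
    """Sketch: one learnable scalar bias per absolute key position,
    added to the attention logits instead of learning full
    d_model-sized position embeddings."""

    def __init__(self, max_len: int = 128):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(max_len))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, seq_len, seq_len) unnormalised attention logits
        n = scores.size(-1)
        return scores + self.bias[:n][None, None, :]
```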
“…[Excerpt from a comparison table of relative position encoding methods and their parameter counts, listing Transformer-XL (Dai et al., 2019), DA-Transformer, TUPE (Ke et al., 2020), and Dufter et al. (2020).]…”
confidence: 99%