Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.324
Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention

Abstract: Self-Attention Networks (SANs) are an integral part of successful neural architectures such as Transformer (Vaswani et al., 2017), and thus of pretrained language models such as BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020). Training SANs on a task or pretraining them on language modeling requires large amounts of data and compute resources. We are searching for modifications to SANs that enable faster learning, i.e., higher accuracies after fewer update steps. We investigate three modifications to …
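The abstract is truncated before the three modifications are described in detail. As a rough illustration of the "learnable temperature" idea named in the title, the sketch below replaces the fixed 1/sqrt(d_k) scaling of standard scaled dot-product attention with a learnable scalar; the class name, initialisation, and single-head setup are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureAttention(nn.Module):
    """Single-head self-attention with a learnable temperature.

    A minimal sketch: the fixed 1/sqrt(d_k) scaling of scaled
    dot-product attention is replaced by a learnable scalar,
    initialised to the usual value.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # learnable inverse temperature, starts at 1/sqrt(d_model)
        self.inv_temperature = nn.Parameter(torch.tensor(d_model ** -0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.q(x) @ self.k(x).transpose(-2, -1)  # (batch, seq, seq)
        attn = F.softmax(scores * self.inv_temperature, dim=-1)
        return attn @ self.v(x)

x = torch.randn(2, 5, 64)
print(TemperatureAttention(64)(x).shape)  # torch.Size([2, 5, 64])
```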

Cited by 4 publications (6 citation statements)
References: 15 publications
“…A different line of research focuses on integrating position within the attention mechanism (Shaw et al., 2018; Dai et al., 2019; Dufter et al., 2020; Huang et al., 2020; Raffel et al., 2020; Ke et al., 2020; He et al., 2021; Wu et al., 2021). They all improve over Transformer models for various tasks by modifying word and position interactions within the attention matrix and introducing relative position representations as a scalar or vector.…”
Section: Related Work
confidence: 99%
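To make "modifying word and position interactions within the attention matrix" concrete, the sketch below adds a learned scalar bias per clipped relative distance to the attention logits, in the spirit of the scalar variant mentioned in the excerpt; the clipping distance and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeBiasAttention(nn.Module):
    """Sketch: attention logits augmented with one learned scalar
    bias per clipped relative distance between query and key."""

    def __init__(self, d_model: int, max_distance: int = 16):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5
        self.max_distance = max_distance
        # one learnable scalar per relative distance in [-max, +max]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_distance + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        scores = self.q(x) @ self.k(x).transpose(-2, -1) * self.scale
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_distance, self.max_distance)
        scores = scores + self.rel_bias[rel + self.max_distance]  # (n, n) bias, broadcast over batch
        attn = F.softmax(scores, dim=-1)
        return attn @ self.v(x)
```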
“…One line of such work is to add position embeddings to the input before it is fed to the actual Transformer model (Vaswani et al., 2017; Shaw et al., 2018; Devlin et al., 2019; Kitaev et al., 2020; Press et al., 2020; Wang et al., 2020). The second line of work directly modifies the attention matrix (Dai et al., 2019; Dufter et al., 2020; He et al., 2020; Wu et al., 2021a; Ke et al., 2021; Su et al., 2021). The last combines the two approaches.…”
Section: Related Work
confidence: 99%
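For contrast with the attention-matrix approach sketched above, the snippet below illustrates the first line of work: learned absolute position embeddings added to the token embeddings before the Transformer layers, as in BERT. The vocabulary size, maximum length, and model dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InputPositionEmbedding(nn.Module):
    """Sketch: learned absolute position embeddings summed with
    token embeddings before the encoder stack (BERT-style)."""

    def __init__(self, vocab_size: int = 30522, max_len: int = 512, d_model: int = 768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)[None, :, :]
```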
“…It is noticed that several studies focus on enriching the position information of BERT to improve the performance of natural language understanding (Dai et al., 2019; Dufter et al., 2020; He et al., 2020; Wu et al., 2021a; Ke et al., 2021), e.g., introducing extra learnable parameters to trace the word order. Previous analyses also indicate that the lower layers of BERT tend to capture rich surface-level language structural information such as position information (Jawahar et al., 2019).…”
Section: Introduction
confidence: 99%
“…A similar idea has been explored in (Dufter et al., 2020), where, in a more limited setting, i.e., in the context of PoS-tagging, absolute or relative position biases are learned instead of full position embeddings.…”
Section: Relative Position Encodings
confidence: 99%
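A minimal sketch of what a learned position bias, as opposed to a full position embedding, can look like: one scalar per absolute key position added to the raw attention logits. This is an illustrative reading of the cited setting, not the exact model of Dufter et al. (2020).

```python
import torch
import torch.nn as nn

class AbsolutePositionBias(nn.Module):
    """Sketch: one learnable scalar bias per absolute key position,
    added to the attention logits instead of learning full
    d_model-sized position embeddings."""

    def __init__(self, max_len: int = 128):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(max_len))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, seq_len, seq_len) unnormalised attention logits
        n = scores.size(-1)
        return scores + self.bias[:n][None, None, :]
```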
“…[Excerpt from a comparison table of relative position encoding methods and their parameter counts, listing Transformer-XL (Dai et al., 2019), DA-Transformer, TUPE (Ke et al., 2020), and Dufter et al. (2020).]…”
confidence: 99%