Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.298

Improve Transformer Models with Better Relative Position Embeddings

Abstract: Transformer architectures rely on explicit position encodings in order to preserve a notion of word order. In this paper, we argue that existing work does not fully utilize position information. For example, the initial proposal of a sinusoid embedding is fixed and not learnable. In this paper, we first review absolute position embeddings and existing methods for relative position embeddings. We then propose new techniques that encourage increased interaction between query, key and relative position embeddings…
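The proposal summarized in the abstract centers on letting the query, key, and relative position embeddings all interact inside the attention logits. Below is a minimal PyTorch sketch of one such formulation (query-key, query-relative, and key-relative dot products summed before the softmax); the class name, single-head layout, shapes, and the ±8 clipping window are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: self-attention whose logits combine query-key, query-relative,
# and key-relative terms, in the spirit of increased interaction between query,
# key, and relative position embeddings. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_rel_dist: int = 8):
        super().__init__()
        self.d = d_model
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One embedding per clipped relative offset in [-max_rel_dist, max_rel_dist].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x):                                   # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Relative offsets j - i, clipped to the window and shifted to be >= 0.
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        r = self.rel_emb(rel + self.max_rel_dist)            # (T, T, d_model)

        content = torch.einsum("bid,bjd->bij", q, k)         # query-key term
        q_rel   = torch.einsum("bid,ijd->bij", q, r)         # query-relative term
        k_rel   = torch.einsum("bjd,ijd->bij", k, r)         # key-relative term

        logits = (content + q_rel + k_rel) / self.d ** 0.5
        attn = F.softmax(logits, dim=-1)
        return torch.einsum("bij,bjd->bid", attn, v)
```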

Cited by 60 publications (44 citation statements) | References 10 publications

Citation statements (ordered by relevance):
“…When we generated an additional lightweight convolution based on keys, the model performed worse than composite attention alone (GLUE 74.0 compared to 75.2). This result clarifies the findings of Huang et al. (2020), who reported only small improvements from query and key-based relative position embeddings for a subset of the GLUE tasks. Grammaticality judgments were particularly sensitive to position information.…”
Section: Composite Attention Performed the Best
Citation type: supporting
confidence: 88%
“…All of our experiments used a convolution kernel size of 17, or eight positions in each direction, a mid-range value that has been found to work well for both relative positions and convolution in language models (Huang et al., 2020; Jiang et al., 2020; Shaw et al., 2018). As in Shaw et al. (2018), relative embeddings W^C_{j-i} shared weights across heads.…”
Section: Dynamic Convolution (Relative Embeddings)
Citation type: mentioning
confidence: 99%
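For readers unfamiliar with the clipping scheme in the quote above, here is a small, hypothetical PyTorch sketch of how a window of eight positions in each direction yields 2*8 + 1 = 17 distinct relative offsets, and how a single embedding table for those offsets can be shared across attention heads; variable names and sizes are illustrative, not taken from the cited papers.

```python
# Sketch: clipped relative offsets and one embedding table shared by all heads.
import torch
import torch.nn as nn

MAX_DIST = 8                                   # eight positions each direction
NUM_OFFSETS = 2 * MAX_DIST + 1                 # "kernel size" 17

def clipped_relative_positions(seq_len: int) -> torch.Tensor:
    """Matrix of relative offsets j - i, clipped to [-MAX_DIST, MAX_DIST]."""
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).clamp(-MAX_DIST, MAX_DIST)

head_dim = 64
shared_rel_emb = nn.Embedding(NUM_OFFSETS, head_dim)   # one table for all heads

rel = clipped_relative_positions(seq_len=10)           # (10, 10)
rel_vectors = shared_rel_emb(rel + MAX_DIST)           # (10, 10, head_dim)
# Every head looks up the same rel_vectors, so position information costs only
# NUM_OFFSETS * head_dim parameters regardless of the number of heads.
```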
“…For any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}. According to recent progress (Huang et al., 2020), learnable PE and relative position embeddings can help to further improve BERT's performance. Therefore, in the refined BERT model, we use learnable PE and relative position representations.…”
Section: Embedding Module
Citation type: mentioning
confidence: 99%
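For reference, the linear-function property quoted above follows from the angle-addition identities of the standard sinusoidal encoding; a short LaTeX sketch of the per-frequency rotation is:

```latex
% Per 2-dimensional frequency block of the sinusoidal encoding, with
% \omega_i = 10000^{-2i/d_{\mathrm{model}}}: the matrix depends only on the
% offset k, not on pos, so PE_{pos+k} is a fixed linear transform of PE_{pos}.
\[
\begin{pmatrix} \sin\bigl(\omega_i (pos+k)\bigr) \\ \cos\bigl(\omega_i (pos+k)\bigr) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
\]
```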