2021
DOI: 10.48550/arxiv.2102.11090
Preprint

Position Information in Transformers: An Overview

Abstract: Transformers are arguably the main workhorse in recent Natural Language Processing research. By definition, a Transformer is invariant with respect to reorderings of the input. However, language is inherently sequential, and word order is essential to the semantics and syntax of an utterance. In this paper, we provide an overview of common methods to incorporate position information into Transformer models. The objectives of this survey are to i) showcase that position information in Transformer is a vibrant and…
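As a concrete illustration of how order information can be injected into an otherwise permutation-invariant model, below is a minimal NumPy sketch of the absolute sinusoidal position encoding of Vaswani et al. (2017), one of the canonical methods covered by such surveys; the sequence length and model dimension are arbitrary illustrative choices, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Absolute sinusoidal position encodings (Vaswani et al., 2017).

    Returns an array of shape (seq_len, d_model) that is added to the token
    embeddings so that self-attention can distinguish positions.
    d_model is assumed to be even.
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Adding the encoding to (hypothetical) token embeddings breaks permutation
# invariance: shuffling the rows of `token_embeddings` now changes the input.
token_embeddings = np.random.randn(10, 64)          # 10 tokens, d_model = 64
model_input = token_embeddings + sinusoidal_position_encoding(10, 64)
```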

Cited by 11 publications (16 citation statements)
References 20 publications
“…However, a number of works demonstrate that such permutation has little to no impact during the pre-training and fine-tuning stages (Pham et al., 2020; Sinha et al., 2020, 2021; O'Connor and Andreas, 2021; Hessel and Schofield, 2021; Gupta et al., 2021). These findings contradict the common understanding of how hierarchical and structural information is encoded in LMs (Rogers et al., 2020), and may even question whether word order is modeled with the position embeddings (Dufter et al., 2021).…”
Section: Introduction
confidence: 90%
“…Various PEs have been proposed to utilize the information about word order in Transformer-based LMs (Dufter et al., 2021). Surprisingly, little is known about what PEs capture and how well they learn the meaning of positions.…”
Section: Positional Encoding
confidence: 99%
“…To impose spatial biases, we found that conventional positional embeddings do not form meaningful biases, and we use a relative position bias [9, 24] instead. The bias is a matrix B ∈ R^{(2r+1)×(2r+1)}, added to the computed attention, where r is the radius specifying the local range of the bias.…”
Section: Semantic Smoothing Transformer
confidence: 99%
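The following is a minimal NumPy sketch of such a relative position bias for an H×W grid of tokens. It assumes that position pairs whose row or column offset exceeds the radius r simply receive no bias; the cited work may handle out-of-range offsets differently, and all names here are illustrative.

```python
import numpy as np

def add_relative_position_bias(attn: np.ndarray, B: np.ndarray,
                               H: int, W: int, r: int) -> np.ndarray:
    """Add a relative position bias B of shape (2r+1, 2r+1) to an attention
    map over an H*W grid of tokens.

    attn has shape (H*W, H*W); pairs of positions whose row/column offset
    exceeds the radius r receive no bias (one possible convention).
    """
    biased = attn.copy()
    coords = [(i, j) for i in range(H) for j in range(W)]
    for q, (qi, qj) in enumerate(coords):          # query position
        for k, (ki, kj) in enumerate(coords):      # key position
            di, dj = ki - qi, kj - qj              # relative offset
            if abs(di) <= r and abs(dj) <= r:
                biased[q, k] += B[di + r, dj + r]
    return biased

# Toy usage: a 4x4 grid of tokens with radius r = 2.
H, W, r = 4, 4, 2
attn = np.random.randn(H * W, H * W)          # pre-softmax attention logits
B = np.random.randn(2 * r + 1, 2 * r + 1)     # learnable parameter in practice
attn_with_bias = add_relative_position_bias(attn, B, H, W, r)
```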
“…These architectures integrate structural and positional attributes of data when building abstract feature representations. For instance, ConvNets intrinsically consider the regular spatial structure of pixel positions, RNNs build on the sequential structure of word positions, and Transformers employ positional encodings of words (see Dufter et al. (2021) for a review). For GNNs, the position of nodes is more challenging because there is no canonical positioning of nodes in arbitrary graphs.…”
Section: B2 Graph Positional Encoding
confidence: 99%
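One widely used answer to the lack of canonical node positions, not claimed here as the cited work's specific method, is to derive node "positions" from the spectrum of the graph Laplacian. Below is a minimal NumPy sketch of Laplacian eigenvector positional encoding, assuming an undirected graph given as a dense adjacency matrix; the helper name and toy graph are illustrative.

```python
import numpy as np

def laplacian_positional_encoding(adj: np.ndarray, k: int) -> np.ndarray:
    """Laplacian eigenvector positional encoding for a graph.

    Uses the k eigenvectors of the symmetric normalized Laplacian with the
    smallest non-zero eigenvalues as node coordinates. The sign of each
    eigenvector is arbitrary, so random sign flipping is often applied
    during training.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]             # drop the trivial first eigenvector

# Toy usage: a 4-node cycle graph, 2-dimensional positional features per node.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
pos = laplacian_positional_encoding(adj, k=2)   # shape (4, 2)
```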