2021
DOI: 10.48550/arxiv.2110.10090
Preprint
Inductive Biases and Variable Creation in Self-Attention Mechanisms

Abstract: Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules, where our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables: they can represent spar…
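For reference, here is a minimal sketch of the kind of module the abstract refers to: a single scaled dot-product self-attention head, in which every output position is a softmax-weighted mixture of value vectors over the whole context. This is an illustrative standard construction, not code from the paper; the dimensions and random weights below are assumptions made for the example.

```python
# Minimal single-head self-attention sketch (standard scaled dot-product
# attention). Shapes and weights are illustrative, not from the paper.
import numpy as np

def self_attention_head(X, W_q, W_k, W_v):
    """X: (T, d) sequence of T token embeddings of dimension d."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # pairwise similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over the context
    return weights @ V                            # each output mixes the full context

rng = np.random.default_rng(0)
T, d, d_head = 8, 16, 4                           # context length, embedding dim, head dim
X = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d_head)) for _ in range(3))
print(self_attention_head(X, W_q, W_k, W_v).shape)  # (8, 4)
```

Because the attention weights couple every position to every other position, the head can pick out a small subset of relevant tokens regardless of how far apart they are, which is the long-range, sparse-selection behavior the paper's analysis formalizes.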

Cited by 1 publication (3 citation statements)
References 26 publications

“…Our work focuses on the analysis of Maximum Likelihood Estimate (MLE) with the transformer function class, which is not covered by previous works. Our bounds are sharper than those of Edelman et al. (2021) in the channel-number dependency.…”
mentioning
confidence: 57%
“…Following this line, Liao et al. (2020), Ledent et al. (2021) and Lin and Zhang (2019) built generalization bounds for graph neural networks and convolutional neural networks. These results respected the underlying graph structure and the translation invariance in the networks. Edelman et al. (2021) established a generalization bound for the transformer, but this result did not reflect the permutation invariance, still depending on the channel number.…”
mentioning
confidence: 84%