2021
DOI: 10.48550/arxiv.2110.03303
Preprint

Universal Approximation Under Constraints is Possible with Transformers

Abstract: Many practical problems need the output of a machine learning model to satisfy a set of constraints, K. There are, however, no known guarantees that classical neural networks can exactly encode constraints while simultaneously achieving universality. We provide a quantitative constrained universal approximation theorem which guarantees that for any convex or non-convex compact set K and any continuous function f : R^n → K, there is a probabilistic transformer F whose randomized outputs all lie in K and whose e…
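
Read informally, the quantitative guarantee described in the abstract has roughly the shape sketched below. This is a paraphrase based on the abstract alone; the symbol dist stands in for the paper's actual output metric, and the ambient output dimension m is notation introduced here for illustration, not taken from the paper.

```latex
% Informal restatement, paraphrased from the abstract only; the precise
% approximation metric and the probabilistic-transformer parameterization
% are those defined in the paper itself.
\[
\forall\, K \subset \mathbb{R}^m \text{ compact (convex or not)},\;
\forall\, f \in C(\mathbb{R}^n, K),\;
\forall\, X \subset \mathbb{R}^n \text{ compact},\;
\forall\, \varepsilon > 0:\;
\exists\, \hat{F} \text{ (a probabilistic transformer)}
\]
\[
\text{such that } \hat{F}(x) \in K \text{ for every } x \in X
\text{ and every realization of } \hat{F},
\qquad
\sup_{x \in X} \operatorname{dist}\bigl(\hat{F}(x), f(x)\bigr) \le \varepsilon .
\]
```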

Cited by 3 publications (4 citation statements) | References 30 publications
“…Recently, an increasing number of researchers have begun to explore the representation power of the transformer in 3D (Liu et al, 2019a; Fuchs et al, 2020; Misra et al, 2021; Mao et al, 2021; Sander et al, 2022). Another important line of work seeks to theoretically demonstrate the representation power of the transformer by showing universal approximation of continuous sequence-to-sequence functions (Yun et al, 2019; Zaheer et al, 2020; Shi et al, 2021; Kratsios et al, 2021). To be specific, Yun et al (2019) demonstrated the universal approximation property of the transformer; Yun et al (2020) and Zaheer et al (2020) demonstrated that the transformer with a sparse attention matrix remains a universal approximator; Shi et al (2021) claimed that the transformer without diag-attention is still a universal approximator.…”
Section: Related Work
confidence: 99%
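
To make the objects these results reason about concrete, here is a minimal single-head self-attention sketch in NumPy, with an optional binary mask that disallows some token pairs. The banded mask in the usage example is a hypothetical illustration of a sparse attention pattern, not the specific constructions of Yun et al (2020) or Zaheer et al (2020).

```python
# Minimal scaled dot-product self-attention (single head), illustrating the
# "attention matrix" that the cited universality results reason about.
# The sparse `mask` below is a hypothetical example, not the constructions
# studied in the cited works.
import numpy as np

def self_attention(X, Wq, Wk, Wv, mask=None):
    """X: (tokens, d_model); Wq/Wk/Wv: (d_model, d_head); mask: (tokens, tokens) of {0,1}."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (tokens, tokens) attention logits
    if mask is not None:                               # sparse attention: disallowed pairs get -inf
        scores = np.where(mask.astype(bool), scores, -np.inf)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)              # row-stochastic attention matrix
    return A @ V                                       # attended values

# Toy usage: 4 tokens, banded (sparse) attention pattern
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
band = np.eye(4) + np.eye(4, k=1) + np.eye(4, k=-1)    # each token attends to its neighbours
out = self_attention(X, Wq, Wk, Wv, mask=band)
```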
“…To be specific, Yun et al (2019) demonstrated the universal approximation property of the transformer; Yun et al (2020) and Zaheer et al (2020) demonstrated that the transformer with a sparse attention matrix remains a universal approximator; Shi et al (2021) claimed that the transformer without diag-attention is still a universal approximator. Kratsios et al (2021) showed that universal approximation under constraints is possible for the transformer.…”
Section: Related Work
confidence: 99%
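
As a rough intuition for why attention-style output layers lend themselves to exactly encoding a constraint set, the sketch below forces every output to be a convex combination of anchor points that lie in K, so the output stays in K whenever K is convex. This is a conceptual illustration only, under the assumption that K is convex; it is not the construction of Kratsios et al (2021), whose probabilistic transformer handles non-convex K by randomizing its outputs over points of K.

```python
# Conceptual sketch (not the authors' construction): emit softmax weights over
# a fixed dictionary of "anchor" points known to lie in K and return their
# convex combination.  For convex K the output is guaranteed to remain in K.
import numpy as np

def constrained_output(features, W, anchors):
    """features: (d,); W: (num_anchors, d); anchors: (num_anchors, m), each row a point of K."""
    logits = W @ features
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()        # weights on the probability simplex
    return weights @ anchors                 # convex combination of points of K

# Toy usage: K = closed unit disk in R^2, anchors sampled inside K
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, size=16)
r = np.sqrt(rng.uniform(0, 1, size=16))
anchors = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)  # points in the disk
W = rng.normal(size=(16, 8))
y = constrained_output(rng.normal(size=8), W, anchors)              # guaranteed to lie in the disk
```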
“…To date, there are few, if any, Jackson- or Bernstein-type results for sequence modelling using the Transformer. We mention a related series of works on static function approximation with a variant of the Transformer architecture [1, 48, 49]. Here, the targets are continuous functions H : [0, 1]^τ → K, where K ⊂ R^n is a compact set.…”
Section: Attention-based Architectures
confidence: 99%
“…Self-attention-based models such as transformers have also been studied theoretically from several perspectives. Many works have focused on their approximation capability [16, 18, 27, 56]. Studies have also been done on the Turing completeness [34, 53], in-context learning [9, 55], and inductive bias [11] of these models.…”
Section: Self-attention
confidence: 99%