2021
DOI: 10.48550/arxiv.2110.11773
Preprint

Sinkformers: Transformers with Doubly Stochastic Attention

Abstract: Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer. We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices…
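
To make the idea concrete, the following is a minimal NumPy sketch (an illustration under stated assumptions, not the authors' released implementation) contrasting the usual row-wise SoftMax, which only makes each row of the attention matrix sum to 1, with Sinkhorn's algorithm, which alternates row and column normalizations so that a square attention matrix approaches a doubly stochastic one. The function names, the fixed iteration count, and the toy 4-token example are illustrative assumptions.

    import numpy as np

    def softmax_rows(scores):
        # Row-wise SoftMax: every row of the resulting attention matrix sums to 1.
        z = scores - scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def sinkhorn_attention(scores, n_iters=50, eps=1e-9):
        # Sinkhorn normalization: alternately rescale rows and columns of exp(scores).
        # For a square matrix with positive entries this converges to a doubly
        # stochastic matrix (rows and columns each sum to 1).
        K = np.exp(scores - scores.max())  # positive kernel, numerically stabilized
        for _ in range(n_iters):
            K = K / (K.sum(axis=-1, keepdims=True) + eps)  # normalize rows
            K = K / (K.sum(axis=-2, keepdims=True) + eps)  # normalize columns
        return K

    rng = np.random.default_rng(0)
    scores = rng.normal(size=(4, 4))               # toy query-key dot products for 4 tokens
    A_soft = softmax_rows(scores)                  # row-stochastic only
    A_sink = sinkhorn_attention(scores)            # approximately doubly stochastic
    print(A_soft.sum(axis=1))                      # rows sum to 1, columns generally do not
    print(A_sink.sum(axis=1), A_sink.sum(axis=0))  # both row and column sums are close to 1

In the Sinkformer, this kind of normalization replaces the SoftMax inside each attention head; the sketch above only shows the normalization step itself.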

Cited by 3 publications (3 citation statements)
References 27 publications
“…Recently, some attempts have been made to design neural networks to imitate the Sinkhorn-based algorithms of OT problems, such as the Gumbel-Sinkhorn network [55], the sparse Sinkhorn attention model [56], the Sinkhorn autoencoder [57], and the Sinkhorn-based transformer [58]. Focusing on pooling layers, some OT-based solutions have been proposed as well.…”
Section: Optimal Transport-based Machine Learning (mentioning; confidence: 99%)
“…Besides the Sinkhorn algorithm, some other algorithms have been developed, e.g., the Bregman ADMM (Wang & Banerjee, 2014; Ye et al., 2017; Xu, 2020) and the smoothed semi-dual algorithm (Blondel et al., 2018). More recently, some attempts have been made to design neural networks to imitate the Sinkhorn-based algorithms of OT problems, e.g., the Gumbel-Sinkhorn network (Mena et al., 2018), the sparse Sinkhorn attention model (Tay et al., 2020), the Sinkhorn autoencoder (Patrini et al., 2020), and the Sinkhorn-based transformer (Sander et al., 2021). However, these methods ignore the potential of other algorithms.…”
Section: Related Work (mentioning; confidence: 99%)
“…Finally, we would like to mention that more recently in [24] the authors propose Sinkformers, a variation of the transformer architecture [25] where the learnable attention matrices are forced to be doubly stochastic using Sinkhorn's algorithm [26]. They consider the case where the attention blocks have tied weights between layers and show theoretically that, in the infinite-depth limit, Sinkformers correspond to a Wasserstein gradient flow.…”
Section: Related Work (mentioning; confidence: 99%)