2021
DOI: 10.48550/arxiv.2110.11773
Preprint

Sinkformers: Transformers with Doubly Stochastic Attention

Abstract: Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer. We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices…
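
To make the idea concrete, the following is a minimal NumPy sketch (an illustration under stated assumptions, not the authors' released implementation) contrasting the usual row-wise SoftMax, which only makes each row of the attention matrix sum to 1, with Sinkhorn's algorithm, which alternates row and column normalizations so that a square attention matrix approaches a doubly stochastic one. The function names, the fixed iteration count, and the toy 4-token example are illustrative assumptions.

    import numpy as np

    def softmax_rows(scores):
        # Row-wise SoftMax: every row of the resulting attention matrix sums to 1.
        z = scores - scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def sinkhorn_attention(scores, n_iters=50, eps=1e-9):
        # Sinkhorn normalization: alternately rescale rows and columns of exp(scores).
        # For a square matrix with positive entries this converges to a doubly
        # stochastic matrix (rows and columns each sum to 1).
        K = np.exp(scores - scores.max())  # positive kernel, numerically stabilized
        for _ in range(n_iters):
            K = K / (K.sum(axis=-1, keepdims=True) + eps)  # normalize rows
            K = K / (K.sum(axis=-2, keepdims=True) + eps)  # normalize columns
        return K

    rng = np.random.default_rng(0)
    scores = rng.normal(size=(4, 4))               # toy query-key dot products for 4 tokens
    A_soft = softmax_rows(scores)                  # row-stochastic only
    A_sink = sinkhorn_attention(scores)            # approximately doubly stochastic
    print(A_soft.sum(axis=1))                      # rows sum to 1, columns generally do not
    print(A_sink.sum(axis=1), A_sink.sum(axis=0))  # both row and column sums are close to 1

In the Sinkformer, this kind of normalization replaces the SoftMax inside each attention head; the sketch above only shows the normalization step itself.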

Cited by 3 publications (3 citation statements)
References 27 publications
“…Recently, some attempts have been made to design neural networks to imitate the Sinkhorn-based algorithms of OT problems, such as the Gumbel-Sinkhorn network [55], the sparse Sinkhorn attention model [56], the Sinkhorn autoencoder [57], and the Sinkhorn-based transformer [58]. Focusing on pooling layers, some OT-based solutions have been proposed as well.…”
Section: Optimal Transport-based Machine Learning (mentioning; confidence: 99%)
“…Besides the Sinkhorn algorithm, some other algorithms have been developed, e.g., the Bregman ADMM (Wang & Banerjee, 2014; Ye et al., 2017; Xu, 2020) and the smoothed semi-dual algorithm (Blondel et al., 2018). More recently, some attempts have been made to design neural networks to imitate the Sinkhorn-based algorithms of OT problems, e.g., the Gumbel-Sinkhorn network (Mena et al., 2018), the sparse Sinkhorn attention model (Tay et al., 2020), the Sinkhorn autoencoder (Patrini et al., 2020), and the Sinkhorn-based transformer (Sander et al., 2021). However, these methods ignore the potential of other algorithms.…”
Section: Related Work (mentioning; confidence: 99%)
“…Finally, we would like to mention that more recently in [24] the authors propose Sinkformers, a variation of the transformer architecture [25] where the learnable attention matrices are forced to be doubly stochastic using Sinkhorn's algorithm [26]. They consider the case where the attention blocks have tied weights between layers and show theoretically that, in the infinite-depth limit, Sinkformers correspond to a Wasserstein gradient flow.…”
Section: Related Work (mentioning; confidence: 99%)