2018
DOI: 10.48550/arxiv.1805.11897
Preprint

Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance

Abstract: Applications of optimal transport have recently gained remarkable attention thanks to the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation of the Wasserstein distance is replaced by a regularized version that is less accurate but easy to differentiate. In this work we characterize the differential properties of the original Sinkhorn distance, proving that it enjoys the same smoothness as its regularized version and we explicitly provide an efficient algorithm…

Cited by 9 publications (13 citation statements)
References 16 publications
“…The loss function is a custom implementation in TensorFlow of a sharp Sinkhorn [28] using ε-scaling [19,29,30]. In principle symbolic differentiation should be effective for this problem, however I encountered debilitating numerical instabilities that I was unable to diagnose.…”
Section: A Appendix (mentioning, confidence: 99%)
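For context, the "sharp" Sinkhorn referred to above is the transport cost ⟨T, C⟩ evaluated at the entropic plan, as opposed to the regularized objective that also includes the entropy penalty. A minimal NumPy sketch contrasting the two values (the function name and the entropy convention are our own choices, not the cited implementation):

```python
import numpy as np

def sharp_and_regularized_values(T, C, eps):
    """Given a strictly positive entropic plan T for cost matrix C, return the
    'sharp' value <T, C> (transport cost of the plan, no entropy term) and a
    regularized value <T, C> + eps * <T, log T - 1>. The exact entropy
    convention differs between papers; this one is shown only for contrast."""
    sharp = float(np.sum(T * C))
    regularized = sharp + eps * float(np.sum(T * (np.log(T) - 1.0)))
    return sharp, regularized
```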
“…In principle symbolic differentiation should be effective for this problem, however I encountered debilitating numerical instabilities that I was unable to diagnose. I therefore implemented the explicit gradient introduced in [28]. The Sinkhorn distance is calculated with regulator scaling from 1 to 0.01 in ten log-uniform steps, with ten iterations per step.…”
Section: A Appendix (mentioning, confidence: 99%)
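One common way to realize the recipe quoted above is ε-scaling: run Sinkhorn on a decreasing ladder of regularizers, warm-starting the dual potentials at each step. A log-domain NumPy sketch under those assumptions (from 1 to 0.01 in ten log-uniform steps, ten iterations per step); this is an illustration only, not the cited TensorFlow code:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_eps_scaling(C, a, b, eps_start=1.0, eps_end=0.01,
                         n_steps=10, iters_per_step=10):
    """Log-domain Sinkhorn on a log-uniform ladder of regularizers.
    a, b are strictly positive marginal weights; C is the cost matrix.
    The dual potentials f, g are warm-started across steps."""
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    log_a, log_b = np.log(a), np.log(b)
    for eps in np.geomspace(eps_start, eps_end, n_steps):
        for _ in range(iters_per_step):
            # alternate dual updates enforcing the row/column marginals
            f = eps * (log_a - logsumexp((g[None, :] - C) / eps, axis=1))
            g = eps * (log_b - logsumexp((f[:, None] - C) / eps, axis=0))
    T = np.exp((f[:, None] + g[None, :] - C) / eps)   # transport plan
    return np.sum(T * C), T                           # sharp value, plan
```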
“…The Jacobian of an optimization problem solution can also be computed using the implicit function theorem (Griewank and Walther, 2008; Krantz and Parks, 2012; Blondel et al., 2021) instead of backpropagation if the number of iterations becomes a memory bottleneck. Together with Sinkhorn, implicit differentiation has been used by Luise et al. (2018) and Cuturi et al. (2020).…”
Section: Sinkformers (mentioning, confidence: 99%)
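The implicit-function-theorem route mentioned here differentiates the converged fixed point directly, so no intermediate iterates need to be stored. A toy scalar example (a stand-in fixed-point map, not the Sinkhorn case itself) comparing the implicit gradient with finite differences:

```python
import numpy as np

def fixed_point(theta, n_iter=200):
    """Solve x = F(x, theta) with F(x, theta) = 0.5*cos(x) + theta
    by plain fixed-point iteration (toy stand-in for a Sinkhorn loop)."""
    x = 0.0
    for _ in range(n_iter):
        x = 0.5 * np.cos(x) + theta
    return x

theta = 0.3
x_star = fixed_point(theta)

# Implicit function theorem on x* = F(x*, theta):
#   dx*/dtheta = (dF/dtheta) / (1 - dF/dx), evaluated at the solution.
dF_dx = -0.5 * np.sin(x_star)
dF_dtheta = 1.0
grad_ift = dF_dtheta / (1.0 - dF_dx)

# Finite-difference check: no backprop through the iterations is needed.
h = 1e-6
grad_fd = (fixed_point(theta + h) - fixed_point(theta - h)) / (2 * h)
print(grad_ift, grad_fd)   # the two should closely agree
```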
“…And c(x_i, y_j) is the cost function evaluating the distance between x_i and y_j (samples of the two distributions). Computing the optimal distance (1st line) is equivalent to solving the network-flow problem (2nd line) [17]. The calculated matrix T denotes the "transport plan", where each element T_ij represents the amount of mass shifted from u_i to v_j.…”
Section: Preliminaries (mentioning, confidence: 99%)
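To make the notation concrete, here is a small NumPy sketch that builds a cost matrix C_ij = c(x_i, y_j) from two point clouds, runs a few Sinkhorn iterations, and checks that the resulting plan T approximately has marginals u and v. The sample sizes, the squared-Euclidean cost, and ε are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))       # samples x_i of the first distribution
Y = rng.normal(size=(4, 2))       # samples y_j of the second distribution

# Cost matrix C_ij = c(x_i, y_j); squared Euclidean distance is a common choice.
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

u = np.full(5, 1 / 5)             # marginal weights u_i
v = np.full(4, 1 / 4)             # marginal weights v_j

# Entropic transport plan via Sinkhorn iterations (sketch).
eps = 0.5
K = np.exp(-C / eps)
s = np.ones(5)
for _ in range(1000):
    t = v / (K.T @ s)
    s = u / (K @ t)
T = s[:, None] * K * t[None, :]

# T_ij is the mass moved from u_i to v_j; its marginals approximately recover u and v.
print("row-marginal error:", np.abs(T.sum(axis=1) - u).max())
print("col-marginal error:", np.abs(T.sum(axis=0) - v).max())
print("transport cost <T, C>:", np.sum(T * C))
```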