2019
DOI: 10.48550/arxiv.1910.12430
Preprint

Differentiable Convex Optimization Layers

Akshay Agrawal, Brandon Amos, Shane Barratt, et al.

Abstract: Recent work has shown how to embed differentiable optimization problems (that is, problems whose solutions can be backpropagated through) as layers within deep learning architectures. This method provides a useful inductive bias for certain problems, but existing software for differentiable optimization layers is rigid and difficult to apply to new settings. In this paper, we propose an approach to differentiating through disciplined convex programs, a subclass of convex optimization problems used by domain-specific languages (DSLs) for convex optimization. …
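The approach described in the abstract is implemented in the authors' cvxpylayers library, which turns a disciplined parametrized CVXPY problem into a differentiable PyTorch layer. Below is a minimal sketch of how such a layer might be constructed and backpropagated through; the problem, sizes, and data are illustrative, and it assumes the `cvxpy`, `cvxpylayers`, and `torch` packages are installed.

```python
# Minimal sketch: a small parametrized convex problem wrapped as a
# differentiable PyTorch layer with cvxpylayers.
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n, m = 2, 3

# Declare the problem symbolically: A and b are parameters, x is the variable.
x = cp.Variable(n)
A = cp.Parameter((m, n))
b = cp.Parameter(m)
objective = cp.Minimize(cp.pnorm(A @ x - b, p=1))
constraints = [x >= 0]
problem = cp.Problem(objective, constraints)
assert problem.is_dpp()  # disciplined parametrized program check

# Wrap the problem as a layer mapping (A, b) -> solution x*.
layer = CvxpyLayer(problem, parameters=[A, b], variables=[x])

# Forward pass solves the problem; backward pass differentiates the solution
# with respect to the parameter values.
A_t = torch.randn(m, n, requires_grad=True)
b_t = torch.randn(m, requires_grad=True)
(x_star,) = layer(A_t, b_t)
x_star.sum().backward()  # gradients flow back into A_t and b_t
```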

Cited by 25 publications (46 citation statements)
References 47 publications (65 reference statements)
“…A typical benefit of implicit models is that the iterates x_i do not need to be stored during the forward pass of the network, because gradients can be calculated using the implicit function theorem: it bypasses the memory storage issue of GPUs (Wang et al., 2018; Peng et al., 2017; Zhu et al., 2017) during automatic differentiation. Another application is to consider neural architectures that include an argmin layer, for which the output is also formulated as the solution of a nested optimization problem (Agrawal et al., 2019; Gould et al., 2016, 2019).…”
Section: Attention and Gradient Flows
confidence: 99%
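The excerpt above describes differentiating an argmin layer with the implicit function theorem instead of backpropagating through solver iterates. A minimal sketch of that idea for a hypothetical quadratic argmin layer in PyTorch follows; the class `ArgminLayer` and the fixed matrix `A` are illustrative and not taken from the cited works.

```python
# Sketch: an argmin layer x*(theta) = argmin_x 0.5 x^T A x - theta^T x,
# differentiated via the implicit function theorem. The forward solver's
# iterates need not be stored; backward only uses the optimality condition
# A x* - theta = 0.
import torch

A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])  # fixed positive-definite matrix


class ArgminLayer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, theta):
        # Any iterative solver could sit here; we solve the toy problem directly.
        return torch.linalg.solve(A, theta)

    @staticmethod
    def backward(ctx, grad_out):
        # Implicit function theorem: with F(x, theta) = A x - theta = 0,
        # dx*/dtheta = A^{-1}, so the vector-Jacobian product is A^{-T} grad_out.
        return torch.linalg.solve(A.T, grad_out)


theta = torch.tensor([1.0, -2.0], requires_grad=True)
x_star = ArgminLayer.apply(theta)
x_star.sum().backward()
print(theta.grad)  # equals A^{-T} @ ones
```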
“…Also related are works that seek to enforce constraints on learning problems [115]. While several heuristic algorithms exist for this setting, many focus on restricted classes of constraints [116]-[120], and those that can handle more general constraints come at the cost of added computational complexity [121, 122]. Moreover, each of these works seeks to enforce constraints on a particular parameterization of the learning problem (such as directly on the weights of a neural network) rather than on the underlying statistical problem, as we do in this paper.…”
Section: Further Related Work
confidence: 99%
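The constraint-enforcement heuristics referenced above typically act directly on a network's parameterization. A minimal sketch of one such restricted-constraint heuristic, projecting a weight matrix back onto a norm ball after each gradient step, is shown below; it is entirely illustrative and not taken from references [115]-[122].

```python
# Sketch: keep a weight matrix inside a Frobenius-norm ball by projecting
# after each gradient step (a simple projection heuristic on the weights).
import torch

torch.manual_seed(0)
W = torch.randn(4, 4, requires_grad=True)
X, y = torch.randn(32, 4), torch.randn(32, 4)
radius, lr = 1.0, 0.1

for _ in range(100):
    loss = ((X @ W - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        # Projection step: rescale W onto the constraint set if it left it.
        norm = W.norm()
        if norm > radius:
            W *= radius / norm
    W.grad = None

print(W.norm())  # <= radius
```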
“…Challenges in Batch Optimization: Recently, there has been strong interest in solving several instances of a convex quadratic program (QP) in parallel [15], [16]. The core innovation in [15] lies in rewriting the underlying matrix algebra so that matrices that do not change with the batch index can be isolated and their factorization pre-stored.…”
Section: Connections To Sampling-Based Trajectory Optimization
confidence: 99%
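The batching trick described above, isolating matrices that do not change with the batch index and pre-storing their factorization, can be sketched for equality-constrained QPs as follows. This is an illustrative NumPy/SciPy sketch, not the implementation of [15] or [16].

```python
# Sketch: for a batch of equality-constrained QPs
#   minimize 0.5 x^T Q x + q_i^T x  subject to  A x = b_i,
# the KKT matrix [[Q, A^T], [A, 0]] does not depend on the batch index i,
# so it is factorized once and reused for every right-hand side.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n, m, batch = 10, 3, 64
rng = np.random.default_rng(0)

Q = np.eye(n)                         # shared, batch-invariant
A = rng.standard_normal((m, n))       # shared, batch-invariant
q = rng.standard_normal((batch, n))   # varies with the batch index
b = rng.standard_normal((batch, m))   # varies with the batch index

# Factorize the shared KKT matrix once.
kkt = np.block([[Q, A.T], [A, np.zeros((m, m))]])
factor = lu_factor(kkt)

# Solve all batch instances by reusing the stored factorization.
rhs = np.concatenate([-q, b], axis=1).T   # shape (n + m, batch)
sols = lu_solve(factor, rhs)              # shape (n + m, batch)
x = sols[:n].T                            # primal solutions, shape (batch, n)
```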