2020
DOI: 10.48550/arxiv.2006.07990
Preprint

Optimal Lottery Tickets via SubsetSum: Logarithmic Over-Parameterization is Sufficient

Ankit Pensia,
Shashank Rajput,
Alliot Nagle
et al.

Abstract: The strong lottery ticket hypothesis (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. [1] establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width $d$ and depth $\ell$ by pruning a random one that is a factor $O(d^4 \ell^2)$ wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experiment…
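
The subset-sum reduction at the heart of the paper is easy to see on a single weight. The sketch below is illustrative only, not the authors' code; the function name, sample count, and target value are assumptions for the demo. It brute-forces a subset of n random weights whose sum approximates a target weight; the paper's analysis shows that n = O(log(1/ε)) such samples suffice with high probability to land within ε of any target in [-1, 1]. Pruning the over-parameterized network then amounts to keeping exactly the weights in that subset.

```python
import itertools
import random

def best_subset_sum(samples, target):
    """Brute-force the subset of `samples` whose sum is closest to `target`.

    Exponential in len(samples); only meant for the small n used here.
    """
    best_subset, best_err = (), abs(target)  # baseline: the empty subset
    for r in range(1, len(samples) + 1):
        for subset in itertools.combinations(samples, r):
            err = abs(sum(subset) - target)
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err

random.seed(0)
n = 16  # the paper's analysis needs only O(log(1/eps)) random samples
samples = [random.uniform(-1, 1) for _ in range(n)]
target = 0.731  # an arbitrary "weight" in [-1, 1] to approximate

subset, err = best_subset_sum(samples, target)
print(f"approximated {target:+.3f} with error {err:.1e} using {len(subset)} of {n} samples")
```

The brute-force search is for illustration only: the strong LTH result needs the existence of a good subset inside the random network, not an efficient procedure for finding one.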

Cited by 6 publications (6 citation statements)
References 22 publications
“…Lottery tickets Frankle & Carbin [30] are a set of small sub-networks derived from a larger dense network, which outperform their parent networks in convergence speed and potentially in generalization. A large number of studies have analyzed these tickets both empirically and theoretically: Morcos et al. [75] proposed using a single generalized lottery ticket for all vision benchmarks and obtained results comparable to the specialized lottery tickets; Frankle et al. [31] improve the stability of lottery tickets via iterative pruning; Frankle et al. [32] found that subnetworks reach full accuracy only if they are stable against SGD noise during training; Orseau et al. [78] provide a logarithmic upper bound on the number of parameters needed for the optimal sub-networks to exist; Pensia et al. [81] suggest a way to construct the lottery ticket by solving the subset sum problem, which constitutes a proof by construction of the strong lottery ticket hypothesis. Furthermore, follow-up works [68, 102, 96] show that we can find tickets without any training labels.…”
Section: A Extended Related Work
confidence: 99%
“…Provable Pruning. Empirical pruning research inspired the development of theoretical foundations for network pruning, including sensitivity-based analysis, coreset methodologies (Mussay et al., 2019; Baykal et al., 2018), and pruning analyses of random networks (Malach et al., 2020; Orseau et al., 2020; Pensia et al., 2020; Ramanujan et al., 2019). Later work analyzed the generalization of pruned networks (Zhang et al., 2021) and the amount of dense-network pre-training needed to obtain high-performing sub-networks (Wolfe et al., 2021).…”
Section: Related Work
confidence: 99%
“…Lottery tickets [Frankle and Carbin, 2018] are a set of small sub-networks derived from a larger dense network, which outperform their parent networks. Many insightful studies [Morcos et al., 2019, Orseau et al., 2020, Frankle et al., 2019, 2020, Malach et al., 2020, Pensia et al., 2020] have analyzed these tickets, but it remains difficult to generalize to large models due to training cost. Toward this end, follow-up works [Liu and Zenke, 2020, Tanaka et al., 2020] show that one can find tickets without training labels.…”
Section: Related Work
confidence: 99%
“…Lottery tickets Frankle and Carbin [2018] are a set of small sub-networks derived from a larger dense network, which outperform their parent networks in convergence speed and potentially in generalization. A large number of studies have analyzed these tickets both empirically and theoretically: Morcos et al. [2019] proposed using a single generalized lottery ticket for all vision benchmarks and obtained results comparable to the specialized lottery tickets; Frankle et al. [2019] improve the stability of lottery tickets via iterative pruning; Frankle et al. [2020] found that subnetworks reach full accuracy only if they are stable against SGD noise during training; Orseau et al. [2020] provide a logarithmic upper bound on the number of parameters needed for the optimal sub-networks to exist; Pensia et al. [2020] suggest a way to construct the lottery ticket by solving the subset sum problem, which constitutes a proof by construction of the strong lottery ticket hypothesis. Furthermore, follow-up works [Liu and Zenke, 2020, Tanaka et al., 2020] show that we can find tickets without any training labels.…”
Section: M2 Lottery Ticket Hypothesis
confidence: 99%