2019
DOI: 10.48550/arxiv.1911.07971
Preprint

vqSGD: Vector Quantized Stochastic Gradient Descent

Abstract: In this work, we present a family of vector quantization schemes vqSGD (Vector-Quantized Stochastic Gradient Descent) that provide asymptotic reduction in the communication cost with convergence guarantees in distributed computation and learning settings. In particular, we consider a randomized scheme, based on the convex hull of a point set, that returns an unbiased estimator of a d-dimensional gradient vector with bounded variance. We provide multiple efficient instances of our scheme that require only O(log d) bits …
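As a rough, hedged illustration of the scheme described in the abstract, the sketch below assumes the point set is the scaled cross-polytope {+-sqrt(d)*e_i} (one possible instance, not necessarily the paper's exact construction; all function names here are invented for this sketch). The gradient direction is written as a convex combination of the 2d vertices, and a single vertex is sampled with probability equal to its convex coefficient, so the decoded point is an unbiased estimator and the message is one index out of 2d points (about log2(2d) bits) plus a scalar for the norm.

import numpy as np

def vq_encode(g, rng=np.random.default_rng()):
    """Illustrative vqSGD-style encoder (assumed cross-polytope point set):
    returns (vertex index, sign, gradient norm)."""
    d = g.shape[0]
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return 0, 1, 0.0
    v = g / norm                       # unit vector, so ||v||_1 <= sqrt(d)
    c = np.sqrt(d)                     # vertices of the point set are +-c*e_i
    signed_mass = np.abs(v) / c        # weight on the vertex matching sign(v_i)
    slack = (1.0 - signed_mass.sum()) / (2 * d)   # leftover mass, spread evenly
    probs = np.concatenate([signed_mass * (v > 0) + slack,
                            signed_mass * (v < 0) + slack])
    k = rng.choice(2 * d, p=probs)     # sample one vertex by its convex weight
    return k % d, (1 if k < d else -1), norm

def vq_decode(index, sign, norm, d):
    """Unbiased estimate of g: the sampled vertex, rescaled by ||g||."""
    est = np.zeros(d)
    est[index] = sign * np.sqrt(d) * norm
    return est

In a distributed setting, averaging the decoded estimates from many workers concentrates around the true mean gradient, which is the regime the abstract targets.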

Cited by 7 publications (15 citation statements) | References 14 publications
“…Works [33,10,24,17] consider empirical mean estimation without assuming any statistical distribution on the data, and quantize the vectors to a small number of bits. There is a significant recent interest in considering the communication-efficient (empirical) mean estimation problem in the context of distributed stochastic gradient descent, see e.g., [1,2,12,5,37,36,34,20,4,23,6,14,27]. These works can broadly be partitioned into three categories: (i) Quantization: encoding each element of the vectors to a small number of bits [18,1,12,5,37,20,23,27], (ii) Sparsification: sending only a subset of elements of the vectors [2,34,36].…”
Section: Related Work
confidence: 99%
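To make categories (i) and (ii) from the quoted passage concrete, here is a generic, hedged sketch (illustrative only, not reproducing any specific cited scheme; function names and parameters are assumptions): per-coordinate stochastic quantization onto a small grid, and top-k sparsification that transmits only the largest-magnitude coordinates.

import numpy as np

def elementwise_quantize(g, levels=4, rng=np.random.default_rng()):
    """Category (i) quantization: stochastically round each coordinate onto a
    uniform grid with `levels` points, so each entry needs only a few bits;
    stochastic rounding keeps the decoded value unbiased given the scale."""
    scale = np.max(np.abs(g))
    if scale == 0.0:
        return g.copy()
    x = (g / scale + 1.0) / 2.0 * (levels - 1)      # map to [0, levels-1]
    low = np.floor(x)
    q = low + (rng.random(g.shape) < (x - low))     # round up w.p. frac part
    return (q / (levels - 1) * 2.0 - 1.0) * scale   # decoded value

def topk_sparsify(g, k):
    """Category (ii) sparsification: keep only the k largest-magnitude
    coordinates and zero out the rest (indices and values are transmitted)."""
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out = np.zeros_like(g)
    out[idx] = g[idx]
    return out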
“…When designing a first-order optimization algorithm under local information constraints, one not only needs to design the optimization algorithm itself, but also the algorithm for local processing of the gradient estimates. Many such algorithms have been proposed in recent years; see, for instance, [DJW14], [ACGMMTZ16], [ASYKM18], [GKMM19], [SVK20], [GDDKS20], and the references therein for privacy constraints; [SFDLY14], [AGLTV17], [SYKM17], [KR18], [FTMARRK20], [RKFR19], [LKH20], [ADSFS19], [CKÖ20], [HHWY19], [MT20b], [MT20a], [SSR20], and the references therein for communication constraints; [Nes13,RT12] for computational constraints. However, these algorithms primarily consider nonadaptive procedures for gradient processing (with the exception of [FTMARRK20]): that is, the scheme used to process the gradients at any iteration cannot depend on the information gleaned from previous iterations.…”
Section: Introduction
confidence: 99%
“…We remark that while our quantizers are related to the ones used in prior works, our main contribution is to show that our specific design choices yield optimal precision. For instance, the quantizers in [11] express the input as a convex combination of a set of points, similar to SimQ. In fact, one of the quantizers in [11] uses a similar set of points to that of SimQ with a different scaling.…”
Section: Introduction
confidence: 99%
“…For instance, the quantizers in [11] express the input as a convex combination of a set of points, similar to SimQ. In fact, one of the quantizers in [11] uses a similar set of points to that of SimQ with a different scaling. However, the quantizers in [11] are designed with other objectives in mind, and they fall short of attaining the optimal precision guarantees of SimQ and SimQ+.…”
Section: Introduction
confidence: 99%