2021
DOI: 10.48550/arxiv.2110.01594
Preprint

A Stochastic Proximal Gradient Framework for Decentralized Non-Convex Composite Optimization: Topology-Independent Sample Complexity and Communication Efficiency

Ran Xin, Subhro Das, Usman A. Khan, et al.

Abstract: Decentralized optimization is a promising parallel computation paradigm for large-scale data analytics and machine learning problems defined over a network of nodes. This paper is concerned with decentralized non-convex composite problems with population or empirical risk. In particular, the networked nodes are tasked to find an approximate stationary point of the average of local, smooth, possibly non-convex risk functions plus a possibly non-differentiable extended valued convex regularizer. Under this gener…
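In symbols, the problem class described in the abstract is the composite program below (a sketch inferred from the abstract; the paper's own notation may differ):

```latex
% n networked nodes; f_i is the smooth, possibly non-convex local risk
% (population or empirical) at node i; r is the shared convex, possibly
% non-differentiable, extended-valued regularizer.
\[
  \min_{x \in \mathbb{R}^d} \; F(x) \;:=\; \frac{1}{n}\sum_{i=1}^{n} f_i(x) \;+\; r(x).
\]
```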

Cited by 8 publications (20 citation statements)
References 49 publications
“…To measure the non-stationarity in Problem (2), one should not only consider the stationarity violation at each node but also the consensus errors over the network. Therefore, Xin et al [2021a] and Mancino-Ball et al [2022] define an ε-stationary point…”
Section: Notion of Stationarity
confidence: 99%
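For context, the ε-stationarity notion these papers use combines the two error sources just mentioned: stationarity violation at the average iterate and consensus error across the network. A hedged sketch (exact normalizations in Xin et al [2021a] may differ), with x̄ the network average and G_λ the proximal gradient mapping:

```latex
\[
  \bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
  \mathcal{G}_{\lambda}(\bar{x}) := \frac{1}{\lambda}\Big(\bar{x}
    - \mathrm{prox}_{\lambda r}\big(\bar{x} - \lambda \nabla f(\bar{x})\big)\Big),
\]
% epsilon-stationarity: the stationarity violation at the average iterate
% plus the mean consensus error are both driven below epsilon^2.
\[
  \mathbb{E}\big[\|\mathcal{G}_{\lambda}(\bar{x})\|^2\big]
  + \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[\|x_i - \bar{x}\|^2\big]
  \;\le\; \epsilon^2 .
\]
```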
“…Wang et al [2021] propose SPPDM, which uses a proximal primal-dual approach to achieve O(ε^{-2}) sample complexity. ProxGT-SA and ProxGT-SR-O [Xin et al, 2021a] incorporate stochastic gradient tracking and multi-consensus updates in proximal gradient methods and obtain O(n^{-1}ε^{-2}) and O(n^{-1}ε^{-1.5}) sample complexity respectively, where the latter further uses a SARAH-type variance reduction.…”
[Table 1 of the citing paper: Comparison of decentralized proximal gradient based algorithms to find an ε-stationary solution to stochastic composite optimization in the nonconvex setting. The sample complexity is defined as the number of required samples per agent to obtain an ε-stationary point (see Definition 1).]
Section: Introduction
confidence: 99%
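To make the ProxGT template concrete, here is a minimal single-mixing-round sketch in Python. This is a hedged illustration, not the paper's code: the actual ProxGT-SA performs multiple consensus rounds per update, the exact ordering of the mixing, descent, and prox steps may differ, and `soft_threshold`, `stoch_grad`, and the l1 choice of the regularizer are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Prox of tau*||.||_1, used here as an illustrative convex term r."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proxgt_step(X, Y, G_prev, W, stoch_grad, alpha, tau):
    """One sketched iteration of a proximal stochastic gradient tracking
    method. X, Y, G_prev: (n, d) arrays of iterates, gradient trackers,
    and previous stochastic gradients (row i belongs to node i);
    W: (n, n) doubly stochastic mixing matrix; stoch_grad: callable
    returning (n, d) minibatch gradients; alpha: step size; tau: l1 weight.
    """
    # Descend along the tracked direction, mix with neighbors, then prox.
    X_new = soft_threshold(W @ (X - alpha * Y), alpha * tau)
    # Gradient tracking: mix the trackers and correct with the fresh-minus-old
    # local stochastic gradients, so the row-average of Y follows the
    # network-average stochastic gradient.
    G_new = stoch_grad(X_new)
    Y_new = W @ Y + G_new - G_prev
    return X_new, Y_new, G_new
```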
“…To avoid this inaccuracy while keeping linear convergence, recent works [28], [29], [35]–[40] propose a gradient tracking technique that allows each node to estimate the global gradient with only local communications. Of note are also distributed stochastic problems where gradient tracking is combined with variance reduction to achieve state-of-the-art results for several different classes of problems [38], [41]–[46].…”
Section: A. Related Work
confidence: 99%
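The tracking recursion referenced in [28], [29], [35]–[40] typically takes the following form (a sketch; weights and update orderings vary across these works):

```latex
% Node i mixes its tracker y_i with its neighbors' trackers and corrects
% with the change in its local gradient. With a doubly stochastic W this
% preserves (1/n) sum_i y_i^k = (1/n) sum_i \nabla f_i(x_i^k) at every k,
% so each tracker estimates the global gradient from local communication.
\[
  y_i^{k+1} = \sum_{j \in \mathcal{N}_i} w_{ij}\, y_j^{k}
  + \nabla f_i\big(x_i^{k+1}\big) - \nabla f_i\big(x_i^{k}\big),
  \qquad y_i^{0} = \nabla f_i\big(x_i^{0}\big).
\]
```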
“…When the network topology is sparse (e.g., a ring or a one-peer exponential graph [3], [33]), each partial averaging step (5) incurs O(1) latency and O(1) transmission time (the inverse of bandwidth), which are independent of n. Since each node only synchronizes with its direct neighbors, there is low synchronization overhead. Although partial averaging is less effective in aggregating information than global averaging, some decentralized algorithms can match or exceed the performance of global-averaging-based distributed algorithms: [1], [29] established that decentralized SGD can achieve the same asymptotic linear speedup in convergence rate as (parameter-server-based) distributed SGD; [3], [33] used exponential graph topologies to realize both efficient communication and effective aggregation by partial averaging; [37], [38], [31], [39] improved the convergence rate of decentralized SGD by removing data heterogeneity between nodes; [40], [4], [30], [41] enhanced the effectiveness of partial averaging by periodically calling global averaging. BlueFog can implement all these algorithms, including those that use global averaging.…”
Section: A. Concepts and Theoretical Foundations
confidence: 99%
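As an illustration of why a one-peer exponential graph gives O(1) per-round cost, here is a sketched partial-averaging round in Python (following the construction commonly attributed to [3], [33]; it assumes n is a power of two, and the 0.5/0.5 mixing weights are an illustrative choice):

```python
import numpy as np

def one_peer_exp_average(X, k):
    """One partial-averaging round on a one-peer exponential graph.
    X: (n, d) array, row i is node i's parameters; k: iteration counter.
    Each node exchanges with a single peer at hop distance 2^(k mod log2 n),
    so every round costs O(1) messages per node, independent of n.
    """
    n = X.shape[0]                       # assumed to be a power of two
    hop = 2 ** (k % int(np.log2(n)))     # cycle through hops 1, 2, 4, ..., n/2
    peer = (np.arange(n) + hop) % n      # the single peer of each node
    # Pairwise 0.5/0.5 averaging; the induced mixing matrix is doubly
    # stochastic since each node sends to and receives from exactly one peer.
    return 0.5 * (X + X[peer])
```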