A distributed block coordinate descent method for training $l_1$ regularized linear classifiers

Mahajan, Dhruv; Keerthi, S. Sathiya; Sundararajan, S.

doi:10.48550/arxiv.1405.4544

Cited by 4 publications

(5 citation statements)

References 21 publications

(69 reference statements)


“…Training algorithms that can be distributed across multiple machines have been the subject of a significant amount of research. Distributed techniques based on stochastic gradient descent have been proposed (see [17] and [18]) as well as methods based on coordinate descent/ascent (see [7], [19], [20], [21] and [22]). These distributed learning algorithms typically involve each machine (or worker) performing a number of optimization steps to approximately minimize the global objective function using the local data that it has available.…”

Section: Distributed Stochastic Learning (mentioning)

confidence: 99%

Large-scale stochastic learning using GPUs

Parnell, Duenner, Atasu, et al. 2017

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


In this work we propose an accelerated stochastic learning system for very large-scale applications. Acceleration is achieved by mapping the training algorithm onto massively parallel processors: we demonstrate a parallel, asynchronous GPU implementation of the widely used stochastic coordinate descent/ascent algorithm that can provide up to 35× speedup over a sequential CPU implementation. In order to train on very large datasets that do not fit inside the memory of a single GPU, we then consider techniques for distributed stochastic learning. We propose a novel method for optimally aggregating model updates from worker nodes when the training data is distributed either by example or by feature. Using this technique, we demonstrate that one can scale out stochastic learning across up to 8 worker nodes without any significant loss of training time. Finally, we combine GPU acceleration with the optimized distributed method to train on a dataset consisting of 200 million training examples and 75 million features. We show by scaling out across 4 GPUs, one can attain a high degree of training accuracy in around 4 seconds: a 20× speed-up in training time compared to a multi-threaded, distributed implementation across 4 CPUs.


“…(Lee and Roth, 2015) derived an analytical solution of the optimal step size for dual linear support vector machine problems. Besides, (Mahajan et al, 2013) presented a general framework for distributed optimization based on local functional approximation, which includes several first-order and second-order methods as special cases, and (Mahajan et al, 2014) considered each machine to handle a block of coordinates, and proposed distributed block coordinate descent methods for solving ℓ1 regularized loss minimization problems.…”

Section: Related Work (mentioning)

confidence: 99%

“…A major challenge is to reduce the training time as much as possible when we increase the number of machines. A practical solution requires two research directions: one is to improve the underlying system design making it suitable for machine learning algorithms (Dean and Ghemawat, 2008; Zaharia et al, 2012; Dean et al, 2012; Li et al, 2014); the other is to adapt traditional single-machine optimization methods to handle data parallelism (Boyd et al, 2011; Yang, 2013; Mahajan et al, 2013; Shamir et al, 2014; Jaggi et al, 2014; Mahajan et al, 2014; Ma et al, 2017; Takáč et al, 2015; Zhang and Lin, 2015). This paper focuses on the latter.…”

Section: Introduction (mentioning)

confidence: 99%

A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization

Zheng, Wang, Xia, et al. 2016

Preprint


In modern large-scale machine learning applications, the training data are often partitioned and stored on multiple machines. It is customary to employ the "data parallelism" approach, where the aggregated training loss is minimized without moving data across machines. In this paper, we introduce a novel distributed dual formulation for regularized loss minimization problems that can directly handle data parallelism in the distributed setting. This formulation allows us to systematically derive dual coordinate optimization procedures, which we refer to as Distributed Alternating Dual Maximization (DADM). The framework extends earlier studies described in (Boyd et al., 2011; Ma et al., 2017; Jaggi et al., 2014; Yang, 2013) and has rigorous theoretical analyses. Moreover, with the help of the new formulation, we develop the accelerated version of DADM (Acc-DADM) by generalizing the acceleration technique from (Shalev-Shwartz and Zhang, 2014) to the distributed setting. We also provide theoretical results for the proposed accelerated version and the new result improves previous ones (Yang, 2013; Ma et al., 2017) whose iteration complexities grow linearly on the condition number. Our empirical studies validate our theory and show that our accelerated approach significantly improves the previous state-of-the-art distributed dual coordinate optimization algorithms.


“…Allen-Zhu and Yuan [2015a] further improve the convergence speed using a novel nonuniform sampling that selects each coordinate with a probability proportional to the square root of the smoothness parameter. Other acceleration techniques Qu and Richtárik [2014], Nesterov [2012], as well as mini-batch and distributed variants on coordinate method Liu and Wright [2015], Zhao et al [2014], Jaggi et al [2014], Mahajan et al [2014] have been studied in literature. See for a review on the coordinate method.…”

Section: Introduction (mentioning)

confidence: 99%

Linear convergence of SDCA in statistical estimation

Qu, 2017

Preprint


In this paper, we consider stochastic dual coordinate ascent (SDCA) without a strong convexity assumption or even a convexity assumption. We show that SDCA converges linearly under a mild condition termed restricted strong convexity. This covers a wide array of popular statistical models, including Lasso, group Lasso, logistic regression with ℓ1 regularization, corrected Lasso, and linear regression with the SCAD regularizer. This significantly improves previous convergence results on SDCA for problems that are not strongly convex. As a by-product, we derive a dual-free form of SDCA that can handle a general regularization term, which is of interest in itself.

