2015
DOI: 10.1007/978-3-319-26123-2_24

Distributed Coordinate Descent for L1-regularized Logistic Regression

Abstract: Solving logistic regression with L1-regularization in distributed settings is an important problem. It arises when the training dataset is very large and cannot fit in the memory of a single machine. We present d-GLMNET, a new algorithm for solving logistic regression with L1-regularization in the distributed setting. We empirically show that it is superior to distributed online learning via truncated gradient.
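To make the setting concrete, here is a minimal single-machine sketch of the objective being solved and of a generic coordinate-descent step with soft-thresholding. This is not the d-GLMNET algorithm itself; the function names, the {-1, +1} label convention, and the per-coordinate curvature bound `lipschitz_j` are illustrative assumptions.

```python
import numpy as np
from scipy.special import expit  # expit(z) = 1 / (1 + exp(-z))

def l1_logreg_objective(w, X, y, lam):
    """Mean logistic loss plus L1 penalty, with labels y in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.sum(np.abs(w))

def soft_threshold(z, t):
    """Proximal operator of t * |.|; the closed form behind an L1 coordinate step."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coordinate_step(w, j, X, y, lam, lipschitz_j):
    """One coordinate-descent update of w[j] against a quadratic upper bound
    of the logistic loss with per-coordinate curvature lipschitz_j."""
    margins = y * (X @ w)
    grad_j = -np.mean(y * X[:, j] * expit(-margins))  # d(mean loss)/dw_j
    w_new = w.copy()
    w_new[j] = soft_threshold(w[j] - grad_j / lipschitz_j, lam / lipschitz_j)
    return w_new
```

d-GLMNET distributes coordinate-wise updates of this kind across machines; the citation statements below discuss how related work organizes the local solves and the aggregation of the resulting updates.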

Cited by 4 publications (6 citation statements; citing work published 2015–2023) · References 7 publications

Citation statements:
“…These methods have attracted considerable attention in the past few years, and include SCD [69], RCDM [49], UCDC [59], ICD [77], PCDM [60], SPCDM [14], SPDC [86], APCG [37], RCD [44], APPROX [15], QUARTZ [55] and ALPHA [53]. Recent advances on mini-batch and distributed variants can be found in [38], [90], [58], [16], [79], [25], [43] and [41]. Other related work includes [47,13,1,89,17,76].…”
Section: Stochastic Dual Coordinate Ascent with Adaptive Probabilities (mentioning)
confidence: 99%
“…Training algorithms that can be distributed across multiple machines have been the subject of a significant amount of research. Distributed techniques based on stochastic gradient descent have been proposed (see [17] and [18]) as well as methods based on coordinate descent/ascent (see [7], [19], [20], [21] and [22]). These distributed learning algorithms typically involve each machine (or worker) performing a number of optimization steps to approximately minimize the global objective function using the local data that it has available.…”
Section: Distributed Stochastic Learning (mentioning)
confidence: 99%
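The worker pattern described in this statement (each machine approximately minimizes the global objective using only its local data, and the results are then combined) can be sketched as below. The proximal-gradient inner solver, the step count, the learning rate, and the plain averaging are illustrative assumptions, not a specific published method.

```python
import numpy as np
from scipy.special import expit  # expit(z) = 1 / (1 + exp(-z))

def local_update(w_global, X_part, y_part, lam, lr=0.1, n_steps=5):
    """One worker's approximate local solve: a few proximal-gradient steps on
    its own data shard, starting from the shared model. Returns the change
    (delta) relative to the shared model rather than the full model."""
    w = w_global.copy()
    for _ in range(n_steps):
        margins = y_part * (X_part @ w)
        grad = -(X_part.T @ (y_part * expit(-margins))) / len(y_part)
        step = w - lr * grad
        w = np.sign(step) * np.maximum(np.abs(step) - lr * lam, 0.0)  # soft-threshold
    return w - w_global

def synchronous_round(w_global, shards, lam):
    """One synchronous round: each worker computes a local update on its shard
    and the driver combines them (plain averaging here; see the aggregation
    sketch further below)."""
    deltas = [local_update(w_global, Xp, yp, lam) for Xp, yp in shards]
    return w_global + np.mean(deltas, axis=0)
```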
“…The convergence behavior of the distributed SCD algorithm can be improved by optimizing the aggregation step. Existing work has considered both averaging and adding of updates [24], introducing an aggregation parameter that can be set freely [25] and even performing a line search method to explicitly optimize the aggregation parameter [21]. We propose a new method to optimize aggregation for distributed ridge regression whereby an optimal value of an aggregation parameter is precisely computed in a distributed manner.…”
Section: B. Adaptive Aggregation (mentioning)
confidence: 99%
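A minimal sketch of the aggregation choices mentioned above, assuming K worker updates are already available: a single parameter gamma interpolates between averaging (gamma = 1/K) and adding (gamma = 1), and a crude grid search over gamma evaluates the global objective explicitly as a stand-in for a line search. The names and the grid are illustrative; this is not the exact procedure of [21], [24], or [25].

```python
import numpy as np

def aggregate(w_global, deltas, gamma):
    """Combine worker updates with aggregation parameter gamma:
    gamma = 1 / len(deltas) averages them, gamma = 1.0 adds them."""
    return w_global + gamma * np.sum(deltas, axis=0)

def line_search_gamma(objective, w_global, deltas, grid=None):
    """Pick gamma by explicitly evaluating the global objective over a grid,
    a simple stand-in for line-search-based aggregation."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)
    candidates = [aggregate(w_global, deltas, g) for g in grid]
    values = [objective(w) for w in candidates]
    best = int(np.argmin(values))
    return grid[best], candidates[best]
```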
“…Inspired by GLMNET and [34], the work of [3,18] introduced the idea of a block-diagonal Hessian upper approximation in the distributed L1 context. The later work of [29] specialized this approach to sparse logistic regression.…”
Section: Related Work (mentioning)
confidence: 99%
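A sketch of the block-diagonal idea, under the assumption that coordinates (features) are partitioned into blocks, one per machine: the logistic Hessian X^T D X is replaced by its diagonal blocks only, scaled by a factor sigma meant to keep the surrogate an upper bound, so each machine can work with its own block of features without seeing the others'. The partition argument `blocks`, the curvature vector, and the choice of sigma are illustrative.

```python
import numpy as np

def block_diagonal_hessian(X, curvatures, blocks, sigma=1.0):
    """Per-block pieces of a block-diagonal surrogate for the logistic
    Hessian X^T D X, where D = diag(curvatures) and curvatures[i] = p_i * (1 - p_i).
    Cross-block terms are dropped; sigma scales each retained block."""
    weighted = X * curvatures[:, None]  # rows of X scaled by their curvature
    return {k: sigma * (X[:, cols].T @ weighted[:, cols]) for k, cols in blocks.items()}

# Example: 6 features split across 2 machines.
# blocks = {0: [0, 1, 2], 1: [3, 4, 5]}
# H_blocks = block_diagonal_hessian(X, p * (1 - p), blocks)
```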
“…If hypothetically each of our quadratic subproblems G_k^σ(Δα_[k]) as defined in (2) were to be minimized exactly, the resulting steps could be interpreted as block-wise Newton-type steps on each coordinate block k, where the Newton subproblem is modified to also contain the L1-regularizer [18,34,23]. While [18] allows a fixed accuracy for these subproblems (but not arbitrary approximation quality Θ as in our framework), the work of [29,34,31] assumes that the quadratic subproblems are solved exactly. Therefore, these methods are not able to freely trade off communication and computation.…”
Section: Related Work (mentioning)
confidence: 99%
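To illustrate the subproblem structure discussed here, the sketch below approximately minimizes one block's quadratic model g^T d + (sigma/2) d^T H d plus the L1 term lam * ||alpha + d||_1 by cyclic coordinate descent with soft-thresholding; the number of inner passes plays the role of the approximation quality Θ that governs the communication/computation trade-off. The argument names and the fixed pass count are illustrative, not the exact subproblem of any of the cited papers.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def solve_block_subproblem(g, H, alpha, lam, sigma=1.0, inner_passes=10):
    """Approximately minimize  g^T d + (sigma/2) d^T H d + lam * ||alpha + d||_1
    over the block update d, via cyclic coordinate descent. More inner passes
    give a more accurate (Newton-like) step at the cost of more local work."""
    d = np.zeros_like(alpha)
    for _ in range(inner_passes):
        for j in range(len(alpha)):
            # Linear term seen by coordinate j once its own quadratic part is removed.
            b_j = g[j] + sigma * (H[j] @ d - H[j, j] * d[j])
            curv = sigma * H[j, j]
            u = soft_threshold(alpha[j] - b_j / curv, lam / curv)  # new alpha[j] + d[j]
            d[j] = u - alpha[j]
    return d
```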