2013
DOI: 10.1007/978-3-642-40994-3_2

Parallel Boosting with Momentum

Abstract: We describe a new, simplified, and general analysis of a fusion of Nesterov's accelerated gradient with parallel coordinate descent. The resulting algorithm, which we call BOOM, for boosting with momentum, enjoys the merits of both techniques. Namely, BOOM retains the momentum and convergence properties of the accelerated gradient method while taking into account the curvature of the objective function. We describe a distributed implementation of BOOM which is suitable for massive high dimensional d…
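The abstract sketches the idea behind BOOM: combine the momentum of Nesterov's accelerated gradient with the per-coordinate curvature scaling of parallel coordinate descent. The paper's actual update rules are not reproduced in this report, so the snippet below is only a minimal illustrative sketch of that combination on a least-squares objective; it uses a heavy-ball-style momentum term rather than Nesterov's scheme, a conservative 1/n safety factor for the simultaneous update, and hypothetical function and variable names. It is not the authors' algorithm.

```python
import numpy as np

def momentum_parallel_cd(A, b, num_iters=500, momentum=0.9):
    """Illustrative sketch only: minimize f(x) = 0.5 * ||A x - b||^2 with
    per-coordinate curvature scaling (parallel coordinate descent flavour)
    plus a heavy-ball momentum term (accelerated-gradient flavour).
    This is NOT the BOOM algorithm from the paper."""
    n = A.shape[1]
    # Per-coordinate curvature: L[j] = ||A[:, j]||^2 is the Lipschitz
    # constant of the j-th partial derivative for this quadratic objective.
    L = np.sum(A ** 2, axis=0) + 1e-12
    x = np.zeros(n)
    velocity = np.zeros(n)
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)        # all coordinate gradients at once
        # Scale every coordinate by its own curvature; the 1/n factor keeps
        # the simultaneous update of coupled coordinates stable.
        step = grad / (n * L)
        velocity = momentum * velocity - step
        x = x + velocity                # update all coordinates in parallel
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))
    x_true = rng.standard_normal(50)
    b = A @ x_true
    x_hat = momentum_parallel_cd(A, b)
    print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```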

Cited by 19 publications (19 citation statements). References 11 publications (16 reference statements).
“…The training points were generated randomly as described in [13], with N = 7000 and n = 50. To establish a reference benchmark with a well known algorithm, we used the particular implementation [13] of one of the coordinate descent (CD) methods of Tseng and Yun [26]. Figure 1 reports the performance of SGD (with β = 7) and SQN (with β = 2), as measured by accessed data points.…”
Section: Experiments With Synthetic Datasets · mentioning · confidence: 99%
“…Parallel methods were considered in [2,19,21], and more recently in [1,5,6,12,13,25,27,28]. A memory distributed method scaling to big data problems was recently developed in [22].…”
Section: Literature · mentioning · confidence: 99%
“…- an asynchronous version of Parallel Coordinate Descent with τ-independent sampling (τ = 16) (Algorithm 2), based on the code of [11] which is freely available; the τ-independent sampling is a good approximation of the τ-nice sampling for τ ≪ n,
- the fully parallel coordinate descent method [6],
- the accelerated version of the fully parallel coordinate descent method [7],
- the classical Adaboost algorithm (greedy coordinate descent); we performed the search for the largest absolute value of the gradient in parallel.…”
[Figure caption: Comparison of algorithms for the resolution of the Adaboost problem on the URL reputation dataset with 16 processors (same colours as in Figure 1).]
Section: Numerical Experiments · mentioning · confidence: 99%
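The excerpt above contrasts the classical Adaboost step, a greedy (largest-absolute-gradient) coordinate choice, with randomized parallel coordinate descent driven by τ-independent sampling, in which each coordinate is included independently so that τ coordinates are updated in expectation. As a rough illustration of those two selection rules only (not of the algorithms in [6], [7], or [11]), here is a small sketch with hypothetical helper names:

```python
import numpy as np

def greedy_coordinate(grad):
    """Greedy (largest-absolute-gradient) rule used by classical Adaboost-style
    coordinate descent: pick the coordinate whose gradient entry has the
    largest absolute value. The arg-max itself can be computed in parallel."""
    return int(np.argmax(np.abs(grad)))

def tau_independent_sample(n, tau, rng):
    """tau-independent sampling: include each coordinate independently with
    probability tau / n, so tau coordinates are chosen in expectation. For
    tau << n this approximates the tau-nice sampling, which picks exactly
    tau coordinates uniformly at random."""
    mask = rng.random(n) < tau / n
    return np.flatnonzero(mask)

# Minimal usage on a synthetic gradient vector (illustrative only).
rng = np.random.default_rng(0)
grad = rng.standard_normal(1000)
print("greedy pick:", greedy_coordinate(grad))
print("sampled block:", tau_independent_sample(n=1000, tau=16, rng=rng))
```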