2013
DOI: 10.48550/arxiv.1312.1666
Preprint

Semi-Stochastic Gradient Descent Methods

Cited by 59 publications (111 citation statements) | References: 0 publications
“…As is well known, the LASSO penalty is P(W) = λ||W||_1, which serves as a convex counterpart to the non-convex L_0 norm. Kakade et al. [12] showed that the Rademacher averages for linear regressors with an L_1 penalty are bounded by R_n(F_W) ≤ X W_max; this property, together with the assumption that the Lipschitz constant of the LASSO problem is L (more details and methods to compute it may be found in [14,13]), is used to obtain the bound on the Rademacher averages of the loss-function class G. Using the symmetrization lemma, this yields a bound on the expected maximum error (uniform deviation), as given by Eq. (2).…”
Section: Sparsity Through L_1 Regularization (Lasso) (mentioning)
confidence: 99%
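For context, the bound this excerpt appears to invoke is the standard Rademacher-complexity result of Kakade et al. for L_1-constrained linear predictors, combined with symmetrization and Lipschitz contraction. The constants below (X_∞ for the data bound, W_1 for the weight bound, d for the dimension, L for the Lipschitz constant of the loss) follow that standard statement rather than the excerpt itself:

\[
  R_n(\mathcal{F}_W) \;\le\; X_\infty\, W_1 \sqrt{\frac{2\log(2d)}{n}},
  \qquad
  \mathbb{E}\Big[\sup_{g \in \mathcal{G}} \Big(\mathbb{E}[g] - \tfrac{1}{n}\sum_{i=1}^{n} g(z_i)\Big)\Big]
  \;\le\; 2\,L\, R_n(\mathcal{F}_W).
\]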
“…A. Geometric Sampling/Averaging Scheme. Instead of choosing x_{s+1} according to option I or II in Algorithm 1, inspired by [33], we can introduce a "forgetting" effect by considering two other schemes. Option III: sample τ_s randomly from [m] according to the distribution Q = (β^{m-1}/c, β^{m-2}/c, …).…”
Section: Acceleration Strategies (mentioning)
confidence: 99%
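As an illustration of the geometric sampling rule quoted above, here is a minimal sketch, assuming P(τ_s = t) ∝ β^{m-t} for t = 1, …, m with c the normalizing sum; the helper name and NumPy usage are illustrative, not taken from the cited paper.

import numpy as np

def sample_tau_geometric(m, beta, rng=None):
    """Sample tau in {1, ..., m} with P(tau = t) proportional to beta**(m - t).

    With beta < 1, later inner iterates receive higher probability, which gives
    the "forgetting" effect described in the excerpt above.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = beta ** (m - np.arange(1, m + 1))  # (beta^(m-1), ..., beta^0)
    probs = weights / weights.sum()              # c = weights.sum()
    return int(rng.choice(np.arange(1, m + 1), p=probs))

Setting beta = 1 makes all weights equal, so uniform sampling over the m inner iterates is recovered as a special case.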
“…(85) Define δ_{s,t,r} := v_{s,t,r} − ∇f_r(x_{s,t,r}). We can derive an inequality similar to (33) in the proof of Theorem 1, i.e.,…”
Section: Appendix C Proof of Proposition (mentioning)
confidence: 99%
“…As a result, the computational burden of GD is alleviated by stochastic gradients, while the variance of the gradient estimator can also be reduced using snapshot gradients. Members of the variance-reduction family include those abbreviated as SDCA [5], SVRG [6][7][8], SAG [9], SAGA [10,11], MISO [12], S2GD [13], SCSG [14] and SARAH [15,16]. Most of these rely on the update x_{t+1} = x_t − η v_t, where η is a constant step size and v_t is a carefully designed gradient estimator that takes advantage of the snapshot gradient.…”
Section: Introduction (mentioning)
confidence: 99%
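To make the generic update x_{t+1} = x_t − η v_t concrete, the following is a minimal sketch of one SVRG/S2GD-style inner loop, using the estimator v_t = ∇f_i(x_t) − ∇f_i(x̃) + ∇f(x̃) built from a snapshot point x̃. The function names and interfaces are assumptions for illustration, not the API of any of the cited methods.

import numpy as np

def svrg_style_epoch(grad_i, full_grad, x_tilde, n, m, eta, rng=None):
    """One inner loop of a generic SVRG/S2GD-style variance-reduced method.

    grad_i(x, i) : gradient of the i-th component function f_i at x
    full_grad(x) : full gradient of f at x (the snapshot gradient)
    x_tilde      : snapshot point; n : number of components
    m            : number of inner steps; eta : constant step size
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = full_grad(x_tilde)      # snapshot gradient, computed once per outer iteration
    x = x_tilde.copy()
    for _ in range(m):
        i = int(rng.integers(n))
        # v_t = grad f_i(x_t) - grad f_i(x_tilde) + grad f(x_tilde):
        # unbiased, with variance shrinking as x and x_tilde approach the optimum
        v = grad_i(x, i) - grad_i(x_tilde, i) + mu
        x = x - eta * v          # x_{t+1} = x_t - eta * v_t
    return x

Computing the full snapshot gradient only once per outer iteration keeps the per-step cost close to plain SGD, while the correction term is what reduces the variance of the estimator.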