2013
DOI: 10.48550/arxiv.1312.1666
Preprint

Semi-Stochastic Gradient Descent Methods

Cited by 59 publications (111 citation statements) | References: 0 publications
“…As is well known, the LASSO penalty is P(W) = λ||W||_1, which serves as a convex counterpart to the non-convex L_0 norm. Kakade et al. [12] showed that the Rademacher averages for linear regressors with an L_1 penalty are bounded by R_n(F_W) ≤ X W_max; this property, together with the assumption that the Lipschitz constant of the LASSO problem is L (more details and methods to compute it may be found in [14,13]), is used to obtain the bound on the Rademacher averages of the loss-function class G. Using the symmetrization lemma, this yields a bound on the expected maximum error (uniform deviation), as given by Eq. (2).…”
Section: Sparsity Through L_1 Regularization (Lasso) (mentioning)
confidence: 99%
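For context, the bound this excerpt appears to invoke is the standard Rademacher-complexity result of Kakade et al. for L_1-constrained linear predictors, combined with symmetrization and Lipschitz contraction. The constants below (X_∞ for the data bound, W_1 for the weight bound, d for the dimension, L for the Lipschitz constant of the loss) follow that standard statement rather than the excerpt itself:

\[
  R_n(\mathcal{F}_W) \;\le\; X_\infty\, W_1 \sqrt{\frac{2\log(2d)}{n}},
  \qquad
  \mathbb{E}\Big[\sup_{g \in \mathcal{G}} \Big(\mathbb{E}[g] - \tfrac{1}{n}\sum_{i=1}^{n} g(z_i)\Big)\Big]
  \;\le\; 2\,L\, R_n(\mathcal{F}_W).
\]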
“…A. Geometric Sampling/Averaging Scheme. Instead of choosing x_{s+1} according to option I or II in Algorithm 1, inspired by [33], we can introduce a "forgetting" effect by considering two other schemes. Option III: sample τ_s randomly from [m] according to the distribution Q = (β^{m-1}/c, β^{m-2}/c, …).…”
Section: Acceleration Strategies (mentioning)
confidence: 99%
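As an illustration of the geometric sampling rule quoted above, here is a minimal sketch, assuming P(τ_s = t) ∝ β^{m-t} for t = 1, …, m with c the normalizing sum; the helper name and NumPy usage are illustrative, not taken from the cited paper.

import numpy as np

def sample_tau_geometric(m, beta, rng=None):
    """Sample tau in {1, ..., m} with P(tau = t) proportional to beta**(m - t).

    With beta < 1, later inner iterates receive higher probability, which gives
    the "forgetting" effect described in the excerpt above.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = beta ** (m - np.arange(1, m + 1))  # (beta^(m-1), ..., beta^0)
    probs = weights / weights.sum()              # c = weights.sum()
    return int(rng.choice(np.arange(1, m + 1), p=probs))

Setting beta = 1 makes all weights equal, so uniform sampling over the m inner iterates is recovered as a special case.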
“…(85) Define δ_{s,t,r} := v_{s,t,r} − ∇f_r(x_{s,t,r}). We can derive an inequality similar to (33) in the proof of Theorem 1, i.e.,…”
Section: Appendix C Proof of Proposition (mentioning)
confidence: 99%
“…As a result, the computational burden of GD is alleviated by stochastic gradients, while the variance of the gradient estimator can also be reduced using snapshot gradients. Members of the variance-reduction family include those abbreviated as SDCA [5], SVRG [6][7][8], SAG [9], SAGA [10,11], MISO [12], S2GD [13], SCSG [14] and SARAH [15,16]. Most of these rely on the update x_{t+1} = x_t − η v_t, where η is a constant step size and v_t is a carefully designed gradient estimator that takes advantage of the snapshot gradient.…”
Section: Introduction (mentioning)
confidence: 99%
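To make the generic update x_{t+1} = x_t − η v_t concrete, the following is a minimal sketch of one SVRG/S2GD-style inner loop, using the estimator v_t = ∇f_i(x_t) − ∇f_i(x̃) + ∇f(x̃) built from a snapshot point x̃. The function names and interfaces are assumptions for illustration, not the API of any of the cited methods.

import numpy as np

def svrg_style_epoch(grad_i, full_grad, x_tilde, n, m, eta, rng=None):
    """One inner loop of a generic SVRG/S2GD-style variance-reduced method.

    grad_i(x, i) : gradient of the i-th component function f_i at x
    full_grad(x) : full gradient of f at x (the snapshot gradient)
    x_tilde      : snapshot point; n : number of components
    m            : number of inner steps; eta : constant step size
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = full_grad(x_tilde)      # snapshot gradient, computed once per outer iteration
    x = x_tilde.copy()
    for _ in range(m):
        i = int(rng.integers(n))
        # v_t = grad f_i(x_t) - grad f_i(x_tilde) + grad f(x_tilde):
        # unbiased, with variance shrinking as x and x_tilde approach the optimum
        v = grad_i(x, i) - grad_i(x_tilde, i) + mu
        x = x - eta * v          # x_{t+1} = x_t - eta * v_t
    return x

Computing the full snapshot gradient only once per outer iteration keeps the per-step cost close to plain SGD, while the correction term is what reduces the variance of the estimator.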