2021
DOI: 10.48550/arxiv.2108.00781
Preprint

Generalization Properties of Stochastic Optimizers via Trajectory Analysis

Abstract: Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms on generalization performance in realistic non-convex settings is still poorly understood. In this paper, we provide an encompassing theoretical framework for investigating the generalization properties of stochastic optimizers, which is based on their dynamics. We first prove a generalization bound attributable to the optimizer dynamics in terms of the celebrated Fernique-Talagrand functional…
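
For context, the Fernique-Talagrand functional referenced in the abstract is commonly identified with Talagrand's generic-chaining functional γ₂; the sketch below states the standard definition and the associated majorizing-measure bound, using generic placeholder notation (T, d, A_n) rather than the paper's own.

% Generic-chaining (Fernique-Talagrand) functional: the infimum runs over
% admissible partition sequences (A_n) of T with |A_0| = 1 and |A_n| <= 2^{2^n};
% A_n(t) denotes the cell of A_n containing t.
\[
  \gamma_2(T, d) \;=\; \inf_{(\mathcal{A}_n)} \, \sup_{t \in T} \, \sum_{n \ge 0} 2^{n/2} \, \operatorname{diam}\bigl(A_n(t)\bigr).
\]
% For a centered Gaussian process (X_t)_{t \in T} with canonical distance d,
% Talagrand's majorizing-measure theorem gives, for universal constants c, C > 0,
\[
  c \, \gamma_2(T, d) \;\le\; \mathbb{E} \sup_{t \in T} X_t \;\le\; C \, \gamma_2(T, d).
\]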

Cited by 3 publications (8 citation statements) | References 43 publications

“…The proof of the part (ii) follows similarly. Now, we proceed to show the inequality (16). We show that under any distribution Q defined over $W \times \prod_{i \in [K]} (S_i \times W_i)$, there exist proper choices of $P_{\hat{W}_i \mid S_i, W_{1:K \setminus i}}$ and 0-1 loss function $\hat{\ell}(z, \hat{w}_i)$ such that…”
Section: Proof of Theorem
Mentioning confidence: 97%
“…Common approaches to studying the generalization error of a statistical learning algorithm often consider the effective hypothesis space induced by the algorithm, rather than the entire hypothesis space, or the information leakage about the training dataset. Examples include information-theoretic (mutual information) approaches [3, 4, 5, 6, 7, 8], compression-based approaches [9, 10, 11, 12, 13], and intrinsic-dimension or fractal-based approaches [14, 15, 16]. Recently, a novel approach [17] that generalizes the notion of algorithm compressibility by using lossy covering from source coding concepts was used to show that the compression error rate of an algorithm is strongly connected to its generalization error, both in expectation and with high probability, and, consequently, to establish new rate-distortion-based bounds on the generalization error.…”
Mentioning confidence: 99%
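
As one concrete instance of the information-theoretic approaches this excerpt groups together, a frequently used bound (due to Xu and Raginsky; whether it appears among the citing paper's references [3]-[8] cannot be verified from this page) controls the expected generalization gap by the mutual information between the training sample S of size n and the output hypothesis W, assuming the loss is σ-sub-Gaussian under the data distribution μ:

\[
  \Bigl| \, \mathbb{E}\bigl[ L_\mu(W) - \widehat{L}_S(W) \bigr] \, \Bigr| \;\le\; \sqrt{\frac{2\sigma^2}{n} \, I(S; W)},
\]
% where \widehat{L}_S is the empirical risk on S and L_\mu the population risk.
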
“…This result reveals a similar phenomenon to that of [53] and [50] as discussed above. Furthermore, our results do not require any non-trivial assumptions, compared to the existing heavy-tailed generalization bounds [3,23,46].…”
Section: Introduction
Mentioning confidence: 92%
“…They showed that, under several assumptions on the SDE (4), the worst-case generalization error over the trajectory, i.e., $\sup_{t \in [0,1]} |F(\theta_t, X) - F(\theta_t)|$, scales with the intrinsic dimension of the trajectory $(\theta_t)_{t \in [0,1]}$, which is then upper-bounded as a particular function of the tail-exponent around a local minimum, indicating that heavier tails imply lower generalization error. Their results were later extended to discrete-time recursions as well in [23]. More recently, [3] linked heavy tails to generalization through a notion of compressibility in over-parameterized regimes.…”
Section: Introduction
Mentioning confidence: 99%
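
The "intrinsic dimension of the trajectory" invoked in this excerpt can be illustrated numerically. The sketch below is my own toy construction, not the cited papers' procedure: it records a noisy gradient-descent trajectory on a quadratic objective and estimates its correlation dimension from the slope of the Grassberger-Procaccia correlation integral; the objective, step sizes, and radii are illustrative choices.

# Illustrative sketch (not the cited papers' method): estimate the correlation
# dimension of a noisy optimizer trajectory via the Grassberger-Procaccia
# correlation integral.
import numpy as np


def sgd_trajectory(grad, theta0, lr=0.01, noise=0.1, steps=5000, seed=0):
    """Run noisy gradient descent and return the visited iterates."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    traj = np.empty((steps, theta.size))
    for k in range(steps):
        theta = theta - lr * grad(theta) + noise * rng.normal(size=theta.size)
        traj[k] = theta
    return traj


def correlation_dimension(points, radii):
    """Slope of log C(r) versus log r, where C(r) is the fraction of point
    pairs at distance less than r (the correlation integral)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    pair_d = dists[np.triu_indices(len(points), k=1)]
    c = np.array([(pair_d < r).mean() for r in radii])
    mask = c > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
    return slope


if __name__ == "__main__":
    quad_grad = lambda th: th                                    # gradient of f(theta) = |theta|^2 / 2
    traj = sgd_trajectory(quad_grad, theta0=np.ones(10))[::10]   # subsample to 500 points
    radii = np.geomspace(0.05, 2.0, 20)
    print("estimated correlation dimension:", correlation_dimension(traj, radii))
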
“…where σ > 0 is a scale parameter, $L^{\alpha}_t$ is a d-dimensional α-stable Lévy process, which has heavy-tailed increments and will be formally defined in the next section, and α ∈ (0, 2] denotes the 'tail-exponent', such that as α gets smaller the process $L^{\alpha}_t$ becomes heavier-tailed. Within this mathematical framework, Şimşekli et al. [27] proved an upper bound (which was then improved in [14]) for the worst-case generalization error over the trajectories of (4). The bound informally reads as follows: with probability at least 1 − δ, it holds that…”
Section: Introduction
Mentioning confidence: 99%
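
To make the heavy-tailed dynamics in this excerpt concrete, the sketch below simulates a simple Euler-type discretization of a gradient flow perturbed by α-stable noise; the toy objective, step size, and the η^{1/α} increment scaling are my own illustrative assumptions, not equation (4) of the cited paper. Running it shows that smaller α (heavier tails) produces much larger excursions of the iterates.

# Illustrative sketch (not equation (4) of the cited paper): Euler-type
# discretization of a gradient flow driven by alpha-stable Levy noise,
#     theta_{k+1} = theta_k - eta * grad_f(theta_k) + sigma * eta**(1/alpha) * xi_k,
# where the xi_k are i.i.d. symmetric alpha-stable; smaller alpha means heavier tails.
import numpy as np
from scipy.stats import levy_stable


def heavy_tailed_gd(grad, theta0, alpha, eta=1e-3, sigma=0.1, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    traj = np.empty((steps, theta.size))
    for k in range(steps):
        xi = levy_stable.rvs(alpha, 0.0, size=theta.size, random_state=rng)
        theta = theta - eta * grad(theta) + sigma * eta ** (1.0 / alpha) * xi
        traj[k] = theta
    return traj


if __name__ == "__main__":
    grad = lambda th: th                # toy quadratic objective f(theta) = |theta|^2 / 2
    for alpha in (2.0, 1.8, 1.5):       # alpha = 2 is the Gaussian (light-tailed) case
        traj = heavy_tailed_gd(grad, theta0=np.zeros(5), alpha=alpha)
        print(f"alpha={alpha}: largest |coordinate| visited = {np.abs(traj).max():.3f}")
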