2020
DOI: 10.1088/1742-5468/abc61d
Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm

Abstract: How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as n^{−β}, where n is the number of training examples and β is an exponent that depends on both data and algorithm. In this work we measure β when applying kernel methods to real datasets. For MNIST we find β ≈ 0.4 and for CIFAR10 β ≈ 0.1, for both regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non…
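The measurement described in the abstract can be reproduced in outline with standard tools: train kernel regression on nested subsets of a dataset, record the test error at each training-set size n, and read off β as the slope in log-log coordinates. The sketch below is a minimal illustration assuming scikit-learn, the OpenML copy of MNIST, and arbitrary choices of kernel bandwidth, ridge and subset sizes; it is not the paper's exact protocol.

```python
# Sketch: estimate the learning-curve exponent beta for kernel ridge regression.
# Assumptions: scikit-learn is available, MNIST is fetched from OpenML, and the
# bandwidth gamma, ridge alpha and train sizes are illustrative choices only.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
y = y.astype(int)
y = y - y.mean()  # regression on the centered label, one of the tasks considered

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=20000, test_size=5000, random_state=0
)

train_sizes = [500, 1000, 2000, 4000, 8000, 16000]
errors = []
for n in train_sizes:
    model = KernelRidge(kernel="laplacian", gamma=1.0 / X.shape[1], alpha=1e-8)
    model.fit(X_train[:n], y_train[:n])
    errors.append(np.mean((model.predict(X_test) - y_test) ** 2))

# Fit log(error) = -beta * log(n) + const; minus the slope estimates beta.
slope, _ = np.polyfit(np.log(train_sizes), np.log(errors), 1)
print(f"estimated beta ≈ {-slope:.2f}")
```

Swapping kernel="laplacian" for kernel="rbf" gives the Gaussian-kernel variant; the fitted exponent can depend on the bandwidth and ridge, which is why the paper reports β across several kernels and task formulations.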

Cited by 36 publications (81 citation statements)
References: 28 publications
“…The closest approach to the work in this paper is probably the one presented in the works of [25][26][27], where average-case learning curves for GPR are derived under the assumption that the model is correctly specified, with recent extensions to kernel regression focusing on noiseless data sets [20,36] and Gaussian design analysis [48]. A related but complementary line of work studies the convergence rates and posterior consistency properties of Bayesian non-parametric models [33,34,49].…”
Section: Related Work
confidence: 99%
“…Learning curves A large-scale empirical characterization of the generalization performance of state-of-the-art deep NNs showed that the associated learning curves often follow a power law of the form n^{−β} with the exponent β ranging between 0.07 and 0.35 depending on the data and the algorithm [19,20]. Power-law asymptotics of learning curves have been theoretically studied in early works for the Gibbs learning algorithm [21][22][23] that showed a generalization error scaling with exponent β = 0.5, 1 or 2 under certain assumptions.…”
Section: Introduction
confidence: 99%
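As a side note on what exponents in this range imply (my own back-of-the-envelope arithmetic, not a claim made in the quoted work): with ε(n) ∝ n^{−β}, reducing the error by a fixed factor requires a data increase that blows up as β shrinks.

$$
\varepsilon(n) \propto n^{-\beta}
\;\Longrightarrow\;
\frac{n_2}{n_1} = \left(\frac{\varepsilon_1}{\varepsilon_2}\right)^{1/\beta},
\qquad
\frac{\varepsilon_1}{\varepsilon_2} = 10:\quad
\beta = 0.35 \Rightarrow \frac{n_2}{n_1} \approx 7\times 10^{2},
\qquad
\beta = 0.07 \Rightarrow \frac{n_2}{n_1} \approx 10^{14}.
$$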
“…Both diagrams also seem to show that at very large H, and starting from intermediate values of the number of samples, the differences between TF, 2L and RF seem to narrow. This type of behavior is expected, since by growing the width of the hidden layer one eventually approaches the kernel regime [19].…”
Section: Discussion
confidence: 87%
“…We find that β = α_t/s if t ≤ s and α_t ≤ 2(α_s + s). This approach is non-rigorous, but it can be proven for Gaussian fields if data are sampled on a lattice [4]. It also corresponds to a provable lower bound on the error when teacher and student are equal [20].…”
Section: Our Contributions
confidence: 99%
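The lattice teacher-student setting mentioned in this excerpt is easy to simulate: draw a teacher function as a Gaussian random field on a regular grid, fit it by kernel regression on a sub-lattice of n points, and track the error as n grows. Below is a minimal sketch for the matched case (student kernel equal to the teacher covariance, the lower-bound situation quoted above); the Laplace length scale, grid size and ridge are assumptions for illustration, not values from the cited works.

```python
# Sketch: teacher-student kernel regression on a 1D lattice (matched case,
# student kernel = teacher covariance). Length scale, grid size and ridge
# are illustrative assumptions, not the cited papers' settings.
import numpy as np

rng = np.random.default_rng(0)

def laplace_K(x, y, ell=0.5):
    # Laplace (exponential) kernel exp(-|x - y| / ell)
    return np.exp(-np.abs(x[:, None] - y[None, :]) / ell)

grid = np.linspace(0.0, 1.0, 1024, endpoint=False)      # the lattice
teacher = rng.multivariate_normal(                       # one teacher draw
    np.zeros(len(grid)), laplace_K(grid, grid), method="svd"
)

for n in [32, 64, 128, 256, 512]:
    idx = np.arange(0, len(grid), len(grid) // n)        # regular sub-lattice of n points
    x_tr, y_tr = grid[idx], teacher[idx]
    K = laplace_K(x_tr, x_tr) + 1e-9 * np.eye(n)         # near-ridgeless regression
    pred = laplace_K(grid, x_tr) @ np.linalg.solve(K, y_tr)
    print(f"n = {n:4d}   test MSE = {np.mean((pred - teacher) ** 2):.3e}")
```

Fitting the printed MSEs against n in log-log coordinates gives an empirical β for this toy setting; replacing the student kernel with a smoother one probes the mismatched case that the quoted exponent formula addresses.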
“…ε(P) ∼ P^{−1/d}. Nonetheless, empirical evidence shows that the curse of dimensionality is beaten in practice [2,3,4], with ε(P) ∼ P^{−β}, β ≫ 1/d.…”
Section: Introduction
confidence: 99%
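For context on where the P^{−1/d} rate quoted above comes from (a standard argument, spelled out here rather than taken from the cited text): for a Lipschitz target in d dimensions, the error of a nearest-neighbour-type estimate is controlled by the typical distance δ(P) from a test point to its closest training point, and that distance shrinks only as P^{−1/d}.

$$
\delta(P) \sim P^{-1/d}
\quad\Longrightarrow\quad
\varepsilon(P) \;\lesssim\; L\,\delta(P) \;\sim\; P^{-1/d},
$$

where L is the Lipschitz constant of the target. Halving the error can therefore require of order 2^{d} times more data, which is the curse of dimensionality that the quoted passage contrasts with the much faster empirical decay β ≫ 1/d.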