Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit
2022 · Preprint · DOI: 10.48550/arxiv.2207.08799

Cited by 4 publications (6 citation statements) · References 0 publications
“…Millidge (2022) suggests that this may be due to SGD being a random walk on the optimal manifold. Our results echo Barak et al. (2022) in showing that the network instead makes continuous progress toward the generalizing algorithm. Liu et al. (2022) construct small examples of grokking, which they use to compute phase diagrams with four separate "phases" of learning.…”
Section: Related Work (supporting)
confidence: 63%
“…Progress measures. Barak et al. (2022) introduce the notion of progress measures: metrics that improve smoothly and that precede emergent behavior. They prove theoretically that training amplifies a certain mechanism and heuristically define a progress measure.…”
Section: Related Work (mentioning)
confidence: 99%
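To make the idea concrete, here is a minimal sketch of tracking such a measure in a toy parity task: a small two-layer ReLU network is trained with minibatch SGD on hinge loss, and a smooth quantity (here, the fraction of first-layer weight mass on the parity's true support) is logged alongside test accuracy. This measure and all hyperparameters are illustrative stand-ins, not the exact quantity defined by Barak et al. (2022).

```python
import numpy as np

# Hedged sketch: a "progress measure" that can improve smoothly while
# test accuracy stays flat. The measure (support_mass) is a hypothetical
# stand-in, not the one from Barak et al. (2022).

rng = np.random.default_rng(0)
d, k, width, n = 20, 3, 64, 4000
support = rng.choice(d, size=k, replace=False)

X = rng.choice([-1.0, 1.0], size=(n, d))
y = np.prod(X[:, support], axis=1)                 # +/-1 parity labels
Xte = rng.choice([-1.0, 1.0], size=(1000, d))
yte = np.prod(Xte[:, support], axis=1)

W = rng.normal(0.0, d ** -0.5, size=(width, d))    # first layer
a = rng.normal(0.0, width ** -0.5, size=width)     # second layer

def forward(Z):
    H = np.maximum(Z @ W.T, 0.0)                   # ReLU features
    return H, H @ a

def support_mass():
    """Fraction of |W| concentrated on the parity's true coordinates."""
    A = np.abs(W)
    return A[:, support].sum() / A.sum()

lr = 0.05
for step in range(2001):
    idx = rng.choice(n, size=128, replace=False)
    H, out = forward(X[idx])
    g = -y[idx] * (y[idx] * out < 1.0)             # hinge-loss gradient d(loss)/d(out)
    gW = ((g[:, None] * a) * (H > 0)).T @ X[idx] / len(idx)
    a -= lr * (H.T @ g) / len(idx)
    W -= lr * gW
    if step % 400 == 0:
        _, oute = forward(Xte)
        acc = (np.sign(oute) == yte).mean()
        print(f"step {step:5d}  test acc {acc:.2f}  support mass {support_mass():.3f}")
```

At initialization the measure sits near the chance level k/d; on runs where the parity is eventually learned, it drifts upward well before the accuracy jump, which is the behavior a progress measure is meant to capture.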
“…Work in mechanistic interpretability aims to discover, understand, and verify the algorithms that model weights implement by reverse engineering model computation into human-understandable components (Olah, 2022; Meng et al., 2022; Geiger et al., 2021; Geva et al., 2020). By understanding underlying mechanisms, we can better predict out-of-distribution behavior (Mu & Andreas, 2020), identify and fix model errors (Vig et al., 2020), and understand emergent behavior (Nanda & Lieberum, 2022; Barak et al., 2022; Wei et al., 2022).…”
Section: Introduction (mentioning)
confidence: 99%
“…An early work with high relevance to the present work is (Wei et al., 2018), which, in addition to establishing that the NTK requires Ω(d²/ε) samples whereas O(d/ε) suffice for the global maximum margin solution, also provided a noisy Wasserstein Flow (WF) analysis which achieved the maximum margin solution, albeit using noise, infinite width, and continuous time to aid in local search. The global maximum margin work of Chizat and Bach (2020) was mentioned before, and will be discussed in Section 3. The work of Barak et al. (2022) uses a two-phase algorithm: the first phase uses a large minibatch and effectively learns the support of the parity in an unsupervised manner; thereafter only the second layer is trained, a convex problem which is able to identify the signs within the parity. As in Table 1, this work stands alone in terms of the narrow width it can handle. The work of Abbe et al. (2022) uses a similar two-phase approach, and while it cannot learn the parity precisely, it can learn an interesting class of "staircase" functions, and presents many valuable proof techniques.…”
Section: Further Related Work (mentioning)
confidence: 99%
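The two-phase structure described in that excerpt can be illustrated with a hedged toy sketch: a single large-batch, gradient-style statistic recovers the parity's support, after which only a convex second-layer fit remains. This is an illustration of the idea under simplifying assumptions (a k = 2 parity, and a majority-style gate standing in for the derivative of the activation at initialization), not the actual algorithm or analysis of Barak et al. (2022); all names are invented.

```python
import numpy as np

# Hedged toy sketch of a two-phase scheme: (1) a large-batch statistic
# identifies the parity's support; (2) only the second layer is fit,
# which is a convex problem. Not Barak et al.'s exact algorithm.

rng = np.random.default_rng(1)
d, k, B = 40, 2, 50_000
support = np.sort(rng.choice(d, size=k, replace=False))

X = rng.choice([-1.0, 1.0], size=(B, d))
y = np.prod(X[:, support], axis=1)

# Phase 1: per-coordinate statistic mimicking the first gradient step.
# gate(x) = 1[sum_j x_j > 0] plays the role of sigma'(w0 . x); on support
# coordinates the statistic picks up a low-degree Fourier coefficient of
# majority and is much larger in magnitude than off support.
gate = (X.sum(axis=1) > 0).astype(float)
g = (y * gate) @ X / B                      # g_i ~ E[y * x_i * gate(x)]
S_hat = np.sort(np.argsort(-np.abs(g))[:k])
print("true support:", support, " recovered:", S_hat)

# Phase 2: freeze random ReLU features on the recovered coordinates and
# solve the convex second-layer problem by ridge regression.
m = 32
V = rng.normal(size=(m, k))
b = rng.normal(size=m)

def feats(Z):
    return np.maximum(Z[:, S_hat] @ V.T + b, 0.0)

H = feats(X)
a = np.linalg.solve(H.T @ H + 1e-3 * np.eye(m), H.T @ y)

Xte = rng.choice([-1.0, 1.0], size=(5000, d))
yte = np.prod(Xte[:, support], axis=1)
acc = np.mean(np.sign(feats(Xte) @ a) == yte)
print(f"phase-2 test accuracy: {acc:.3f}")
```

The split mirrors the description above: support recovery consumes the large batch, and everything after it is a convex fit over a fixed feature map, which is what makes the second phase analyzable.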