2023
DOI: 10.48550/arxiv.2301.05217
Preprint

Progress measures for grokking via mechanistic interpretability

Abstract: Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered…
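
The case study the abstract refers to is small-scale modular arithmetic. As a rough illustration of that kind of setup, the sketch below generates a dataset for such an experiment; the modulus p = 113 and the 30% train split are assumptions made for the sketch, not the paper's exact configuration.

```python
import numpy as np

# Grokking-style modular addition task: predict (a + b) mod p from the pair (a, b).
# p = 113 and the 30% train fraction are illustrative choices, not the paper's exact setup.
p = 113
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

rng = np.random.default_rng(0)
perm = rng.permutation(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]
X_train, y_train = pairs[train_idx], labels[train_idx]
X_test, y_test = pairs[test_idx], labels[test_idx]

# Trained on X_train, a small model typically memorizes first (train accuracy ~100%,
# test accuracy near chance) and only much later generalizes to X_test; that delayed
# jump in test accuracy is the "grokking" behavior the paper reverse-engineers.
```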

Cited by 14 publications (17 citation statements). References 12 publications (22 reference statements).
“…This implies that there are certain prompts that can modify the processing in unexpected ways based on the procedure of how the AI is trained. This is still poorly understood, since to date there is no clear understanding of how these emergent properties awaken from the mathematical operations within the artificial neural networks, which is currently the object of research in a discipline called Mechanistic Interpretability (Conmy et al., 2023; Nanda et al., 2023; Zimmermann et al., 2023).…”
Section: Tot
Mentioning, confidence: 99%
“…Relationship with Circuits: A common theme in mechanistic interpretability, especially when it comes to explaining the grokking phenomenon, is the idea of 'circuit' formation during training (Nanda et al., 2023; Varma et al., 2023; Olah et al., 2020). […] of the network in a region-wise fashion, i.e., for all input vectors {x : x ∈ ω}, the network performs the same affine operation using parameters (A_ω, b_ω) while mapping x to the output. The affine parameters for any given region are a function of the active neurons in the network, as was shown by Humayun et al. (2023a) (Lemma 1).…”
Section: Measuring Local Complexity Using the Deep…
Mentioning, confidence: 99%
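
The region-wise affine claim in the quoted passage can be checked numerically: for a ReLU network, within the linear region ω that contains a point x, the map is exactly f(x') = A_ω x' + b_ω, and A_ω is recoverable as the Jacobian at x. The sketch below is my own illustration under that assumption, using a small toy MLP; it is not code from either cited paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy ReLU MLP; any continuous piecewise-affine network behaves the same way.
net = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 3),
)

x = torch.randn(4)
# Within the region omega containing x, the network is exactly affine:
# f(x') = A_omega @ x' + b_omega. The Jacobian at x gives A_omega.
A_omega = torch.autograd.functional.jacobian(net, x)      # shape (3, 4)
b_omega = net(x) - A_omega @ x

# A tiny perturbation (very likely) keeps the same ReLU activation pattern,
# i.e. stays inside omega, so the affine map reproduces the network output exactly.
x_near = x + 1e-4 * torch.randn(4)
print(torch.max(torch.abs(net(x_near) - (A_omega @ x_near + b_omega))))   # ~0
```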
“…Our novel measure does not rely on the dataset, labels, or loss function that is used during training. It behaves as a progress measure (Barak et al., 2022; Nanda et al., 2023). We summarize the contributions as follows:…”
Section: Introduction
Mentioning, confidence: 99%
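
For context on what a dataset- and label-free progress measure can look like in practice, here is a simple illustrative scalar computed from parameters alone, in the spirit of the Fourier-sparse embeddings reported by Nanda et al. (2023). It is my own example, not the measure proposed in the citing paper; the embedding shape and matrix name below are hypothetical.

```python
import numpy as np

def fourier_sparsity(embedding: np.ndarray) -> float:
    """Label-free progress measure (illustrative): how concentrated the
    embedding's power is across Fourier frequencies of the token axis."""
    power = np.abs(np.fft.rfft(embedding, axis=0)) ** 2   # (num_freqs, d_model)
    freq_power = power.sum(axis=1)
    freq_power /= freq_power.sum()
    # Inverse participation ratio: ~1/num_freqs when diffuse, approaching 1 when
    # a few frequencies dominate (a Fourier-sparse embedding).
    return float((freq_power ** 2).sum())

# Usage: evaluate at each checkpoint and plot against training steps; such a curve
# can rise smoothly well before the sudden jump in test accuracy.
W_E = np.random.randn(113, 128)   # hypothetical token-embedding matrix (p = 113 tokens)
print(fourier_sparsity(W_E))
```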
“…These kinds of models are hard to interpret due to their complexity, irrespective of the soundness of the statistical foundations on which they are built. For instance, proving that a neural network is a universal function approximator (Hornik et al. 1989) is scant consolation for the fact that humans can only make sense of the inner workings of a trained neural network model through laborious analysis that resembles experimental biology more than mathematics (this reverse-engineering work constitutes the newborn field of "mechanistic interpretability;" see, e.g., Olah et al. 2017; Carter et al. 2019; Nanda et al. 2023). Second, the model would have to be trained on simulations.…”
Section: Introduction
Mentioning, confidence: 99%