2022
DOI: 10.48550/arxiv.2211.00593
Preprint

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Abstract: Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes […]

Cited by 9 publications (16 citation statements)
References 5 publications (5 reference statements)
“…In this work, we studied the behavior of small transformers on a simple algorithmic task, solved with a single circuit. On the other hand, larger models use larger, more numerous circuits to solve significantly harder tasks (Cammarata et al., 2020; Wang et al., 2022). The analysis reported in this work required significant amounts of manual effort, and our progress metrics are specific to small networks on one particular algorithmic task.…”
Section: Conclusion and Discussion (mentioning)
confidence: 96%
“…2. If future mechanistic interpretability can only recover parts of the mechanism of larger models (as in Wang et al. (2022)) and can only generate comprehensive understanding of the mechanisms of smaller models, we might still be able to use our understanding from smaller models to guide the development of measures that track parts of the behavior of the larger model. We find this scenario relatively plausible, as existing mechanistic interpretability work already allows us to recover fragments of large model behavior and understand these fragments by analogy to smaller models.…”
Section: F Further Discussion On Using Mechanistic Interpretability A... (mentioning)
confidence: 99%
“…However, current language models usually map a sequence of input tokens to a probability distribution over the next token. Circuits in real models often consist of components that increase or decrease the probability of some tokens based on previous tokens (Wang et al., 2022). RASP, and hence Tracr, cannot model such "probabilistic" computation, but could potentially be extended to support it.…”
Section: Limitations of RASP and Tracr (mentioning)
confidence: 99%
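The limitation quoted above is easiest to see at the logit level. Below is a minimal, illustrative Python sketch (not code from Tracr or any of the cited papers; the toy dimensions, random weights, and variable names are assumptions) of how a single component's additive contribution to the residual stream shifts the softmax distribution over next tokens, the kind of continuous "probabilistic" effect a discrete RASP program does not express.

```python
# Minimal sketch (illustrative only): a component's output is added to the
# residual stream and, after the unembedding, shifts some token logits up or
# down; the softmax then yields a probability distribution rather than a
# single discrete value of the kind a RASP program produces.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10                   # toy dimensions (assumed)

residual = rng.normal(size=d_model)           # residual stream at the last position
head_output = 0.5 * rng.normal(size=d_model)  # contribution written by one attention head
W_U = rng.normal(size=(d_model, vocab_size))  # toy unembedding matrix

def next_token_distribution(x):
    logits = x @ W_U
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

p_without = next_token_distribution(residual)
p_with = next_token_distribution(residual + head_output)

# The head's effect is a continuous shift in token probabilities.
print("probability shift per token:", np.round(p_with - p_without, 3))
```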
“…Cammarata et al. (2020) explain a range of specific circuits in InceptionV1 (Szegedy et al., 2015), including curve detectors, high-low frequency detectors, and neurons detecting more high-level concepts such as dogs or cars. Elhage et al. (2021) and Wang et al. (2022) achieve early success in interpreting transformer language models using similar methods.…”
Section: Introduction (mentioning)
confidence: 99%
“…Causal Abstraction for Explanations of AI. Geiger et al. (2023) argue that causal abstraction is a generic theoretical framework for providing faithful (Lyu et al., 2022) and interpretable (Lipton, 2018) explanations of AI models and show that LIME (Ribeiro et al., 2016), causal effect estimation (Abraham et al., 2022; Feder et al., 2021), causal mediation analysis (Vig et al., 2020; Csordás et al., 2021; De Cao et al., 2021), iterated nullspace projection (Ravfogel et al., 2020; Elazar et al., 2020), and circuit-based explanations (Olah et al., 2020; Olsson et al., 2022; Wang et al., 2022) can all be seen as special cases of causal abstraction analysis. The circuits research program also posits that a linear combination of neural activations, which they term a 'feature', is the fundamental unit in neural networks.…”
Section: Related Work (mentioning)
confidence: 99%
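As a concrete illustration of that last point, the sketch below (hypothetical, not taken from the cited work; the hidden dimension, random direction, and target value are arbitrary) reads a "feature" as the projection of an activation vector onto a fixed direction and then edits the activations along that direction, the basic ingredient of the intervention-style analyses listed above.

```python
# Minimal sketch (illustrative only): a "feature" in the circuits sense as a
# linear combination of neural activations, i.e. the projection of an
# activation vector onto a fixed direction.
import numpy as np

rng = np.random.default_rng(1)
d = 16                                       # toy hidden dimension (assumed)
activations = rng.normal(size=d)             # activations of one layer for one input
feature_direction = rng.normal(size=d)
feature_direction /= np.linalg.norm(feature_direction)

# Reading the feature: a single scalar, linear in the activations.
feature_value = activations @ feature_direction
print("feature value:", round(float(feature_value), 3))

# A simple causal-style edit: move the activations along the feature
# direction so the feature takes a chosen value, leaving the orthogonal
# component of the activations untouched.
target_value = 2.0
edited = activations + (target_value - feature_value) * feature_direction
print("feature value after edit:", round(float(edited @ feature_direction), 3))
```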