2022
DOI: 10.48550/arxiv.2211.00593
Preprint

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Abstract: Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes […]

Cited by 9 publications (16 citation statements)
References 5 publications (5 reference statements)
“…In this work, we studied the behavior of small transformers on a simple algorithmic task, solved with a single circuit. On the other hand, larger models use larger, more numerous circuits to solve significantly harder tasks (Cammarata et al., 2020; Wang et al., 2022). The analysis reported in this work required significant amounts of manual effort, and our progress metrics are specific to small networks on one particular algorithmic task.…”
Section: Conclusion and Discussion (mentioning)
confidence: 96%
“…2. If future mechanistic interpretability can only recover parts of the mechanism of larger models (as in Wang et al. (2022)) and can only generate comprehensive understanding of the mechanisms of smaller models, we might still be able to use our understanding from smaller models to guide the development of measures that track parts of the behavior of the larger model. We find this scenario relatively plausible, as existing mechanistic interpretability work already allows us to recover fragments of large model behavior and understand these fragments by analogy to smaller models.…”
Section: F Further Discussion On Using Mechanistic Interpretability A... (mentioning)
confidence: 99%
“…However, current language models usually map a sequence of input tokens to a probability distribution over the next token. Circuits in real models often consist of components that increase or decrease the probability of some tokens based on previous tokens (Wang et al., 2022). RASP, and hence Tracr, cannot model such "probabilistic" computation, but could potentially be extended to support it.…”
Section: Limitations of RASP and Tracr (mentioning)
confidence: 99%
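The limitation quoted above is easiest to see at the logit level. Below is a minimal, illustrative Python sketch (not code from Tracr or any of the cited papers; the toy dimensions, random weights, and variable names are assumptions) of how a single component's additive contribution to the residual stream shifts the softmax distribution over next tokens, the kind of continuous "probabilistic" effect a discrete RASP program does not express.

```python
# Minimal sketch (illustrative only): a component's output is added to the
# residual stream and, after the unembedding, shifts some token logits up or
# down; the softmax then yields a probability distribution rather than a
# single discrete value of the kind a RASP program produces.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10                   # toy dimensions (assumed)

residual = rng.normal(size=d_model)           # residual stream at the last position
head_output = 0.5 * rng.normal(size=d_model)  # contribution written by one attention head
W_U = rng.normal(size=(d_model, vocab_size))  # toy unembedding matrix

def next_token_distribution(x):
    logits = x @ W_U
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

p_without = next_token_distribution(residual)
p_with = next_token_distribution(residual + head_output)

# The head's effect is a continuous shift in token probabilities.
print("probability shift per token:", np.round(p_with - p_without, 3))
```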
“…Cammarata et al. (2020) explain a range of specific circuits in InceptionV1 (Szegedy et al., 2015), including curve detectors, high-low frequency detectors, and neurons detecting more high-level concepts such as dogs or cars. Elhage et al. (2021) and Wang et al. (2022) achieve early success in interpreting transformer language models using similar methods.…”
Section: Introduction (mentioning)
confidence: 99%
“…Causal Abstraction for Explanations of AI. Geiger et al. (2023) argue that causal abstraction is a generic theoretical framework for providing faithful (Lyu et al., 2022) and interpretable (Lipton, 2018) explanations of AI models and show that LIME (Ribeiro et al., 2016), causal effect estimation (Abraham et al., 2022; Feder et al., 2021), causal mediation analysis (Vig et al., 2020; Csordás et al., 2021; De Cao et al., 2021), iterated nullspace projection (Ravfogel et al., 2020; Elazar et al., 2020), and circuit-based explanations (Olah et al., 2020; Olsson et al., 2022; Wang et al., 2022) can all be seen as special cases of causal abstraction analysis. The circuits research program also posits that a linear combination of neural activations, which they term a 'feature', is the fundamental unit in neural networks.…”
Section: Related Work (mentioning)
confidence: 99%
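As a concrete illustration of that last point, the sketch below (hypothetical, not taken from the cited work; the hidden dimension, random direction, and target value are arbitrary) reads a "feature" as the projection of an activation vector onto a fixed direction and then edits the activations along that direction, the basic ingredient of the intervention-style analyses listed above.

```python
# Minimal sketch (illustrative only): a "feature" in the circuits sense as a
# linear combination of neural activations, i.e. the projection of an
# activation vector onto a fixed direction.
import numpy as np

rng = np.random.default_rng(1)
d = 16                                       # toy hidden dimension (assumed)
activations = rng.normal(size=d)             # activations of one layer for one input
feature_direction = rng.normal(size=d)
feature_direction /= np.linalg.norm(feature_direction)

# Reading the feature: a single scalar, linear in the activations.
feature_value = activations @ feature_direction
print("feature value:", round(float(feature_value), 3))

# A simple causal-style edit: move the activations along the feature
# direction so the feature takes a chosen value, leaving the orthogonal
# component of the activations untouched.
target_value = 2.0
edited = activations + (target_value - feature_value) * feature_direction
print("feature value after edit:", round(float(edited @ feature_direction), 3))
```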