2021
DOI: 10.48550/arxiv.2108.13138
Preprint

Neuron-level Interpretation of Deep NLP Models: A Survey

Abstract: The proliferation of deep neural networks in various domains has seen an increased need for interpretability of these methods. A plethora of research has been carried out to analyze and understand components of deep neural network models. Preliminary work along these lines, and the papers that surveyed it, focused on a more high-level representation analysis. However, a recent branch of work has concentrated on interpretability at a more granular level, analyzing neurons and groups of neurons in the…

Cited by 3 publications (4 citation statements)
References 24 publications
“…See also a number of previous surveys and critiques of interpretability work that have overlap with ours [3], [58], [60], [68], [95], [118], [136], [173]- [175], [208], [215], [218], [219]. This survey, however, is distinct in its focus on inner interpretability, AI safety, and the intersections between interpretability and several other research paradigms.…”
Section: Scope and Taxonomy
confidence: 98%
“…The closest survey related to our work is from Sajjad et al [25], where the survey is on fine-grained neuron analysis. While there have been two previous surveys that cover Concept Analysis [26] and Attribution Analysis [24], their focus is on analyzing individual neurons to better understand the inner workings of neural networks.…”
Section: Related Surveys
confidence: 99%
“…A common observation that we see in the contemporary general surveys and from our focused reviews is the lack of both theoretical foundations and empirical considerations in evaluations [25,23,24]. Even though each method has quantitative measures for evaluation, there is no standard set of metrics for comparing various observations, hence, confining the scope of respective interpretability technique results to specific model architectures or task-related domains.…”
Section: Insights and Future Directions
confidence: 99%
“…Moreover, enforcing neuron activation sparsity in MLPs helps to improve the percentage of neurons that are interpretable [49]. Hence, our discovery may point to new directions towards developing more interpretable DNNs [50,51].…”
Section: Sparsity for Robustness
confidence: 99%