Inspecting the concept knowledge graph encoded by modern language models

Aspillaga, Carlos; Mendoza, Marcelo; Soto, Álvaro

doi:10.18653/v1/2021.findings-acl.263

“…However, it is not clear in which layers it is crucial to have the PC mechanism. We hypothesize that this is related to the fact that the BERT-style models encode syntactic and semantic features in different layers (Jawahar et al., 2019; Aspillaga et al., 2021), so a specialized PC mechanism for syntax or semantics would be desirable. We left this study for future work.…”

Section: Ablation Study (mentioning)

confidence: 99%

Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations

Araujo, Villa, Mendoza et al. 2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Self Cite

Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.

“…In particular, they found that neurons in XLNet are more localized in encoding individual linguistic information compared to BERT, where neurons are shared across multiple properties. By adopting the method of Hewitt and Manning (2019), Aspillaga et al (2021) investigated whether pre-trained language models encode semantic information, for instance by checking their representations against the lexico-semantic structure of WordNet (Miller, 1994).…”

Section: Related Work (mentioning)

confidence: 99%

Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids' Representations

Fayyaz, Aghazadeh, M et al. 2021

Preprint

Most recent work on probing representations has focused on BERT, on the presumption that the findings carry over to other models. In this work, we extend the probing studies to two other models in the family, namely ELECTRA and XLNet, showing that variations in the pre-training objectives or architectural choices can result in different behaviors in encoding linguistic information in the representations. Most notably, we observe that ELECTRA tends to encode linguistic knowledge in the deeper layers, whereas XLNet instead concentrates it in the earlier layers. Also, the former model undergoes only a slight change during fine-tuning, whereas the latter experiences significant adjustments. Moreover, we show that drawing conclusions based on the weight mixing evaluation strategy, which is widely used in the context of layer-wise probing, can be misleading given the norm disparity of the representations across different layers. Instead, we adopt an alternative information-theoretic probing with minimum description length, which has recently been shown to provide more reliable and informative results.

Not All Models Localize Linguistic Knowledge in the Same Place: A Layer-wise Probing on BERToids’ Representations

Fayyaz, Aghazadeh, Modarressi et al. 2021

Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Inspecting the concept knowledge graph encoded by modern language models

Cited by 3 publications

References 45 publications
