2021
DOI: 10.48550/arxiv.2106.02902
Preprint

BERTnesia: Investigating the capture and forgetting of knowledge in BERT

Abstract: Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this article, we probe BERT specifically to understand and measure the relational knowledge it captures in its parametric memory. While probing for linguistic understanding is commonly applied to all layers of BERT as well as finetuned models, this has not been done for factual knowledge. We utilize existing knowledge base completion tasks (LAMA) to probe every l…
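The layer-wise probing the abstract describes builds on cloze-style LAMA queries: a relational fact is phrased as a masked sentence, and the model's prediction at the mask is checked against the gold entity. The sketch below illustrates that query format. It is a minimal illustration, not the paper's code; it assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, and it probes only the final layer, whereas BERTnesia reads out predictions from every intermediate layer as well.

```python
# Minimal sketch of a LAMA-style cloze probe (illustrative only; assumes
# HuggingFace transformers and the bert-base-uncased checkpoint).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# A relational fact rendered as a cloze statement, as in LAMA.
prompt = "The capital of France is [MASK]."
inputs = tokenizer(prompt, return_tensors="pt")

# Position of the [MASK] token in the input sequence.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

# Rank vocabulary tokens at the masked position; a stored fact should
# place the gold entity ("paris") at or near the top.
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```

A full per-layer reproduction would presumably request output_hidden_states=True and apply the pretrained masked-language-model head to each intermediate hidden state; that extension is inferred from the abstract, not taken from the paper's released code.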

Cited by 3 publications (6 citation statements)
References 35 publications

“…The merit of finetuned LMs has also been shown for common-sense knowledge extraction (Bosselut et al., 2019). Previous work also studies the effect of dataset size for finetuning (Wallat et al., 2021; Fichtel et al., 2021; Da et al., 2021), but the negative effects of finetuning (studied in this paper) remain unexplored. For a full review of the literature on knowledge probing and extraction, we refer to Safavi & Koutra (2021) and AlKhamissi et al. (2022).…”
Section: Related Work
confidence: 73%
“…While previous work typically explains the phenomenon in Figure 1 as a forgetting effect (Wallat et al., 2021), our study reveals a more nuanced explanation in terms of Frequency Shock: even though both "Moscow" and "Baku" have been observed an equal number of times in the training set, "Baku" is expected to be a less common entity and hence less observed during the pre-training of the language model; the finetuned model therefore receives a frequency shock leading to an over-prediction of the entity "Baku", corrupting an originally correct prediction. Note that Frequency Shock and Range Shift are related to the problem of out-of-distribution (OOD) generalization in machine learning; see Section 3.6 for more discussion.…”
Section: Introduction
confidence: 95%
“…There is increasing evidence that scaling LMs to larger sizes is not the solution to generating factually correct information (Lazaridou et al., 2021; Gehman et al., 2020; Lin et al., 2021a). As a result, this would also lead to catastrophic forgetting (Wallat et al., 2021). Changing a single weight may have a ripple effect that affects a large number of other implicitly memorized facts.…”
Section: LMs-as-KBs
confidence: 99%
“…commonsense question answering) so it can make way for the required knowledge to surface in the output during evaluation. Previous work has shown that most of the knowledge encoded in an LM is acquired during pretraining, while finetuning merely learns an interface to access that acquired knowledge (Da et al., 2021; Wallat et al., 2021).…”
Section: Finetuning
confidence: 99%