2019
DOI: 10.48550/arxiv.1909.02597
Preprint

Investigating BERT's Knowledge of Language: Five Analysis Methods with NPIs

Cited by 3 publications (4 citation statements)
References 0 publications
“…Generating data lets us control the lexical and syntactic content so that we can guarantee that the sentence pairs in IMPPRES evaluate the desired phenomenon (see Ettinger et al., 2016, for related discussion). We generate IMPPRES according to expert-crafted grammars using a codebase developed by Warstadt et al. (2019). The codebase includes a vocabulary of over 3000 lexical items annotated with grammatical features needed to ensure morphological, syntactic, and semantic well-formedness.…”
Section: Methods (mentioning)
Confidence: 99%
“…This was the case even for sentences with distractor clauses between the subject and the verb, and for meaningless sentences. A study of negative polarity items (NPIs) by Warstadt et al. (2019) showed that BERT is better able to detect the presence of NPIs (e.g. "ever") and the words that allow their use (e.g.…”
Section: Lin (mentioning)
Confidence: 99%
“…Furthermore, different probing methods may reveal complementary or even contradictory information, in which case a single test (as done in most studies) would not be sufficient (Warstadt et al., 2019). Certain methods might also favor a certain model, e.g., RoBERTa is trailing BERT with one tree extraction method, but leading with another (Htut et al., 2019).…”
Section: Limitations (mentioning)
Confidence: 99%
“…A complete description of the large body of work on probing is beyond the scope of this paper. Besides those discussed earlier, other aspects studied include filler-gap dependencies (Wilcox et al., 2018), function word comprehension (Kim et al., 2019), sentence-level properties (Adi et al., 2016), and negative polarity items (Warstadt et al., 2019).…”
Section: Related Work (mentioning)
Confidence: 99%