“…On the one hand, even non‐fine‐tuned LLMs perform well on multiple tasks designed to probe world knowledge, such as the Winograd Schema Challenge (WSC; Levesque, Davis, & Morgenstern, 2012), the Situations With Adversarial Generations benchmark (SWAG; Zellers et al., 2018), and the Choice of Plausible Alternatives test (COPA; Roemmele, Bejan, & Gordon, 2011), so much so that some authors have proposed and evaluated their use as off‐the‐shelf knowledge bases (Kassner, Dufter, & Schütze, 2021; Petroni et al., 2019; Roberts et al., 2020; Tamborrino, Pellicanò, Pannier, Voitot, & Naudin, 2020). On the other hand, studies using more fine‐grained tests have shown that world knowledge in contemporary LLMs is often brittle and depends strongly on the specific way a problem is stated (Elazar et al., 2021a; 2021b; Ettinger, 2020; Kassner & Schütze, 2020; McCoy, Pavlick, & Linzen, 2019; Niven & Kao, 2019; Pedinotti et al., 2021; Ravichander, Hovy, Suleman, Trischler, & Cheung, 2020; Ribeiro, Wu, Guestrin, & Singh, 2020). For example, some authors have noted that, once low‐level co‐occurrence statistics are properly controlled for, LLMs that were considered highly accurate on world knowledge tasks begin to perform at chance (Elazar, Zhang, Goldberg, & Roth, 2021b; Sakaguchi, Bras, Bhagavatula, & Choi, 2021), highlighting a potential discrepancy between the word‐in‐context prediction objective (which benefits from tracking surface‐level statistics) and world knowledge acquisition (which should be invariant to surface‐level statistics).…”