Are Pretrained Language Models Symbolic Reasoners Over Knowledge?

2020 · Preprint
DOI: 10.48550/arxiv.2006.10413

Abstract: How can pre-trained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that establishes a causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs learn to apply some symbolic reasoning rules, but in particular they struggle with two-hop …
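To make the abstract's setup concrete, here is a minimal sketch of the kind of synthetic probe it describes: generate atomic facts, derive two-hop facts with a symbolic composition rule, and hold some of the derived facts out of training so that correct predictions on them can only come from applying the rule. The entity names, relations, and the composition rule are illustrative assumptions, not the paper's actual dataset.

```python
# Sketch of a synthetic two-hop reasoning probe (assumed setup, for
# illustration only; not the paper's actual data or rule).
import random

random.seed(0)

entities = [f"e{i}" for i in range(20)]

# One-hop facts: (head, relation, tail) triples for two base relations.
parent_of = {e: random.choice(entities) for e in entities}
born_in = {e: random.choice(["cityA", "cityB", "cityC"]) for e in entities}

one_hop = [(h, "parent_of", t) for h, t in parent_of.items()]
one_hop += [(h, "born_in", t) for h, t in born_in.items()]

# Two-hop composition rule (assumed for illustration):
#   parent_of(x, y) AND born_in(y, z)  =>  parent_born_in(x, z)
two_hop = [(x, "parent_born_in", born_in[parent_of[x]]) for x in entities]

# Causal design: hold out some two-hop facts from training. A model that
# truly applies the rule should still predict the held-out tails.
random.shuffle(two_hop)
train_two_hop, heldout_two_hop = two_hop[:10], two_hop[10:]

train_corpus = [" ".join(t) for t in one_hop + train_two_hop]
probe_queries = [(h, r) for h, r, _ in heldout_two_hop]

print(len(train_corpus), "training facts;", len(probe_queries), "held-out two-hop probes")
```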

Cited by 4 publications (4 citation statements) · References 11 publications

“…A number of works have studied the impact of word frequency on different aspects of LLMs and, in particular, on the quality of the delivered representations. Kassner et al. (2020) studied BERT models and possible memorization based on token frequency, demonstrating that if a token appears fewer than 15 times, the model will disregard it, while a token that appears 100 times or more will be predicted more accurately. Zhou et al. (2022) demonstrated that high-frequency and low-frequency words are represented differently by transformer LLMs, in particular by BERT.…”
Section: Related Work
Mentioning · confidence: 99%
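The frequency finding quoted above lends itself to a small analysis harness: bucket test facts by how often they appeared in training, then compare prediction accuracy per bucket against the reported thresholds. This is a hedged sketch; `predict_tail` stands in for whatever cloze-style PLM query is used and is a hypothetical helper, not an API from the cited work.

```python
# Sketch: accuracy of a PLM's fact predictions, bucketed by training
# frequency. Bucket boundaries mirror the thresholds quoted above
# (<15 occurrences largely ignored; >=100 learned well).
from collections import Counter

def accuracy_by_frequency(train_facts, test_facts, predict_tail):
    """train_facts / test_facts: lists of (head, relation, tail) triples.
    predict_tail(head, relation) -> predicted tail (hypothetical helper)."""
    freq = Counter(train_facts)
    buckets = {"<15": [], "15-99": [], ">=100": []}
    for fact in test_facts:
        n = freq[fact]
        key = "<15" if n < 15 else "15-99" if n < 100 else ">=100"
        h, r, t = fact
        buckets[key].append(predict_tail(h, r) == t)
    # Per-bucket accuracy; None for empty buckets.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```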
“…Language Understanding Benchmarks. Previous NLP benchmarks usually evaluate general language understanding, such as slot filling (Elsahar et al., 2019; Levy et al., 2017), QA (Rajpurkar et al., 2016; Joshi et al., 2017; Fan et al., 2019; Ding et al., 2019; Clark et al., 2019; Kassner et al., 2020), dialogue (Dinan et al., 2018), and entailment (Williams et al., 2018; Rocktäschel et al., 2015; Dagan et al., 2005; Morgenstern and Ortiz, 2015). For example, some question answering tasks aim to evaluate machine reading comprehension or reasoning over a knowledge source, such as Wikipedia.…”
Section: Related Work
Mentioning · confidence: 99%
“…Embedding-based methods first convert symbolic facts and rules to embeddings and then apply neural network layers on top to softly predict answers. Recent work in deductive reasoning focused on tasks where rules and facts are expressed in natural language (Talmor et al., 2020; Saeed et al., 2021; Clark et al., 2020b; Kassner et al., 2020). Such tasks are more challenging because the model has to first understand the logic described in the natural language sentences before performing logical reasoning.…”
Section: Related Work
Mentioning · confidence: 99%
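The "embedding-based" pattern in the statement above can be illustrated with a tiny model: encode facts and rules as vectors, pool them into a context representation, and score a candidate query with a small network. The dimensions, mean pooling, and bilinear scorer are assumptions for illustration, not any cited system's architecture.

```python
# Minimal sketch of an embedding-based soft reasoner (assumed design).
import torch
import torch.nn as nn

class SoftReasoner(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # token embeddings
        self.context = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.scorer = nn.Bilinear(dim, dim, 1)       # (context, query) -> logit

    def forward(self, statement_ids, query_ids):
        # statement_ids: (n_statements, seq_len) token ids for facts + rules
        # query_ids: (seq_len,) token ids for the question
        ctx = self.embed(statement_ids).mean(dim=(0, 1))  # pool all statements
        q = self.embed(query_ids).mean(dim=0)
        return self.scorer(self.context(ctx).unsqueeze(0), q.unsqueeze(0))

model = SoftReasoner(vocab_size=100)
facts = torch.randint(0, 100, (5, 8))   # 5 statements, 8 tokens each
query = torch.randint(0, 100, (8,))
print(torch.sigmoid(model(facts, query)))  # soft "is the answer entailed?" score
```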