2021
DOI: 10.48550/arxiv.2106.00969
Preprint
COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

Cited by 2 publications
(2 citation statements)
References 39 publications
“…As emphasized in a recent study (Davis 2023), the concept of commonsense reasoning implies that its involved commonsense knowledge is common. Thus, commonsense AI should be expected to generalize, that is, at least in aggregate, should not exhibit excessive performance loss across independent commonsense benchmarks, such as (Bhagavatula et al 2020; Singh et al 2021; Santos et al 2022; Kejriwal et al 2023), regardless of the specific benchmark on (the training set of) which it has been fine-tuned. In my first work (Shen and Kejriwal 2021), we evaluated this expectation by proposing a methodology and experimental study to measure the generalization ability of transformer-based language models using statistical significance analysis and a rigorous and intuitive metric (i.e., performance loss metric).…”
Section: Generalizability Evaluation (citation type: mentioning)
Confidence: 99%
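The citation statement above refers to a "performance loss metric" for measuring generalization across commonsense benchmarks, but does not spell out its definition. A minimal sketch of one plausible reading, assuming the metric is the relative accuracy drop between the benchmark a model was fine-tuned on and an independent benchmark (the function names and exact formula here are illustrative, not the cited paper's definition):

```python
def performance_loss(in_domain_acc: float, out_domain_acc: float) -> float:
    """Relative accuracy drop when a model fine-tuned on one benchmark
    is evaluated on an independent benchmark.

    0.0 means no degradation out of domain; values approaching 1.0
    mean the model's accuracy collapses on the unseen benchmark.
    """
    if not 0.0 < in_domain_acc <= 1.0:
        raise ValueError("in-domain accuracy must be in (0, 1]")
    return (in_domain_acc - out_domain_acc) / in_domain_acc


def mean_performance_loss(results: dict[str, tuple[float, float]]) -> float:
    """Aggregate the loss over several held-out benchmarks, each given
    as a (in-domain accuracy, out-of-domain accuracy) pair."""
    losses = [performance_loss(i, o) for i, o in results.values()]
    return sum(losses) / len(losses)
```

Averaging across several independent benchmarks, as in the second function, matches the quote's "at least in aggregate" framing: a single out-of-domain result can be noisy, so the expectation is stated over the whole suite.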
“…
| Benchmark | Task | Size | Construction |
| --- | --- | --- | --- |
| Avicenna [1] | Complete a valid syllogism | 6000 records | Crowd sourcing |
| CA-EHN [84] | Chinese word analogy | 90,505 analogies | Adapted; labelled by experts |
| CIDER [49] | Causal explanation | 807 dialogues, 4539 causal triplets | Adapted datasets with expert annotations |
| CODAH [23] | Sentence completion | 28,000 questions | Crowd sourcing |
| Chinese WSC [147] | Pronoun resolution | 1838 questions | Expert construction |
| CommonGen [86] | Make a sentence from given words | 35,141 concept sets; 77,449 sentences | Crowd sourcing |
| Com2Sense [124] | Is a sentence plausible? | 4000 sentence pairs | Crowd sourcing |
| CosmosQA [63] | Question answering | 35,600 problems | Crowd sourcing |
| CommonsenseQA [130] | Question answering | 12,247 questions | Crowd sourcing |
| CommonsenseQA 2.0 [131] | Yes/no questions | 14,343 questions | Gamification |
| COPA [111] | Select a conclusion causally connected to a premise | 1000 questions | Expert authors |
| CREAK [104] | True/False questions | 13,000 questions | Crowd sourcing |
| CycIC (No paper) | Question answering | 10,700 questions | Synthesized |
| DefeasibleNLI [114] | Does new information strengthen an inference? | 250,000 examples | Crowd sourcing |
…”
Section: Collections Of Elementary Mathematical Word Problems (citation type: mentioning)
Confidence: 99%