2022
DOI: 10.48550/arxiv.2210.09261
Preprint

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Abstract: BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language …
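
As a purely illustrative sketch (not code from the paper), the snippet below shows the few-shot chain-of-thought prompting setup the title and abstract refer to: worked exemplars whose answers are preceded by a step-by-step rationale, followed by a new BBH-style question. The exemplar text, the `query_model` callable, and the answer-extraction regex are assumptions made for this example.

```python
# Minimal sketch of few-shot chain-of-thought (CoT) prompting for a
# BBH-style multiple-choice task. The exemplar and the regex used to
# extract the final option are illustrative, not taken from the BBH release.
import re
from typing import Callable, List, Tuple

# One worked exemplar: (question, step-by-step rationale ending in the answer).
EXEMPLARS: List[Tuple[str, str]] = [
    (
        "Q: Today is 3/5/2022. What is the date one week from today? "
        "Options: (A) 3/12/2022 (B) 3/11/2022",
        "A: Let's think step by step. One week is 7 days. "
        "3/5/2022 plus 7 days is 3/12/2022. So the answer is (A).",
    ),
]

def build_cot_prompt(question: str) -> str:
    """Concatenate worked CoT exemplars with the new question."""
    shots = "\n\n".join(f"{q}\n{a}" for q, a in EXEMPLARS)
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."

def extract_answer(generation: str) -> str:
    """Pull the last '(X)' option out of a generated rationale, if any."""
    matches = re.findall(r"\(([A-Z])\)", generation)
    return f"({matches[-1]})" if matches else generation.strip()

def answer_with_cot(question: str, query_model: Callable[[str], str]) -> str:
    """query_model is a hypothetical text-completion callable supplied by the caller."""
    generation = query_model(build_cot_prompt(question))
    return extract_answer(generation)

if __name__ == "__main__":
    # Stub model so the sketch runs end to end without any API.
    stub = lambda prompt: "One month after 3/5/2022 is 4/5/2022. So the answer is (B)."
    print(answer_with_cot(
        "Today is 3/5/2022. What is the date one month from today? "
        "Options: (A) 4/6/2022 (B) 4/5/2022",
        stub,
    ))
```

Dropping the "Let's think step by step." rationales from the exemplars and keeping only the final answers recovers the answer-only few-shot baseline that CoT prompting is typically compared against.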

Cited by 29 publications (51 citation statements)
References 13 publications (26 reference statements)
“…We show that a model trained on this collection outperforms other public collections on all tested evaluation benchmarks, including the original Flan 2021 (Wei et al, 2021), T0++ (Sanh et al, 2021), Super-Natural Instructions (Wang et al, 2022c), and the concurrent work on OPT-IML (Iyer et al, 2022). As shown in Figure 1, this includes 4.2%+ and 8.5% improvements on the MMLU (Hendrycks et al, 2020) and BIG-Bench Hard (Suzgun et al, 2022) evaluation benchmarks, for equally sized models.…”
Section: Introduction
confidence: 81%
“…We finetune on the prefix language model adapted T5-LM (Lester et al, 2021), using the XL (3B) size for all models, unless otherwise stated. We evaluate on (a) a suite of 8 "Held-In" tasks represented within the 1800+ training task collection (4 question answering and 4 natural language inference validation sets), (b) Chain-of-Thought (CoT) tasks (5 validation sets), and (c) the MMLU (Hendrycks et al, 2020) and BBH (Suzgun et al, 2022) benchmarks as our set of "Held-Out" tasks, as they are not included as part of Flan 2022 finetuning. The Massive Multitask Language Understanding benchmark (MMLU) broadly tests reasoning and knowledge capacity across 57 tasks in the sciences, social sciences, humanities, business, and health, among other subjects.…”
Section: Methods
confidence: 99%
“…When it was introduced, CoT prompting demonstrated benefits for math and commonsense reasoning. Since then, Suzgun et al (2022) report that CoT prompting gives substantial improvements for a hard subset of the BIG-Bench tasks (Srivastava et al, 2022). This makes it a promising prompt for our proposed task of reasoning about implications of negation.…”
Section: C1 Fully Finetuned
confidence: 99%
“…To assess SELF-ICL's effectiveness on challenging, unexpected tasks for which existing demonstrations are hard to come by, we perform evaluation on a suite of 23 tasks from BIG-Bench Hard (BBH) (Suzgun et al, 2022). In a head-to-head comparison, experimental results show 16-1-6 (win-tie-lose) for SELF-ICL versus standard zero-shot on the 23 tasks.…”
Section: Introduction
confidence: 99%