2022
DOI: 10.48550/arxiv.2204.07705
Preprint
Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks

Abstract: How can we measure the generalization of models to a variety of unseen tasks when provided with their language instructions? To facilitate progress in this goal, we introduce NATURAL-INSTRUCTIONS v2, a benchmark of 1,600+ diverse language tasks and their expert-written instructions. It covers 70+ distinct task types, such as tagging, in-filling, and rewriting. These tasks are collected with contributions of NLP practitioners in the community and through an iterative peer review process to ensure their quality.…

Cited by 6 publications (17 citation statements)
References 12 publications (12 reference statements)
“…In preliminary experiments, we found that T0 was not able to perform few-shot in-context learning: performance actually decreased as we increased the number of in-context examples. This is likely because of the zero-shot format used during multitask prompted fine-tuning and corroborates a recent finding by [10].…”
Section: Performance On T0 Tasks (supporting)
confidence: 88%
“…Performing ICL therefore solely relies on the capabilities that a model learned during pre-training. These characteristics have led to a great deal of recent interest in ICL methods [5][6][7][8][9][10].…”
Section: Introduction (mentioning)
confidence: 99%
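The cited passage refers to in-context learning (ICL), where the task is specified entirely through demonstrations placed in the prompt and the frozen model's pre-trained capabilities do the rest. A minimal Python sketch of this prompt construction follows; the function name and formatting are illustrative assumptions, not taken from the cited works.

```python
# Minimal sketch of few-shot in-context learning (ICL): the task is conveyed
# only through (input, output) demonstrations in the prompt; the language
# model's parameters are never updated, so performance depends entirely on
# capabilities acquired during pre-training.
from typing import List, Tuple


def build_icl_prompt(demonstrations: List[Tuple[str, str]], query: str) -> str:
    """Concatenate demonstrations followed by the unanswered query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demonstrations]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Usage: two sentiment demonstrations and one test query. The resulting
# prompt would be fed to a frozen pre-trained language model for completion.
demos = [
    ("The movie was a delight.", "positive"),
    ("I want my money back.", "negative"),
]
print(build_icl_prompt(demos, "An instant classic."))
```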
“…The Flan 2022 Collection offers the most extensive publicly available set of tasks and methods for instruction tuning, which we have compiled in one place and supplemented with hundreds more high-quality templates and richer formatting patterns. We show that a model trained on this collection outperforms other public collections on all tested evaluation benchmarks, including the original Flan 2021 (Wei et al., 2021), T0++ (Sanh et al., 2021), Super-Natural Instructions (Wang et al., 2022c), and the concurrent work on OPT-IML (Iyer et al., 2022). As shown in Figure 1, this includes improvements of 4.2%+ and 8.5% on the MMLU (Hendrycks et al., 2020) and BIG-Bench Hard (Suzgun et al., 2022) evaluation benchmarks, respectively, for equally sized models.…”
Section: Introduction (mentioning)
confidence: 82%
“…To facilitate the same interface for various customized visual tasks in the wild, it is desirable to have the same uniform task instruction schema. In NLP, all task instructions can follow the same uniform schema, composed of task definition and positive/negative examples [50,70]. Here, the task definition defines a given task in natural language, completely specifying how an input is expected to be mapped to an output text.…”
Section: Retrieval-Augmented Task Instruction (mentioning)
confidence: 99%
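The uniform schema described in the citation above (a natural-language task definition plus positive and negative examples) lends itself to a simple data structure. The sketch below is an illustration in Python; the field names and prompt layout are assumptions for clarity, not the benchmark's official format.

```python
# Illustrative sketch of a uniform task-instruction schema in the spirit of
# NATURAL-INSTRUCTIONS v2: a task definition in natural language plus
# positive/negative demonstration examples. Field names are assumed.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    input: str
    output: str
    explanation: str = ""  # optional rationale, e.g. why a negative example is wrong


@dataclass
class TaskInstruction:
    definition: str  # fully specifies how an input maps to an output text
    positive_examples: List[Example] = field(default_factory=list)
    negative_examples: List[Example] = field(default_factory=list)

    def render_prompt(self, instance_input: str) -> str:
        """Render the instruction plus one unanswered instance as a single prompt."""
        parts = [f"Definition: {self.definition}"]
        for ex in self.positive_examples:
            parts.append(f"Positive Example:\nInput: {ex.input}\nOutput: {ex.output}")
        for ex in self.negative_examples:
            block = f"Negative Example:\nInput: {ex.input}\nOutput: {ex.output}"
            if ex.explanation:
                block += f"\nExplanation: {ex.explanation}"
            parts.append(block)
        parts.append(f"Now complete the following:\nInput: {instance_input}\nOutput:")
        return "\n\n".join(parts)


# Usage: a toy sentiment-tagging task rendered into a prompt.
task = TaskInstruction(
    definition="Given a product review, label its sentiment as 'positive' or 'negative'.",
    positive_examples=[Example("The battery lasts all day.", "positive")],
    negative_examples=[Example("Terrible build quality.", "positive",
                               "The review is clearly negative.")],
)
print(task.render_prompt("Shipping was slow and the box arrived damaged."))
```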