Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2022.emnlp-main.340
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

Cited by 47 publications (13 citation statements)
References 0 publications
“…CrossFit (Ye et al, 2021). To investigate models' few-shot learning capabilities across tasks, a collection of 269 NLP task datasets, known as CrossFit, has been assembled, covering 13 task types (Wang et al, 2022). In addition to being used for instruction fine-tuning, this dataset is employed for studying models' cross-task generalization and transfer learning abilities.…”
Section: Collection and Improvement of Existing Datasets
confidence: 99%
“…Flan 2022 (Longpre et al, 2023a). The Flan 2022 dataset consists of five parts, namely Flan 2021, T0 (Victor et al, 2022), SUPER-NATURAL INSTRUCTIONS (Wang et al, 2022), CoT datasets, and Dialog datasets. It encompasses as many as 1836 tasks.…”
Section: Collection and Improvement of Existing Datasets
confidence: 99%
“…These efforts include benchmarks like BLUE [63], HunFlair [90], BLURB [20], and BigBio [16], which provide datasets and tasks for evaluating biomedical language understanding and reasoning. Additionally, there are biomedical datasets geared towards prompt-based learning and evaluation of few and zero-shot classification, such as Super-NaturalInstructions [89] and BoX [61]. Out of all benchmarks mentioned above, only BoX contains one CS dataset covering five SLRs, however, this dataset is private.…”
Section: Dataset Overlap
confidence: 99%
“…As generalisation to new domains (with limited in-domain annotation effort) is one of the main desiderata of TOD, some recent work on dialog NLU (Fuisz et al, 2022) has recognised that ID and VE can be cast as question answering (QA) tasks: this facilitates transfer from models trained on large QA datasets (Rajpurkar et al, 2016a), allowing also to capitalise on other large datasets previously recast as QA (McCann et al, 2018; Wang et al, 2022b). These efforts, however, amount to sequential transfer with standard fine-tuning for QA and thus (i) do not align their fine-tuning with the models' pretraining objective; and without an LM-based objective they (ii) cannot benefit from cross-task transfer via natural language task formulations.…”
Section: Introduction
confidence: 99%