Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language model training. Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping general natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts written in varied natural language. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks specified in natural language. We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16× its size. Further, our approach attains strong performance on a subset of tasks from the BIG-Bench benchmark, outperforming models up to 6× its size. All prompts and trained models are available at github.com/bigscience-workshop/promptsource/ and huggingface.co/bigscience/T0pp.
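To make the prompting step concrete, here is a minimal sketch of how a supervised example can be mapped into a prompted text-to-text (input, target) pair. The template wording, field names, and yes/no verbalizer are illustrative assumptions, not actual promptsource templates (which are written in Jinja and vary per dataset).

```python
# Minimal sketch of mapping a supervised NLI example into a prompted
# text-to-text (input, target) pair. The template wording and the
# yes/no verbalizer are illustrative, not an actual T0 prompt.

def apply_template(example: dict) -> tuple:
    """Render one raw example as (prompted_input, target_text)."""
    prompted_input = (
        f'Given that "{example["premise"]}", '
        f'is it definitely correct that "{example["hypothesis"]}"?'
    )
    # Verbalizer: render the class index as a natural-language word,
    # so the task becomes ordinary text-to-text generation.
    target_text = ["yes", "no"][example["label"]]
    return prompted_input, target_text

example = {
    "premise": "No weapons of mass destruction found in Iraq yet.",
    "hypothesis": "Weapons of mass destruction found in Iraq.",
    "label": 1,  # 0 = entailment, 1 = not entailment
}
print(apply_template(example))
```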
Recently, a boom of papers has shown extraordinary progress in few-shot learning with various prompt-based models. Such success can give the impression that prompts help models learn faster in the same way that humans learn faster when provided with task instructions expressed in natural language. In this study, we experiment with over 30 prompts manually written for natural language inference (NLI). We find that models learn just as fast with many prompts that are intentionally irrelevant or even pathologically misleading as they do with instructively "good" prompts. Additionally, we find that model performance is more dependent on the choice of the LM target words (a.k.a. the "verbalizer" that converts LM vocabulary predictions to class labels) than on the text of the prompt itself. In sum, we find little evidence that existing prompt-based models truly understand the meaning of their given prompts.

Introduction
Suppose a human is given two sentences: "No weapons of mass destruction found in Iraq yet." and "Weapons of mass destruction found in Iraq." They are then asked to respond 0 or 1 and receive a reward if they are correct. In this setup, they would likely need a large number of trials and errors before figuring out what they are really being rewarded to do. This setup is akin to the pretrain-and-fine-tune setup which has dominated NLP in recent years, in which models are asked to classify a sentence representation (e.g., a CLS token) into some arbitrary dimensions of a one-hot vector. In contrast, suppose a human is given a prompt such as: Given that "no weapons of mass destruction found in Iraq yet.", is it definitely correct that "weapons of mass destruction found in Iraq."? Then it would be no surprise that they are able to perform the task more accurately and without needing many examples to figure out what the task is.
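As a rough illustration of how a verbalizer turns an LM's vocabulary predictions into class labels, the sketch below scores each candidate target word with a seq2seq LM and returns the label of the best-scoring word. The checkpoint, verbalizer words, and scoring-by-loss approach are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of verbalizer-based classification: score each candidate target
# word with a seq2seq LM and return the label of the best-scoring word.
# The checkpoint and verbalizer below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").eval()

def classify(prompted_input: str, verbalizer: dict) -> str:
    """verbalizer maps a target word (e.g. 'yes') to a class label."""
    enc = tokenizer(prompted_input, return_tensors="pt")
    scores = {}
    for word, label in verbalizer.items():
        target_ids = tokenizer(word, return_tensors="pt").input_ids
        with torch.no_grad():
            # Cross-entropy of the target word given the prompt;
            # lower loss means the LM finds this word more likely.
            loss = model(**enc, labels=target_ids).loss
        scores[label] = -loss.item()
    return max(scores, key=scores.get)

prompt = ('Given that "No weapons of mass destruction found in Iraq yet.", '
          'is it definitely correct that '
          '"Weapons of mass destruction found in Iraq."?')
print(classify(prompt, {"yes": "entailment", "no": "not_entailment"}))
```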
We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022. Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions that enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning; in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available.
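The balancing and mixed-prompt-setting ideas can be sketched as a simple mixture builder: cap the number of examples drawn per task, then format each example in a randomly chosen prompt setting. The cap value, formatting functions, and few-shot ratio below are illustrative assumptions, not the actual Flan 2022 configuration.

```python
# Illustrative mixture builder: cap examples per task (balancing) and
# mix prompt settings. The cap, ratio, and formatters are assumptions,
# not the actual Flan 2022 recipe.
import random

MAX_PER_TASK = 30_000  # per-task cap; the real value is a tuned choice

def format_zero_shot(ex):
    return (f"{ex['instruction']}\n{ex['input']}", ex["target"])

def format_few_shot(ex, demos):
    shots = "\n\n".join(f"{d['input']}\n{d['target']}" for d in demos)
    return (f"{ex['instruction']}\n{shots}\n\n{ex['input']}", ex["target"])

def build_mixture(tasks, few_shot_ratio=0.5):
    """tasks: {task_name: [examples]} -> shuffled list of (input, target)."""
    mixture = []
    for _name, examples in tasks.items():
        for ex in random.sample(examples, min(len(examples), MAX_PER_TASK)):
            if random.random() < few_shot_ratio:
                # A real pipeline would exclude `ex` from its own demos.
                demos = random.sample(examples, k=min(3, len(examples)))
                mixture.append(format_few_shot(ex, demos))
            else:
                mixture.append(format_zero_shot(ex))
    random.shuffle(mixture)
    return mixture
```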
Recently, a boom of papers has shown extraordinary progress in zero-shot and few-shot learning with various prompt-based models. It is commonly argued that prompts help models to learn faster in the same way that humans learn faster when provided with task instructions expressed in natural language. In this study, we experiment with over 30 prompt templates manually written for natural language inference (NLI). We find that models can learn just as fast with many prompts that are intentionally irrelevant or even pathologically misleading as they do with instructively "good" prompts. Further, such patterns hold even for models as large as 175 billion parameters (Brown et al., 2020), as well as the recently proposed instruction-tuned models which are trained on hundreds of prompts (Sanh et al., 2021). That is, instruction-tuned models often produce good predictions with irrelevant and misleading prompts even in zero-shot settings. In sum, notwithstanding prompt-based models' impressive improvement, we find evidence of serious limitations that question the degree to which such improvement is derived from models understanding task instructions in ways analogous to humans' use of task instructions. (Unabridged version available on arXiv. Code, interactive figures, and statistical test results available at https://github.com/awebson/prompt_semantics.)
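A sketch of the experimental design described above: evaluate one model under prompt templates grouped by category and compare accuracies. The irrelevant and misleading templates here are invented stand-ins in the spirit of the paper's categories, and `classify` refers to the hypothetical verbalizer-based scorer sketched earlier.

```python
# Sketch of the prompt-category comparison: run the same NLI examples
# through instructive, irrelevant, and misleading templates and compare
# accuracy. Templates are invented stand-ins, not the paper's actual ones.

TEMPLATES = {
    "instructive": 'Given that "{premise}", is it definitely correct that "{hypothesis}"?',
    "irrelevant": '"{premise}" Today is a beautiful day to learn French. "{hypothesis}"',
    "misleading": 'Does "{premise}" contain more words than "{hypothesis}"?',
}

VERBALIZER = {"yes": "entailment", "no": "not_entailment"}

def accuracy_by_category(dataset, classify):
    """Return {template_category: accuracy}; labels are the class
    strings produced by the verbalizer ('entailment'/'not_entailment')."""
    results = {}
    for category, template in TEMPLATES.items():
        correct = 0
        for ex in dataset:
            prompt = template.format(premise=ex["premise"],
                                     hypothesis=ex["hypothesis"])
            correct += classify(prompt, VERBALIZER) == ex["label"]
        results[category] = correct / len(dataset)
    return results
```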