Large language models are able to perform a task by conditioning on a few input-output demonstrations, a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance in our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space. Our code and data are publicly available at https://github.com/orhonovich/instruction-induction

We examine instruction induction on 24 tasks, ranging from morphosyntactic tasks (e.g., pluralization) to style transfer (e.g., formality) and sentiment analysis. As a basic evaluation protocol, we collect human annotations and use them as gold-standard references; the generated instructions are then compared to these references using BERTScore [Zhang et al., 2020]. Moreover, we suggest a novel evaluation metric for instruction induction: execution accuracy. The execution accuracy of a generated instruction is measured by testing whether LMs can correctly perform the task in a zero-shot manner by using the generated instruction alone, without any demonstrations.
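To make the execution-accuracy protocol concrete, the following is a minimal sketch, not the paper's exact implementation. The `complete` callable is a hypothetical wrapper around whatever LM completion API is used (e.g., an InstructGPT-style model), the prompt template is an assumption, and scoring here is simplified to case-insensitive exact match rather than the paper's per-task verification.

```python
# Minimal sketch of execution accuracy: run a generated instruction zero-shot
# on held-out (input, output) pairs and measure how often the model is correct.
# `complete` is a hypothetical LM-completion wrapper supplied by the caller.

from typing import Callable, Iterable, Tuple


def execute_instruction(instruction: str, x: str, complete: Callable[[str], str]) -> str:
    """Perform the task zero-shot: the generated instruction plus a single input,
    with no demonstrations in the prompt (prompt template is an assumption)."""
    prompt = f"Instruction: {instruction}\nInput: {x}\nOutput:"
    return complete(prompt).strip()


def execution_accuracy(
    instruction: str,
    test_pairs: Iterable[Tuple[str, str]],
    complete: Callable[[str], str],
) -> float:
    """Fraction of test pairs answered correctly when guided only by the instruction.
    Case-insensitive exact match is a simplification of per-task answer checking."""
    pairs = list(test_pairs)
    correct = sum(
        execute_instruction(instruction, x, complete).lower() == y.lower()
        for x, y in pairs
    )
    return correct / len(pairs)
```

The design choice behind this metric is that it judges an instruction by whether a model can actually follow it, rather than by its surface similarity to human-written references.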