Denny Zhou scite author profile

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM).We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-ofthe-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned stateof-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. * Equal Contribution. Author contributions and ordering details are listed in Appendix A.

show abstract

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Sun

Xue

et al. 2020

408

349

View full text Add to dashboard Cite

Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resourcelimited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT LARGE , while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an invertedbottleneck incorporated BERT LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3× smaller and 5.5× faster than BERT BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT BASE ), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT BASE ). * This work was done when the first author was an intern at Google Brain.

show abstract

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Lee¹,

Wang²,

Schuurmans³

et al. 2022

Preprint

336

324

View full text Add to dashboard Cite

Scaling Instruction-Finetuned Language Models

Chung¹,

Hou²,

Longpre³

et al. 2022

Preprint

152

124

View full text Add to dashboard Cite

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang¹,

Lee²,

Schuurmans³

et al. 2022

Preprint

125

113

View full text Add to dashboard Cite

We explore a simple ensemble strategy, self-consistency, that significantly improves the reasoning accuracy of large language models. The idea is to sample a diverse set of outputs from a language model and return the most consistent answer in the set. Such ensembling method improves reasoning accuracy when combined with chain of thought prompting. For arithmetic and commonsense reasoning benchmarks we find that self-consistency yields significant accuracy improvements in a variety of datasets, such as GSM8K (+10%), SVAMP (+14%), MultiArith (+24%), CommonsenseQA (+5%) and ARC (easy +4%, challenge +5%).

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Denny Zhou

PaLM: Scaling Language Modeling with Pathways

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Scaling Instruction-Finetuned Language Models

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Contact Info

Product

Resources

About