Neel Guha scite author profile

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

show abstract

Holistic Evaluation of Language Models

Liang¹,

Bommasani²,

Lee³

et al. 2022

Preprint

View full text Add to dashboard Cite

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios to the extent possible (87.5% of the time), ensuring that metrics beyond accuracy don't fall to the wayside, and that trade-offs across models and metrics are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to more deeply analyze specific aspects (e.g. knowledge, reasoning, memorization/copyright, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, including 21 scenarios that were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on a set of core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings concerning the interplay between different scenarios, metrics, and models. For full transparency, we release all raw model prompts and completions publicly 3 for further analysis, as well as a general modular toolkit for easily adding new scenarios, models, metrics, and prompting strategies. 4 We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

show abstract

ChatGPT and Physicians’ Malpractice Risk

Guha

2023

JAMA Health Forum

View full text Add to dashboard Cite

show abstract

Leveraging Administrative Data for Bias Audits

Coston

Guha

Ouyang

et al. 2021

View full text Add to dashboard Cite

When does pretraining help?

Zheng

Guha

Anderson

et al. 2021

View full text Add to dashboard Cite

Ask Me Anything: A simple strategy for prompting language models

Arora¹,

Narayan²,

F.³

et al. 2022

Preprint

View full text Add to dashboard Cite

Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly perfect prompt for a task. To mitigate the high degree of effort involved in prompting, we instead ask whether collecting multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING PROMPTING (AMA). We first develop an understanding of the effective prompt formats, finding question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?") tend to outperform those that restrict the model outputs ("John went to the park. Output True or False"). Our approach recursively uses the LLM to transform task inputs to the effective QA format. We apply these prompts to collect several noisy votes for the input's true label. We find that these prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions. We evaluate AMA across open-source model families (EleutherAI, BLOOM, OPT, and T0) and sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B. We release our code for reproducing the results here: https://github.com/HazyResearch/ama_prompting.Recent work has evaluated LLM prompting performance on a broad set of tasks and finds the process to be brittlesmall changes to the prompt result in large performance variations [Zhao et al., 2021, Holtzman et al., 2021. The performance further varies depending on the chosen LLM family [Ouyang et al., 2022 and model size [Wei et al., 2022a, Lampinen et al., 2022. To improve reliability, significant effort is dedicated towards designing a painstakingly perfect prompt. For instance, Mishra et al. [2021] and recommend that users manually explore large search-spaces of strategies to tune their prompts on a task-by-task basis.

show abstract

Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation

Orr

Leszczynski

Arora

et al. 2020

Preprint

View full text Add to dashboard Cite

A challenge for named entity disambiguation (NED), the task of mapping textual mentions to entities in a knowledge base, is how to disambiguate entities that appear rarely in the training data, termed tail entities. Humans use subtle reasoning patterns based on knowledge of entity facts, relations, and types to disambiguate unfamiliar entities. Inspired by these patterns, we introduce Bootleg, a self-supervised NED system that is explicitly grounded in reasoning patterns for disambiguation. We define core reasoning patterns for disambiguation, create a learning procedure to encourage the self-supervised model to learn the patterns, and show how to use weak supervision to enhance the signals in the training data. Encoding the reasoning patterns in a simple Transformer architecture, Bootleg meets or exceeds state-of-the-art on three NED benchmarks. We further show that the learned representations from Bootleg successfully transfer to other non-disambiguation tasks that require entity-based knowledge: we set a new state-ofthe-art in the popular TACRED relation extraction task by 1.0 F1 points and demonstrate up to 8% performance lift in highly optimized production search and assistant tasks at a major technology company.

show abstract

When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset

Zheng¹,

Guha²,

Anderson³

et al. 2021

Preprint

View full text Add to dashboard Cite

12 3

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Neel Guha

On the Opportunities and Risks of Foundation Models

Holistic Evaluation of Language Models

ChatGPT and Physicians’ Malpractice Risk

Leveraging Administrative Data for Bias Audits

When does pretraining help?

Ask Me Anything: A simple strategy for prompting language models

Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation

When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset

Contact Info

Product

Resources

About