2023 · Preprint
DOI: 10.31235/osf.io/5ecfa
Synthetic Replacements for Human Survey Data? The Perils of Large Language Models

Abstract: Large Language Models (LLMs) offer new research possibilities for social scientists, but their potential as “synthetic data” is still largely unknown. In this note, we investigate the potential of using the popular closed-source LLM ChatGPT to measure human opinion. We show that although ChatGPT-generated opinions are similar to human opinion for some groups of US respondents, synthetic opinions also significantly exaggerate the extremity and certainty of partisan and social divisions. Responses from prompted …

Cited by 8 publications (18 citation statements) · References 28 publications
“…Finally, standard limitations in the everyday use of LLMs also apply to their usage for classification tasks. Biases inherent in the training of these models (Bisbee et al., 2023; Motoki et al., 2024) may seep into text annotation, especially ones more specific or contentious than the classifications done here. Researchers should be mindful of these potential biases and carefully consider their impact on potential outcomes.…”
Section: Discussion
confidence: 95%
“…Some argue such “silicon samples” could be used to produce more diverse samples than the convenience samples utilized by so many university researchers—and may also allow researchers to administer lengthier survey instruments, since LLMs have potentially unlimited attention spans (12). At the same time, more recent research indicates GPT 3.5 turbo produces accurate mean estimates of attitudes within a population, but understates variances—exaggerating extreme attitudes (13). Another study indicates LLMs exhibit an affirmative bias in yes/no questions (14).…”
Section: Opportunities for Social Science with Generative AI
confidence: 99%
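The variance-understatement pattern the quoted passage describes can be sketched with a toy simulation (all numbers below are illustrative assumptions, not figures from the cited study): a synthetic sample that matches the human mean but compresses individual spread makes a subgroup look far more uniform, and hence more uniformly partisan, than the human sample it stands in for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-7 attitude scale for a single partisan subgroup;
# the 5.5 mean and both spread values are invented for illustration.
human = rng.normal(loc=5.5, scale=1.5, size=10_000)      # realistic individual spread
synthetic = rng.normal(loc=5.5, scale=0.4, size=10_000)  # understated variance

# The mean estimate looks "accurate" ...
print(f"means:     human={human.mean():.2f}  synthetic={synthetic.mean():.2f}")

# ... but the compressed variance makes the subgroup appear far more
# uniform in its attitudes than the human sample.
print(f"variances: human={human.var():.2f}  synthetic={synthetic.var():.2f}")
```

This is the aggregate-level failure mode: a researcher checking only subgroup means would see agreement, while any quantity that depends on the distribution's tails (share of respondents at the scale extremes, between-group polarization measures) would be distorted.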
“…Another study indicates LLMs exhibit an affirmative bias in yes/no questions (14). Studies also indicate LLMs represent some demographic subgroups more accurately than others (13, 15). Yet these studies do not employ the latest models, and only focus on one country: the United States.…”
Section: Opportunities for Social Science with Generative AI
confidence: 99%