Transformer-based models for transfer learning have the potential to achieve high prediction accuracies on text-based supervised learning tasks with relatively few training data instances. These models are thus likely to benefit social scientists that seek to have as accurate as possible text-based measures, but only have limited resources for annotating training data. To enable social scientists to leverage these potential benefits for their research, this article explains how these methods work, why they might be advantageous, and what their limitations are. Additionally, three Transformer-based models for transfer learning, BERT, RoBERTa, and the Longformer, are compared to conventional machine learning algorithms on three applications. Across all evaluated tasks, textual styles, and training data set sizes, the conventional models are consistently outperformed by transfer learning with Transformers, thereby demonstrating the benefits these models can bring to text-based social science research.
One of the first steps in many text-based social science studies is to retrieve documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists has a high risk of drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477–5490, 2020. 10.18653/v1/2020.aclmain.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/ ). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied on a not too small set of labeled training instances (e.g. 1000 documents), reaches a substantially higher retrieval performance than keyword lists.
During the last years, there have been substantial increases in the prediction performances of natural language processing models on text-based supervised learning tasks. Especially deep learning models that are based on the Transformer architecture (Vaswani et al., 2017) and are used in a transfer learning setting have contributed to this development. As Transformer-based models for transfer learning have the potential to achieve higher prediction accuracies with relatively few training data instances, they are likely to benefit social scientists that seek to have as accurate as possible text-based measures but only have limited resources for annotating training data. To enable social scientists to leverage these potential benefits for their research, this paper explains how these methods work, why they might be advantageous, and what their limitations are. Additionally, three Transformer-based models for transfer learning, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and the Longformer (Beltagy et al., 2020), are compared to conventional machine learning algorithms on three social science applications. Across all evaluated tasks, textual styles, and training data set sizes, the conventional models are consistently outperformed by transfer learning with Transformer-based models, thereby demonstrating the potential benefits these models can bring to text-based social science research.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.