quanteda: An R package for the quantitative analysis of textual data

quanteda is an R package providing a comprehensive workflow and toolkit for natural language processing tasks such as corpus management, tokenization, analysis, and visualization. It has extensive functions for applying dictionary analysis, exploring texts using keywords-in-context, computing document and feature similarities, and discovering multi-word expressions through collocation scoring. Based entirely on sparse operations, it provides highly efficient methods for compiling document-feature matrices and for manipulating these or using them in further quantitative analysis. Through extensive use of C++ and multithreading, quanteda is also considerably faster and more efficient than other R and Python packages at processing large textual data.
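The core data structure described above, a sparse document-feature matrix, stores only nonzero counts. quanteda's actual implementation is in R and C++; the following is a minimal language-agnostic sketch in Python (the `tokenize` and `build_dfm` names are illustrative, not quanteda's API), assuming a trivial whitespace tokenizer:

```python
from collections import Counter

def tokenize(text):
    # Minimal lowercase/whitespace tokenizer; quanteda's own tokenizer
    # handles punctuation, n-grams, etc. -- this is a stand-in.
    return text.lower().split()

def build_dfm(docs):
    """Compile a sparse document-feature matrix as a dict of Counters,
    keyed by document id. Only observed (nonzero) counts are stored;
    absent features implicitly count as zero."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

docs = {"d1": "the party supports lower taxes",
        "d2": "the party supports higher welfare spending"}
dfm = build_dfm(docs)
print(dfm["d1"]["party"])  # 1
print(dfm["d2"]["taxes"])  # 0 (never stored, defaults to zero)
```

Because a `Counter` returns 0 for missing keys, the sparsity is transparent to downstream code, which is the property that makes sparse operations on large corpora efficient.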
Scholars estimating policy positions from political texts typically code words or sentences and then build left-right policy scales based on the relative frequencies of text units coded into different categories. Here we reexamine such scales and propose a theoretically and linguistically superior alternative based on the logarithm of odds-ratios. We contrast this scale with the current approach of the Comparative Manifesto Project (CMP), showing that our proposed logit scale avoids widely acknowledged flaws in previous approaches. We validate the new scale using independent expert surveys. Using existing CMP data, we show how to estimate more distinct policy dimensions, for more years, than has been possible before, and make this dataset publicly available. Finally, we draw some conclusions about the future design of coding schemes for political texts.

Almost anyone interested in party competition, whether this takes place in legislatures, the electoral arena, or government, needs sooner or later to estimate the policy positions of key political actors, whether these be individual legislators or the political parties to which they affiliate. Indeed, "how to best measure the policy preferences of individual legislators and of legislative parties" (Loewenberg 2008, 499) forms one of the central problems of legislative research. This is particularly true for scholars of comparative legislative research. While in American settings the policy preferences of legislators have been conceptualized as individual-level variables, tight party discipline in many non-American contexts makes it difficult to derive
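The contrast drawn above is between a relative-frequency scale and one built on a log odds-ratio. A minimal sketch of both, assuming counts of right- and left-coded text units (the 0.5 smoothing constant is one common choice to keep the logit defined at zero counts, an assumption here rather than a claim about the paper's exact formula):

```python
import math

def logit_scale(right, left, smooth=0.5):
    """Log odds-ratio position from counts of right- and left-coded
    text units. Smoothing keeps the scale finite when a count is 0."""
    return math.log((right + smooth) / (left + smooth))

def proportion_difference(right, left, total):
    """Traditional difference-of-proportions scale: (R - L) / N."""
    return (right - left) / total

# A manifesto with 30 right-coded and 10 left-coded of 100 quasi-sentences:
print(round(logit_scale(30, 10), 3))       # 1.066
print(proportion_difference(30, 10, 100))  # 0.2
```

Note that the logit scale depends only on the ratio of right to left counts, whereas the difference scale also depends on how much of the text falls into other categories, one source of the flaws the abstract alludes to.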
Political text offers extraordinary potential as a source of information about the policy positions of political actors. Despite recent advances in computational text analysis, human interpretative coding of text remains an important source of text-based data, ultimately required to validate more automatic techniques. The profession's main source of cross-national, time-series data on party policy positions comes from the human interpretative coding of party manifestos by the Comparative Manifesto Project (CMP). Despite widespread use of these data, the uncertainty associated with each point estimate has never been available, undermining the value of the dataset as a scientific resource. We propose a remedy. First, we characterize the processes by which CMP data are generated. These include inherently stochastic processes of text authorship, as well as the parsing and coding of observed text by humans. Second, we simulate these error-generating processes by bootstrapping analyses of coded quasi-sentences. This allows us to estimate precise levels of non-systematic error for every category and scale reported by the CMP for its entire set of 3,000+ manifestos. Using our estimates of these errors, we show how to correct biased inferences, in recent prominently published work, derived from statistical analyses of error-contaminated CMP data.

Key Words: Comparative Manifesto Project, mapping party positions, party policy, error estimates, measurement error.

* This research was partly supported by the European Commission Fifth Framework (project number SERD-2002-00061) and by the Irish Research Council for the Humanities and Social Sciences. We thank Andrea Volkens for generously sharing her experience and data regarding the CMP; Thomas Däubler for research assistance; and Thomas Däubler, Gary King, Michael D. McDonald, Oli Proksch, and Jon Slapin for comments.
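The bootstrapping step described above can be sketched as follows: resample the coded quasi-sentences with replacement, recompute the scale on each replicate, and take the standard deviation of the replicates as the standard error. This is a simplified illustration of the general nonparametric bootstrap, not the paper's full procedure; the `rile` toy scale and category labels are hypothetical:

```python
import random

def bootstrap_se(codes, scale_fn, n_boot=2000, seed=42):
    """Bootstrap SE of a text scale: resample per-quasi-sentence codes
    with replacement and recompute the scale each time."""
    rng = random.Random(seed)
    n = len(codes)
    reps = []
    for _ in range(n_boot):
        sample = [codes[rng.randrange(n)] for _ in range(n)]
        reps.append(scale_fn(sample))
    mean = sum(reps) / n_boot
    var = sum((r - mean) ** 2 for r in reps) / (n_boot - 1)
    return var ** 0.5

def rile(sample):
    # Toy left-right scale: share of "R" codes minus share of "L" codes.
    return (sample.count("R") - sample.count("L")) / len(sample)

# 100 quasi-sentences: 30 right, 10 left, 60 in other categories.
codes = ["R"] * 30 + ["L"] * 10 + ["O"] * 60
print("point estimate:", rile(codes))
print("bootstrap SE:", round(bootstrap_se(codes, rile), 3))
```

Run once per manifesto, this yields an uncertainty estimate for every published category and scale without any change to the coding process itself.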
We also thank James Adams, Garrett Glasgow, Simon Hix, Abdoul Noury, and Sona Golder for providing and assisting with their replication datasets and code.
Empirical social science often relies on data that are not observed in the field, but are transformed into quantitative variables by expert researchers who analyze and interpret qualitative raw sources. While generally considered the most valid way to produce data, this expert-driven process is inherently difficult to replicate or to assess on grounds of reliability. Using crowd-sourcing to distribute text for reading and interpretation by massive numbers of non-experts, we generate results comparable to those produced by experts reading and interpreting the same texts, but do so far more quickly and flexibly. Crucially, the data we collect can be reproduced and extended transparently, making crowd-sourced datasets intrinsically reproducible. This focuses researchers' attention on the fundamental scientific objective of specifying reliable and replicable methods for collecting the data needed, rather than on the content of any particular dataset. We also show that our approach works straightforwardly with different types of political text, written in different languages. While the findings reported here concern text analysis, they have far-reaching implications for expert-generated data in the social sciences.

…however, sets a far weaker standard than reproducibility of the data, which is typically seen as a fundamental principle of the scientific method. Here, we propose a step toward a more comprehensive scientific replication standard in which the mandate is to replicate data production, not just data analysis. This shifts attention from specific datasets as the essential scientific objects of interest to the published and reproducible method by which the data were generated. We implement this more comprehensive replication standard for the rapidly expanding project of analyzing the content of political texts. Traditionally, much political data is generated by experts applying comprehensive classification schemes to raw sources in a process that, while
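A crowd-sourced coding pipeline of the kind described above typically collects many non-expert judgments per text unit and aggregates them, so that averaging over independent readers cancels individual-coder noise. The sketch below is a hypothetical simplification (the label set, the -1/0/+1 mapping, and simple mean aggregation are assumptions, not the paper's exact estimator):

```python
def aggregate_crowd(codings):
    """Aggregate many non-expert judgments per sentence by mapping
    codes to a numeric scale (-1 = left, 0 = neutral, +1 = right)
    and averaging across coders."""
    value = {"left": -1, "neutral": 0, "right": 1}
    return {sid: sum(value[c] for c in judgments) / len(judgments)
            for sid, judgments in codings.items()}

# Five independent crowd judgments per sentence:
codings = {
    "s1": ["right", "right", "neutral", "right", "right"],
    "s2": ["left", "neutral", "left", "left", "right"],
}
scores = aggregate_crowd(codings)
print(scores["s1"])  # 0.8
print(scores["s2"])  # -0.4
```

Because the aggregation rule is fully specified in code, anyone can rerun the data-production step on new texts or new crowds, which is exactly the reproducibility-of-data standard the abstract argues for.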
Edited by R. Michael Alvarez

The Comparative Manifesto Project (CMP) provides the only time series of estimated party policy positions in political science and has been extensively used in a wide variety of applications. Recent work (e.g., Benoit, Laver, and Mikhaylov 2009; Klingemann et al. 2006) focuses on nonsystematic sources of error in these estimates that arise from the text generation process. Our concern here, by contrast, is with error that arises during the text coding process, since nearly all manifestos are coded only once by a single coder. First, we discuss reliability and misclassification in the context of hand-coded content analysis methods. Second, we report results of a coding experiment that used trained human coders to code sample manifestos provided by the CMP, allowing us to estimate the reliability of both coders and coding categories. Third, we compare our test codings to the published CMP "gold standard" codings of the test documents to assess accuracy and produce empirical estimates of a misclassification matrix for each coding category. Finally, we demonstrate the effect of coding misclassification on the CMP's most widely used index, its left-right scale. Our findings indicate that misclassification is a serious and systemic problem with the current CMP dataset and coding process, suggesting the CMP scheme should be significantly simplified to address reliability issues.
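The effect of a misclassification matrix on a left-right index can be illustrated by propagating true category counts through per-category confusion probabilities: observed[j] = Σᵢ true[i] · P(coded j | truly i). The numbers below are illustrative only, not the paper's estimated matrices:

```python
def apply_misclassification(true_counts, confusion):
    """Expected observed category counts after coders misclassify:
    observed[j] = sum over i of true[i] * P(coded j | truly i)."""
    cats = list(true_counts)
    observed = {j: 0.0 for j in cats}
    for i in cats:
        for j in cats:
            observed[j] += true_counts[i] * confusion[i][j]
    return observed

# Toy 3-category scheme (R = right, L = left, O = other), with
# modest leakage between categories -- hypothetical rates.
true_counts = {"R": 30, "L": 10, "O": 60}
confusion = {
    "R": {"R": 0.85, "L": 0.05, "O": 0.10},
    "L": {"R": 0.05, "L": 0.85, "O": 0.10},
    "O": {"R": 0.05, "L": 0.05, "O": 0.90},
}
obs = apply_misclassification(true_counts, confusion)
rile_true = (true_counts["R"] - true_counts["L"]) / 100
rile_obs = (obs["R"] - obs["L"]) / 100
print(rile_true, round(rile_obs, 3))  # 0.2 0.16
```

Even symmetric misclassification attenuates the scale toward zero (0.2 shrinks to 0.16 here), which is why single-coded manifestos can yield systematically biased left-right positions.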