A Model-Assisted Approach for Finding Coding Errors in Manual Coding of Open-Ended Questions

He, Zhoushanyue; Schonlau, Matthias

doi:10.1093/jssam/smab022

Cited by 6 publications

(4 citation statements)

References 8 publications


“…c) The multi-label algorithms we have shown use SVMs as the base learner. However, we know that other algorithms such as gradient boosting and random forest perform similarly when classifying answers to open-ended questions (He and Schonlau, 2022; Gweon and Schonlau, 2023).…”

Section: Discussion (mentioning)

confidence: 99%

Automatic Classification of Open-Ended Questions: Check-All-That-Apply Questions

Schonlau

Gweon

Wenemark

2019

Social Science Computer Review


Text data from open-ended questions in surveys are challenging to analyze and are often ignored. Open-ended questions are important though because they do not constrain respondents’ answers. Where open-ended questions are necessary, often human coders manually code answers. When data sets are large, it is impractical or too costly to manually code all answer texts. Instead, text answers can be converted into numerical variables, and a statistical/machine learning algorithm can be trained on a subset of manually coded data. This statistical model is then used to predict the codes of the remainder. We consider open-ended questions where the answers are coded into multiple labels (all-that-apply questions). For example, in the open-ended question in our Happy example respondents are explicitly told they may list multiple things that make them happy. Algorithms for multilabel data take into account the correlation among the answer codes and may therefore give better prediction results. For example, when giving examples of civil disobedience, respondents talking about “minor nonviolent offenses” were also likely to talk about “crimes.” We compare the performance of two different multilabel algorithms (random k-labelsets [RAKEL], classifier chains [CC]) to the default method of binary relevance (BR), which applies single-label algorithms to each code separately. Performance is evaluated on data from three open-ended questions (Happy, Civil Disobedience, and Immigrant). We found weak bivariate label correlations in the Happy data (90th percentile: 7.6%), and stronger bivariate label correlations in the Civil Disobedience (90th percentile: 17.2%) and Immigrant (90th percentile: 19.2%) data. For the data with stronger correlations, we found that both multilabel methods performed substantially better than BR under 0/1 loss (“at least one label is incorrect”) but made little difference under Hamming loss (average error). For data with weak label correlations, we found no difference in performance between multilabel methods and BR. We conclude that automatic classification of open-ended questions that allow multiple answers may benefit from using multilabel algorithms for 0/1 loss. The degree of correlations among the labels may be a useful prognostic tool.
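The two loss functions named in this abstract can behave very differently on the same predictions. The sketch below is an illustration only, not the authors' code: the function names and the toy label matrices are hypothetical, with each row representing one answer and each column one code (1 = code assigned).

```python
from typing import Sequence


def zero_one_loss(y_true: Sequence[Sequence[int]],
                  y_pred: Sequence[Sequence[int]]) -> float:
    """Share of answers for which at least one label is incorrect."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if list(t) != list(p))
    return wrong / len(y_true)


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Share of individual label decisions that are incorrect (average error)."""
    n_labels = len(y_true[0])
    errors = sum(ti != pi
                 for t, p in zip(y_true, y_pred)
                 for ti, pi in zip(t, p))
    return errors / (len(y_true) * n_labels)


# Hypothetical toy data: 4 answers, 3 codes each.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

# Two of the four answers contain an error, so the 0/1 loss is 0.5.
print(zero_one_loss(y_true, y_pred))
# Only 2 of the 12 label decisions are wrong, so the Hamming loss is 2/12.
print(hamming_loss(y_true, y_pred))
```

Because 0/1 loss penalizes an answer as soon as any single code is wrong, a method that gets correlated codes right together can improve it considerably while leaving the per-label Hamming loss nearly unchanged.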


“…Many statistical learning algorithms are now available in statistical software like R and Python, and it is not possible to give a complete overview here (see e.g., Hao and Ho, 2019, for a Python overview). However, we do want to point to some of the most popular choices that have been applied to classifying answers to open-ended questions: these include tree-based methods like random forests and boosting (Schonlau and Couper, 2016; Kern et al, 2019; Schierholz and Schonlau, 2021), support vector machines (SVM) (Joachims, 2001; Bullington et al, 2007; He and Schonlau, 2020, 2021; Khanday et al, 2021), multinomial regression (Schierholz and Schonlau, 2021) and naïve Bayes classifiers (Severin et al, 2017; Paudel et al, 2018).…”

Section: Survey Motivation in the GESIS Panel (mentioning)

confidence: 99%

The semi-automatic classification of an open-ended question on panel survey motivation and its application in attrition analysis

Haensch

Weiß

Steins

et al. 2022

Front. Big Data


In this study, we demonstrate how supervised learning can extract interpretable survey motivation measurements from a large number of responses to an open-ended question. We manually coded a subsample of 5,000 responses to an open-ended question on survey motivation from the GESIS Panel (25,000 responses in total); we utilized supervised machine learning to classify the remaining responses. We can demonstrate that the responses on survey motivation in the GESIS Panel are particularly well suited for automated classification, since they are mostly one-dimensional. The evaluation of the test set also indicates very good overall performance. We present the pre-processing steps and methods we used for our data, and by discussing other popular options that might be more suitable in other cases, we also generalize beyond our use case. We also discuss various minor problems, such as a necessary spelling correction. Finally, we can showcase the analytic potential of the resulting categorization of panelists' motivation through an event history analysis of panel dropout. The analytical results allow a close look at respondents' motivations: they span a wide range, from the urge to help to interest in questions or the incentive and the wish to influence those in power through their participation. We conclude our paper by discussing the re-usability of the hand-coded responses for other surveys, including similar open questions to the GESIS Panel question.


“…A common practice in analyzing qualitative data is to develop a coding scheme or framework to analyze data, train research assistants (RAs) to apply the framework, ensure sufficient inter-rater reliability, and then have RAs analyze the data [1][2][3]. Some researchers also discuss the use of machine learning or artificial intelligence to help throughout the qualitative data analysis process as another pathway to analyzing data [4][5][6][7][8][9]. This paper explores the idea of developing and using a computer program to assist in coding open-ended survey responses.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, they did find correlations between the auto- and human-coded results, suggesting that they both found the same responses easy or hard to code. The same authors have also attempted semi-automated coding methods to improve accuracy using machine learning to identify and code easy responses, leaving the more difficult ones to be manually coded [11] or to identify responses in a dataset with a high probability of error for further analysis via double-coding [8].…”

Section: Introduction (mentioning)

confidence: 99%

Developing a Program to Assist in Qualitative Data Analysis: How Engineering Students’ Discuss Model Types

Rodgers

Verleger

Marbouti

et al.

2022 ASEE Annual Conference & Exposition Proceedings


This research paper discusses the opportunities that a computer program can present for analyzing large amounts of qualitative data collected through a survey tool. When working with longitudinal qualitative data, researchers face many challenges. The coding scheme may evolve over time, requiring re-coding of early data. There may be long periods of time between data analyses. Typically, multiple researchers participate in the coding, but this may introduce bias or inconsistencies. Ideally the same researchers would analyze all of the data, but often there is some turnover in the team, particularly when students assist with the coding. Computer programs can enable automated or semi-automated coding, helping to reduce errors and inconsistencies in the coded data. In this study, a modeling survey was developed to assess student awareness of model types and administered in four first-year engineering courses across three universities over the span of three years. The data collected from this survey consist of over 4,000 students' open-ended responses to three questions about types of models in science, technology, engineering, and mathematics (STEM) fields. A coding scheme was developed to identify and categorize model types in student responses. Over two years, two undergraduate researchers analyzed a total of 1,829 students' survey responses after ensuring intercoder reliability was greater than 80% for each model category. However, with much data remaining to be coded, the research team developed a MATLAB program to automatically implement the coding scheme and identify the types of models students discussed in their responses. MATLAB-coded results were compared to human-coded results (n = 1,829) to assess reliability; results matched between 81% and 99% for the different model categories. Furthermore, the reliability of the MATLAB-coded results is within the range of the interrater reliability measured between the two undergraduate researchers (86-100% for the five model categories). With good reliability of the program, all 4,358 survey responses were coded; results showing the number and types of models identified by students are presented in the paper.
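The per-category match rates reported in this abstract can be read as simple percent agreement between program-coded and human-coded responses. The sketch below assumes that reading. It is written in Python rather than the MATLAB used in the study, and the function name, category names, and toy data are all hypothetical.

```python
from typing import Dict, List


def percent_agreement(human: List[Dict[str, int]],
                      program: List[Dict[str, int]]) -> Dict[str, float]:
    """Per-category share of responses where both codings agree.

    Each response is a dict mapping a category name to 0/1
    (1 = category identified in the response).
    """
    n = len(human)
    return {
        category: sum(h[category] == p[category]
                      for h, p in zip(human, program)) / n
        for category in human[0]
    }


# Hypothetical toy data: 5 responses coded on 2 model categories.
human_codes = [
    {"physical": 1, "mathematical": 0},
    {"physical": 0, "mathematical": 1},
    {"physical": 1, "mathematical": 1},
    {"physical": 0, "mathematical": 0},
    {"physical": 1, "mathematical": 0},
]
program_codes = [
    {"physical": 1, "mathematical": 0},
    {"physical": 0, "mathematical": 1},
    {"physical": 1, "mathematical": 0},  # program misses "mathematical" here
    {"physical": 0, "mathematical": 0},
    {"physical": 1, "mathematical": 0},
]

agreement = percent_agreement(human_codes, program_codes)
print(agreement)
```

Simple agreement of this kind is not corrected for chance. A chance-corrected statistic such as Cohen's kappa would be an alternative, but the abstract does not say which measure was used.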
