2023
DOI: 10.1177/23821205231204178
Examining the Threat of ChatGPT to the Validity of Short Answer Assessments in an Undergraduate Medical Program

Leo Morjaria,
Levi Burns,
Keyna Bracken
et al.

Abstract: OBJECTIVES ChatGPT is an artificial intelligence model that can interpret free-text prompts and return detailed, human-like responses across a wide domain of subjects. This study evaluated the extent of the threat posed by ChatGPT to the validity of short-answer assessment problems used to examine pre-clerkship medical students in our undergraduate medical education program. METHODS Forty problems used in prior student assessments were retrieved and stratified by levels of Bloom's Taxonomy. Thirty of these pro…

Cited by 7 publications (6 citation statements)
References 30 publications
“…The frequency of changes between scoring categories (34-57%) suggests that relying solely on AI-based grading could sometimes overlook nuances that would be critical in a medical educational context. However, while not formally tested in this study, there also exists some level of inter-rater variability with independent human tutors; our group formally investigated in a previous work and found a Cronbach alpha value of 0.816 for a team of six human assessors on past student-generated CAE responses [29].…”
Section: Discussion
confidence: 85%
“…This poses comprehension challenges for students with low abilities (Guo & Wang, 2023). In addition, some teachers noticed that ChatGPT might use different evaluation criteria from their own, and its lack of specific knowledge about the class and students could lead to inappropriate feedback (Morjaria et al, 2023). These limitations indicated that although ChatGPT seemed to be powerful, it could not replace teacher feedback.…”
Section: Teacher Beliefs About Assessment
confidence: 99%
“…For example, GenAI can be used to advance writing such as proofreading, critique, and editing (Currie et al, 2023), create personalized assessments, and simulate conversations (Cheung et al, 2023;Currie et al, 2023). Therefore, the teachers were encouraged to use alternative grading practices, like incorporating nontraditional, authentic assessments that are difficult for AI to replicate without prompting (Chaudhry et al, 2023;Fuchs et al, 2023;Morjaria et al, 2023;Overono & Ditta, 2023;Perkins, 2023). In other words, the students' language can be self-assessed through GenAI, while the students' ideas and logical thinking can be assessed by teachers.…”
Section: Balancing GenAI and Human Assessment
confidence: 99%
“…ChatGPT has been compared to human raters in terms of grading short-answer pre-clerkship medical questions. The ChatGPT-human Spearman correlations for a single assessor ranged from 0.6 to 0.7 [12].…”
Section: Introduction
confidence: 99%