2023
DOI: 10.2196/48978
Performance of ChatGPT on the Situational Judgement Test—A Professional Dilemmas–Based Examination for Doctors in the United Kingdom

Abstract: Background ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors. Objective We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. Thi…

Cited by 18 publications (10 citation statements). References 14 publications.

Citation statements (ordered by relevance):
“…A plausible explanation for this discrepancy can be related to different question styles and different exam settings. Taken together, this highlights the need to assess the performance of AI-based models in various disciplines, using different question formats, and comparing them with human performance (Borchert et al, 2023; Chen et al, 2023; Deiana et al, 2023; Flores-Cohaila et al, 2023; Puladi et al, 2023). Finally, it is important to acknowledge the limitations inherent in this study.…”
Section: Discussion (mentioning)
Confidence: 91%
“…In many cases, the LLM responses scored at or above the 90th percentile, compared to human test takers. Independent research teams have subsequently examined the performance of LLMs in relation to, for example, medical situational judgment tests (Borchert et al, 2023), medical knowledge and ‘soft skills’ assessments (Brin et al, 2023), and both single-stimulus and forced-choice personality assessments (Phillips & Robie, 2024). These studies’ findings align with OpenAI’s report: advanced LLMs out-score most human test takers on several types of tests.…”
Section: The Performance of Large Language Models on Quantitative and… (mentioning)
Confidence: 72%
“…Published research suggests that LLMs often achieve high scores on tests that comprise extensive verbal information. As noted above, advanced LLMs appear to achieve higher scores than the majority of humans on many knowledge-based assessments (OpenAI, 2023), certain situational judgement tests (Arctic Shores, 2023b; Borchert et al, 2023), and personality questionnaires (if prompted to; Arctic Shores, 2023a; Phillips & Robie, 2024). Further, Elyoseph et al (2023) found that ChatGPT (GPT-3.5) outperformed most humans on an emotional awareness assessment comprising items with text descriptions of situations that required participants to identify an emotional state.…”
Section: The Test-Taking Capabilities of Large Language Models (mentioning)
Confidence: 85%
“…The second category involves the assessment of ChatGPT’s knowledge accuracy through testing, including examinations such as the United States Medical Licensing Examination, the Situational Judgement Test, and subject tests in medical school [8-12]. In this study, the majority of the students who responded regarding the feedback provided by ChatGPT stated that it demonstrated a high degree of accuracy.…”
Section: Discussion (mentioning)
Confidence: 99%