2023
DOI: 10.3389/fonc.2023.1219326

Evaluating large language models on a highly-specialized topic, radiation oncology physics

Jason Holmes,
Zhengliang Liu,
Lian Zhang
et al.

Abstract: Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition t…
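
The abstract describes posing specialized exam-style questions to LLMs and scoring the answers. Purely as an illustration (not the authors' actual pipeline), here is a minimal sketch of such a multiple-choice evaluation loop, assuming the OpenAI Python client; the model name, the questions.json file, and the letter-grading scheme are hypothetical.

```python
# Minimal sketch of a multiple-choice LLM evaluation loop.
# Assumptions (not from the paper): the OpenAI Python client, a JSON file
# of questions with fields "question", "choices", and "answer", and simple
# exact-match grading of the returned letter.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, choices: list[str]) -> str:
    """Pose one multiple-choice question and return the model's letter answer."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt = (
        "Answer the following radiation oncology physics question.\n"
        f"{question}\n{options}\n"
        "Reply with the single letter of the best answer."
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()[:1].upper()

with open("questions.json") as f:  # hypothetical exam file
    exam = json.load(f)

correct = sum(ask(q["question"], q["choices"]) == q["answer"] for q in exam)
print(f"Accuracy: {correct}/{len(exam)}")
```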

Cited by 51 publications (20 citation statements)
References 21 publications
“…These comparisons highlighted the potential of ChatGPT in higher educational assessments; nevertheless, it showed the importance of ongoing refinements of these models and the dangers of inaccuracies it poses (Lo, 2023; Sallam, 2023; Sallam et al., 2023d; Gill et al., 2024). However, making direct comparisons across variable studies can be challenging due to differences in models implemented, subject fields of the exams, test dates, and the exact approaches of prompt construction (Holmes et al., 2023; Huynh Linda et al., 2023; Meskó, 2023; Oh et al., 2023; Skalidis et al., 2023; Yaa et al., 2023).…”
Section: Discussion (mentioning)
confidence: 99%
“…Because of the inherent nature of their learning, LLMs predict the next token (word or phrase), which may or may not always be factually true. Despite these constraints, recent experiments with ChatGPT taking standardised tests have yielded remarkable results [34][35][36]. This demonstrated that ChatGPT, and LLMs in general, have the emergent ability to perform critical reasoning and answer complex questions.…”
Section: LLM as a Decision Support Tool (mentioning)
confidence: 99%
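
The quoted passage rests on the fact that a causal language model simply predicts the most likely next token, true or not. Purely as an illustration (the library and model choice, Hugging Face transformers with GPT-2, are assumptions and not from the cited works), a minimal sketch of greedy next-token generation:

```python
# Illustrative sketch of greedy next-token prediction, the mechanism the
# quoted passage refers to. Library and model (Hugging Face transformers,
# GPT-2) are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The absorbed dose is measured in"
ids = tokenizer(text, return_tensors="pt").input_ids

for _ in range(5):  # append five tokens, one at a time
    with torch.no_grad():
        logits = model(ids).logits          # scores for every vocabulary token
    next_id = logits[0, -1].argmax()        # greedy: take the single most likely token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
# The model always emits the most probable continuation, whether or not
# that continuation is factually true.
```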
“…For instance, ChatGPT has shown remarkable accuracy in reasoning questions and medical exams [43,44], even successfully passing the Chinese Medical Licensing Exam [45] and the United States Medical Licensing Examination (USMLE) [46]. It also performed well in addressing radiation oncology physics exam questions [47]. Likewise, "ChatGPT would have been at the 87th percentile of Bunting's 2013 international cohort for the Cardiff Fertility Knowledge Scale and at the 95th percentile on the basis of Kudesia's 2017 cohort for the Fertility and Infertility Treatment Knowledge Score" [48].…”
Section: Medical Knowledge Inquiry (mentioning)
confidence: 99%