2022
DOI: 10.1101/2022.12.19.22283643
Preprint

Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models

Abstract: We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and pot…

Cited by 176 publications (170 citation statements)
References 19 publications (12 reference statements)
“…They see "potential applications of ChatGPT as a medical education tool" (Gilson et al, 2022). Kung et al (2022) also tested ChatGPT on the USMLE and arrived at similar results and conclusions. Bommarito & Katz (2022) found earlier that GPT-3 was able to pass a U.S. Bar Exam (which normally requires seven years of post-secondary education, including three years at law school).…”
Section: Methods and Literature Review
confidence: 70%
“…Only a fraction of these random tests is discussed in the next section. Unlike other recent academic articles and editorials (King & ChatGPT, 2023; Kung et al, 2022; O'Connor & ChatGPT, 2023), ChatGPT is not a co-author of our article, and we used the chatbot only very sparingly for brainstorming.…”
Section: Methods and Literature Review
confidence: 99%
“…We suggest clear disclosure when a manuscript is written with assistance from ChatGPT; 26 some have even included it as a co-author. 27 Reassuringly, there are patterns that allow it to be detected by AI output detectors. Though there is ongoing work to embed watermarks in output, until this is standardized and robust against scrubbing, we suggest running journal and conference abstract submissions through AI output detectors as part of the research editorial process to protect from targeting by organizations such as paper mills.…”
Section: Discussion
confidence: 99%
“…For example, when given a mixture of original and ChatGPT-generated medical scientific abstracts, blinded medical researchers could identify only 68% of the ChatGPT-generated abstracts as fabricated [6]. Other research evaluated the performance of ChatGPT on the United States Medical Licensing Exam (USMLE) and found that the tool performed near or at the passing threshold [7]. Taken together, there is anecdotal evidence that ChatGPT generates content that is very similar to and can hardly be discriminated from human-generated content.…”
Section: Related Literature
confidence: 99%