Prior work has shown that large language models like GPT-4 and Med-PaLM 2 can answer sample questions from the USMLE Step 2 Clinical Knowledge (CK) exam with greater than 80% accuracy. But can these generative AI models create USMLE-like exam questions? This capability could help humans write such exams or prepare for them. Here we assess the ability of GPT-4 to generate realistic exam questions by asking licensed physicians to (1) distinguish AI-generated questions from genuine USMLE Step 2 CK questions, and (2) assess the validity of AI-generated questions and answers. We find that GPT-4 can generate question/answer pairs that are largely indistinguishable from human-generated ones, with a majority (64%) deemed “valid” by a panel of licensed physicians.
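For illustration only, the snippet below is a minimal sketch of how a USMLE-style question/answer pair could be requested from GPT-4 through the OpenAI Python SDK. The prompt wording, model settings, and output format here are assumptions, not the exact setup used in the study.

```python
# Hypothetical sketch: prompting GPT-4 for a USMLE Step 2 CK-style item.
# The prompt and parameters are illustrative assumptions, not the study's pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Write one USMLE Step 2 CK-style multiple-choice question: a clinical vignette, "
    "five answer options labeled A-E, the correct answer letter, and a one-sentence explanation."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are an experienced medical exam item writer."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.7,
)

# Print the generated vignette, options, answer, and explanation.
print(response.choices[0].message.content)
```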
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.