BACKGROUND
Case studies have shown that ChatGPT can run clinical simulations at the medical student level. However, no studies have assessed ChatGPT’s reliability in meeting desired simulation criteria such as medical accuracy, simulation formatting, and robust feedback mechanisms.
OBJECTIVE
To quantify ChatGPT’s ability to consistently follow formatting instructions and create simulations for preclinical medical student learners according to principles of medical simulation and multimedia educational technology.
METHODS
Using ChatGPT-4 and a prevalidated starting prompt, the authors ran 360 separate simulations of an acute asthma exacerbation. In 180 simulations the user supplied correct answers, and in the other 180 the user supplied incorrect answers. ChatGPT was evaluated on its adherence to basic simulation parameters (stepwise progression, free response, interactivity), advanced simulation parameters (autonomous conclusion, delayed feedback, comprehensive feedback), and medical accuracy (vignette, treatment updates, feedback). Significance was determined with chi-squared tests, and odds ratios were reported with 95% confidence intervals.
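The analysis above (a chi-squared test on a 2x2 table plus an odds ratio with a Wald 95% confidence interval) can be sketched in a few lines of standard-library Python. The per-arm counts below are illustrative reconstructions from the reported percentages (87% and 24% of 180 simulations delaying feedback), not figures stated in the abstract:

```python
from math import erfc, exp, log, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 df, no continuity correction) for the
    2x2 table [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to the complementary error function: P(X > x) = erfc(sqrt(x/2)).
    p = erfc(sqrt(stat / 2))
    return stat, p

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald (log-scale) confidence interval;
    z=1.96 gives the 95% interval."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    return or_, (exp(log(or_) - z * se), exp(log(or_) + z * se))

# Assumed counts: delayed feedback in 156/180 Correct-arm and
# 43/180 Incorrect-arm simulations (matching the reported 87% vs 24%).
stat, p = chi2_2x2(156, 24, 43, 137)          # p << 0.001
or_, (lo, hi) = odds_ratio_ci(156, 24, 43, 137)
```

This is a minimal uncorrected Pearson test for illustration; the study's own analysis software and any continuity or multiplicity corrections are not specified in the abstract.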
RESULTS
All simulations (100%) met the basic simulation parameters and were medically accurate. Among the advanced parameters, 55% of all simulations delayed feedback; the Correct arm delayed feedback significantly more often than the Incorrect arm (87% vs 24%; p<0.001). Overall, 79% of simulations concluded autonomously, with no difference between the Correct and Incorrect arms (81% vs 77%; p=0.364), and 78% gave comprehensive feedback, again with no difference between arms (76% vs 81%; p=0.306). ChatGPT-4 was significantly more likely to conclude simulations autonomously (p<0.001) and to provide comprehensive feedback (p<0.001) when feedback was delayed than when it was not.
CONCLUSIONS
ChatGPT simulations performed perfectly on medical accuracy and the basic simulation parameters, and performed well on comprehensive feedback and autonomous conclusion. Whether feedback was delayed depended on the accuracy of user inputs. A simulation that met one advanced parameter was more likely to meet the others. These simulations have the potential to be a reliable educational tool for simple scenarios and can be evaluated with a novel nine-part metric. Further work is needed to ensure consistent performance across a broader range of simulation scenarios.