mation within token constraints. Longer responses do not necessarily improve accuracy or retrieve better information. As Schulte mentions, we designed our prompts to understand the implications for patient self-education, so our straightforward, broad prompts were appropriate for this intended use.1 Additional prompt engineering might improve results; however, it would not represent the intended use that we were studying.

The best way to improve LLM factuality remains an open question, and prompt engineering is not necessarily the best approach. Prompt design is currently more of an art than a science and can be challenging, time-consuming, and costly.2,3 In the future, automated approaches to normalizing prompts and soft prompting may alleviate some of these issues.4,5 Developing more specialized clinical language models, linking to vetted knowledge sources, and fact-checking across different LLM instances are just a few other potential avenues toward more factual question-answering systems.

Robustness to minor alterations in prompts is a major issue in evaluating these LLMs and poses safety concerns. This is further compounded by a lack of transparency from OpenAI regarding the data and methods used to train and evaluate their models, and by the fact that model weights and settings may be updated without notice when the models are accessed via the browser interface. Transparency and reproducibility will be key to effective and safe implementation. In our study, the LLM was accessed only via the application programming interface,6 allowing greater control over settings and reliable reporting of which model was used. In addition, we make all of our data, including prompts and scores, publicly available, alongside clear definitions of the criteria used in our evaluation.1

We propose that these steps become standard for studies evaluating LLMs, ensuring transparent and reproducible evaluation that will form the foundation for safe clinical implementation. We believe our study contributes to the growing body of research on LLM performance in medicine by investigating both their strengths and their deficiencies. Understanding this totality is the only way to integrate these models into clinical practice meaningfully and safely.