The shaky foundations of large language models and foundation models for electronic health records

Wornow, Michael; Xu, Yizhe; Thapa, Rahul; Patel, Birju; Steinberg, Ethan; Fleming, Scott L.; Pfeffer, Michael; Fries, Jason; Shah, Nigam H.

doi:10.1038/s41746-023-00879-8

Cited by 84 publications

(42 citation statements)

References 74 publications

(78 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The purported benefits need to be defined and evaluations conducted to verify such benefits. 8 Only after these evaluations are completed should statements be allowed such as an LLM was used for a defined task in this specific workflow, it measured a metric, and observed an improvement (or deterioration) in a prespecified outcome. Such evaluations also are necessary to clarify the medicolegal risks that might occur with the use of LLMs to guide medical care, 11 and to identify mitigation strategies for the models' tendency to generate factually incorrect outputs that are probabilistically plausible (called hallucinations).…”

Section: Are the Purported Value Propositions Of Using Llms In Medici...mentioning

confidence: 99%

Creation and Adoption of Large Language Models in Medicine

Shah,

Entwistle,

Pfeffer

2023

JAMA

Self Cite

128

View full text Add to dashboard Cite

ImportanceThere is increased interest in and potential benefits from using large language models (LLMs) in medicine. However, by simply wondering how the LLMs and the applications powered by them will reshape medicine instead of getting actively involved, the agency in shaping how these tools can be used in medicine is lost.ObservationsApplications powered by LLMs are increasingly used to perform medical tasks without the underlying language model being trained on medical records and without verifying their purported benefit in performing those tasks.Conclusions and RelevanceThe creation and use of LLMs in medicine need to be actively shaped by provisioning relevant training data, specifying the desired benefits, and evaluating the benefits via testing in real-world deployments.

show abstract

Section: Are the Purported Value Propositions Of Using Llms In Medici...mentioning

confidence: 99%

Creation and Adoption of Large Language Models in Medicine

Shah,

Entwistle,

Pfeffer

2023

JAMA

Self Cite

128

View full text Add to dashboard Cite

show abstract

“…New and revolutionary technologies are often met with excitement about their many potential uses, leading to widespread and often unfocussed experimentation across different healthcare applications. Thus, as expected, the performance of LLMs in real-world healthcare settings too, remains inconsistently conducted and evaluated 11 12 . For instance, Cadamuro et al assessed ChatGPT-4’s diagnostic ability by evaluating relevance, correctness, helpfulness, and safety, finding responses to be generally superficial and sometimes inaccurate, lacking in helpfulness and safety 13 .…”

Section: Introductionmentioning

confidence: 99%

A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)

Bedi,

Liu,

Orr-Ewing

et al. 2024

Preprint

Self Cite

View full text Add to dashboard Cite

Importance: Large Language Models (LLMs) can assist in a wide range of healthcare-related activities. Current approaches to evaluating LLMs make it difficult to identify the most impactful LLM application areas. Objective: To summarize the current evaluation of LLMs in healthcare in terms of 5 components: evaluation data type, healthcare task, Natural Language Processing (NLP)/Natural Language Understanding (NLU) task, dimension of evaluation, and medical specialty. Data Sources: A systematic search of PubMed and Web of Science was performed for studies published between 01-01-2022 and 02-19-2024. Study Selection: Studies evaluating one or more LLMs in healthcare. Data Extraction and Synthesis: Three independent reviewers categorized 519 studies in terms of data used in the evaluation, the healthcare tasks (the what) and the NLP/NLU tasks (the how) examined, the dimension(s) of evaluation, and the medical specialty studied. Results: Only 5% of reviewed studies utilized real patient care data for LLM evaluation. The most popular healthcare tasks were assessing medical knowledge (e.g. answering medical licensing exam questions, 44.5%), followed by making diagnoses (19.5%), and educating patients (17.7%). Administrative tasks such as assigning provider billing codes (0.2%), writing prescriptions (0.2%), generating clinical referrals (0.6%) and clinical notetaking (0.8%) were less studied. For NLP/NLU tasks, the vast majority of studies examined question answering (84.2%). Other tasks such as summarization (8.9%), conversational dialogue (3.3%), and translation (3.1%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias and toxicity (15.8%), robustness (14.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in internal medicine (42%), surgery (11.4%) and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%) and medical genetics (0.2%) being the least represented. Conclusions and Relevance: Existing evaluations of LLMs mostly focused on accuracy of question answering for medical exams, without consideration of real patient care data. Dimensions like fairness, bias and toxicity, robustness, and deployment considerations received limited attention. To draw meaningful conclusions and improve LLM adoption, future studies need to establish a standardized set of LLM applications and evaluation dimensions, perform evaluations using data from routine care, and broaden testing to include administrative tasks as well as multiple medical specialties. Keywords: Large Language Models, Generative Artificial Intelligence, Healthcare, Dimensions of Evaluation, Evaluation Metrics.

show abstract

“…Despite these promising advances, research has yet to systematically develop a simple yet effective framework for learning high-quality representations crucial for robust cell clustering. The learning of high-quality representations is built upon the success of pretraining generalizable models, which aligns with the promise of the current foundation models (i.e., a series of large-scale pretrained models that can be applied on various downstream use cases and tasks) [19, 36, 20]. The foundation model has been instrumental in our understanding of the role of deep learning in the biological context.…”

Section: Introductionmentioning

confidence: 99%

Graph Contrastive Learning as a Versatile Foundation for Advanced scRNA-seq Data Analysis

Zhang,

Liu,

Xiao

et al. 2024

Preprint

View full text Add to dashboard Cite

Single-cell RNA sequencing (scRNA-seq) offers unprecedented insights into transcriptome-wide gene expression at the single-cell level. Cell clustering has been long established in the analysis of scRNA-seq data to identify the groups of cells with similar expression profiles. However, cell clustering is technically challenging, as raw scRNA-seq data have various analytical issues, including high dimensionality and dropout values. Existing research has developed deep learning models, such as graph machine learning models and contrastive learning-based models, for cell clustering using scRNA-seq data and has summarized the unsupervised learning of cell clustering into a human-interpretable format. While advances in cell clustering have been profound, we are no closer to finding a simple yet effective framework for learning high-quality representations necessary for robust clustering. In this study, we propose scSimGCL, a novel framework based on the graph contrastive learning paradigm for self-supervised pretraining of graph neural networks. This framework facilitates the generation of high-quality representations crucial for cell clustering. Our scSimGCL incorporates cell-cell graph structure and contrastive learning to enhance the performance of cell clustering. Extensive experimental results on simulated and real scRNA-seq datasets suggest the superiority of the proposed scSimGCL. Moreover, clustering assignment analysis confirms the general applicability of scSimGCL, including state-of-the-art clustering algorithms. Further, ablation study and hyperparameter analysis suggest the efficacy of our network architecture with the robustness of decisions in the self-supervised learning setting. The proposed scSimGCL can serve as a foundation model for practitioners developing tools for cell clustering. The source code of scSimGCL is publicly available at https://github.com/zhangzh1328/scSimGCL.

show abstract

The shaky foundations of large language models and foundation models for electronic health records

Cited by 84 publications

References 74 publications

Creation and Adoption of Large Language Models in Medicine

Creation and Adoption of Large Language Models in Medicine

A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)

Graph Contrastive Learning as a Versatile Foundation for Advanced scRNA-seq Data Analysis

Contact Info

Product

Resources

About