2024
DOI: 10.1038/s41597-023-02854-0

Open source and reproducible and inexpensive infrastructure for data challenges and education

Peter E. DeWitt, Margaret A. Rebull, Tellen D. Bennett

Abstract: Data sharing is necessary to maximize the actionable knowledge generated from research data. Data challenges can encourage secondary analyses of datasets. Data challenges in biomedicine often rely on advanced cloud-based computing infrastructure and expensive industry partnerships. Examples include challenges that use Google Cloud virtual machines and the Sage Bionetworks Dream Challenges platform. Such robust infrastructures can be financially prohibitive for investigators without substantial resources. Given…

Cited by 1 publication (1 citation statement). References: 15 publications.
“…We noticed that Microsoft Azure GPT-4 outperformed AWS EC2 Llama 2 in terms of price, execution speed, and accuracy. However, as an open-source model, Llama 2 may have better reproducibility 37, while GPT-4 may provide slightly different answers over time due to model updates from OpenAI 38. Secondly, we tested several prompting strategies and found that performing error analysis on some training data, then revising the prompt to ask the LLM to avoid the summarized common errors, can be an effective way to improve the LLM's performance.…”
Section: Discussion
confidence: 99%
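The prompt-revision strategy described in that citation statement (run a baseline prompt over labeled training data, summarize the common errors, and fold that summary back into the prompt) can be sketched roughly as below. This is an illustrative Python sketch, not code from the cited work; call_llm and classify_error are hypothetical placeholders for the hosted model endpoint (e.g. Azure GPT-4 or an EC2-hosted Llama 2) and for the error-labeling step.

```python
# Illustrative sketch of error-analysis-driven prompt revision.
# call_llm and classify_error are hypothetical placeholders, not APIs
# from the cited paper or any specific cloud provider.

from collections import Counter
from typing import Callable, Iterable, Tuple


def revise_prompt_with_error_analysis(
    base_prompt: str,
    training_data: Iterable[Tuple[str, str]],   # (input_text, expected_answer) pairs
    call_llm: Callable[[str], str],             # placeholder: prompt -> model answer
    classify_error: Callable[[str, str], str],  # placeholder: (predicted, expected) -> error label
    top_k: int = 3,
) -> str:
    """Return a revised prompt that asks the model to avoid its most common errors."""
    error_counts: Counter = Counter()

    # 1. Error analysis: run the baseline prompt on labeled training examples.
    for input_text, expected in training_data:
        predicted = call_llm(f"{base_prompt}\n\nInput:\n{input_text}")
        if predicted.strip() != expected.strip():
            error_counts[classify_error(predicted, expected)] += 1

    if not error_counts:
        return base_prompt  # nothing to revise

    # 2. Summarize the most frequent error types.
    common_errors = [label for label, _ in error_counts.most_common(top_k)]
    error_summary = "\n".join(f"- {label}" for label in common_errors)

    # 3. Revise the prompt so the model is told to avoid those errors.
    return (
        f"{base_prompt}\n\n"
        f"Common mistakes observed on similar inputs; avoid them:\n{error_summary}"
    )
```

Keeping the model call behind a generic callable is a deliberate choice in this sketch: the same revision loop could then be pointed at either a GPT-4 deployment or a self-hosted Llama 2 endpoint without changing the logic.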