2016
DOI: 10.1609/hcomp.v3i1.13266

Linguistic Wisdom from the Crowd

Abstract: Crowdsourcing for linguistic data typically aims to replicate expert annotations using simplified tasks. But an alternative goal — one that is especially relevant for research in the domains of language meaning and use — is to tap into people's rich experience as everyday users of language. Research in these areas has the potential to tell us a great deal about how language works, but designing annotation frameworks for crowdsourcing of this kind poses special challenges. In this paper we define and exemplify …

Cited by 9 publications (6 citation statements) · References 0 publications
“…We relied on a crowd (or ensemble) scoring strategy [19], where scores were averaged across evaluators for each exchange studied. This method is used when there is no ground truth in the outcome being studied, and the evaluated outcomes themselves are inherently subjective (e.g., judging figure skating, National Institutes of Health grants, concept discovery).…”
Section: Key Points (mentioning, confidence: 99%)
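As a concrete illustration of the ensemble (crowd) scoring idea described in this citing passage, here is a minimal Python sketch, assuming ratings are stored as per-item lists of scores; the function name `ensemble_score` and the sample data are hypothetical, not taken from the cited study.

```python
from statistics import mean

def ensemble_score(ratings_by_item: dict[str, list[float]]) -> dict[str, float]:
    """Average each item's ratings across all evaluators.

    Used when there is no ground truth: the 'score' for an item is
    defined as the mean of the individual (subjective) ratings.
    """
    return {item: mean(scores) for item, scores in ratings_by_item.items()}

# Example: three evaluators rate two exchanges on a 1-5 scale.
ratings = {
    "exchange_1": [4.0, 5.0, 4.0],
    "exchange_2": [2.0, 3.0, 3.0],
}
print(ensemble_score(ratings))  # {'exchange_1': ~4.33, 'exchange_2': ~2.67}
```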
“…Due to the subjectivity of rating accuracy and comprehensiveness, we employed an ensemble (or crowdsourcing) scoring strategy [32, 33], by averaging the ratings of the five reviewers for each response. This is comparable to a panel of judges averaging their scores for a performance.…”
Section: Methods (mentioning, confidence: 99%)
“…These scores were combined for a composite score that equally weighted the risks, benefits, alternatives, and overall impression subscores. We used an ensemble scoring strategy, averaging scores across reviewers for each response, as has been used previously in similar studies …”
Section: Methods (mentioning, confidence: 99%)
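This citing passage adds one extra step: each reviewer's subscores are first combined, with equal weight, into a composite, and only then are composites averaged across reviewers. A minimal sketch of that two-step scheme follows; the subscore names match the passage, but everything else (function names, scale, sample values) is illustrative, not taken from the study.

```python
from statistics import mean

SUBSCORES = ("risks", "benefits", "alternatives", "overall_impression")

def composite(review: dict[str, float]) -> float:
    # Equal weighting: the composite is the plain mean of the four subscores.
    return mean(review[key] for key in SUBSCORES)

def ensemble_composite(reviews: list[dict[str, float]]) -> float:
    # Ensemble step: average the per-reviewer composites for one response.
    return mean(composite(r) for r in reviews)

# Example: two reviewers score one response on a 1-5 scale.
reviews_for_response = [
    {"risks": 4, "benefits": 5, "alternatives": 3, "overall_impression": 4},
    {"risks": 3, "benefits": 4, "alternatives": 4, "overall_impression": 4},
]
print(ensemble_composite(reviews_for_response))  # 3.875
```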
“…We used an ensemble scoring strategy, averaging scores across reviewers for each response, as has been used previously in similar studies [24, 34]. A multidisciplinary group of surgeons, including acute care surgery, surgical critical care, vascular surgery, orthopedic surgery, and surgical oncology, reviewed the LLM-based chatbot- and surgeon-generated procedure-specific RBAs for accuracy and completeness using the process described. Reviewers were blinded to the source of the RBA; each response was scored by at least 2 individual reviewers.…”
Section: Measurements of Accuracy and Completeness (mentioning, confidence: 99%)