BACKGROUND
The global obesity epidemic continues to pose significant health challenges, with a growing number of individuals seeking metabolic and bariatric surgery (MBS), particularly laparoscopic sleeve gastrectomy (LSG). However, many MBS centers face resource constraints that limit adequate patient education, leaving knowledge gaps among patients. As a result, patients increasingly turn to online sources for information. Artificial intelligence (AI)-based chatbots such as ChatGPT offer a promising means of providing accessible medical information, yet concerns about the accuracy, reliability, and comprehensiveness of AI-generated responses remain to be addressed.
OBJECTIVE
This study aims to evaluate the effectiveness of AI-based chatbots in answering frequently asked patient questions about LSG and compare their performance with that of bariatric surgery experts.
METHODS
The expert group comprised four fellowship-trained minimally invasive surgeons (MISs), nine minimally invasive surgery fellows (MIFs), and two general practitioners (GPs) involved in the MBS multidisciplinary team. Seven AI chatbots (ChatGPT-3.5, ChatGPT-4, Bard, Bing, Claude, Llama, and Perplexity) were selected based on public availability. Forty patient questions about LSG were derived from social media, MBS organizations, and online patient forums. Experts and chatbots answered these questions, and their responses were scored for accuracy and comprehensiveness on a 5-point scale. Statistical analyses were performed to compare group performance.
RESULTS
Chatbots demonstrated a higher mean overall performance score (2.55 ± 0.95) than the expert group (1.92 ± 1.32; P < .001). Among the chatbots, ChatGPT-4 achieved the highest performance (2.94 ± 0.24) and Llama the lowest (2.15 ± 1.23). Within the expert group, MISs scored highest (2.36 ± 1.09), followed by GPs (1.90 ± 1.36) and MIFs (1.75 ± 1.36). The readability of chatbot responses was assessed using Flesch-Kincaid scores, which showed that most responses required a reading level between 11th grade and college. Chatbots also exhibited fair reliability and reproducibility in their responses, with ChatGPT-4 demonstrating the highest test-retest reliability.
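For context, the Flesch-Kincaid grade level cited above is the standard published readability formula based on sentence and word length (stated here for reference; the abstract itself does not reproduce it):

\[
\text{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]

A score of roughly 11 to 16 on this scale corresponds to the 11th-grade-through-college reading levels reported for most chatbot responses.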
CONCLUSIONS
AI-based chatbots can provide reliable and comprehensive answers to common patient questions about LSG and could play a significant role in patient education. However, limitations of AI, including high readability demands and the potential for misinformation, must be addressed to ensure effective integration into healthcare.