Background: Large language models such as ChatGPT can produce increasingly realistic text, but the accuracy and integrity of using these models in scientific writing are unknown. Methods: We gathered ten research abstracts from each of five high-impact-factor medical journals (n = 50) and asked ChatGPT to generate research abstracts based on their titles and journals. We evaluated the abstracts using an artificial intelligence (AI) output detector and a plagiarism detector, and had blinded human reviewers try to distinguish whether abstracts were original or generated. Results: All ChatGPT-generated abstracts were written clearly, but only 8% correctly followed the specific journal's formatting requirements. Most generated abstracts were detected by the AI output detector, with median [interquartile range] scores (higher meaning more likely to be generated) of 99.98% [12.73%, 99.98%], compared with 0.02% [0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored very high on originality using the plagiarism detector (100% [100%, 100%] originality). Generated abstracts had patient cohort sizes similar to those of the original abstracts, though the exact numbers were fabricated. When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT but incorrectly identified 14% of original abstracts as generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, but that the generated abstracts were vaguer and had a formulaic feel to the writing. Conclusion: ChatGPT writes believable scientific abstracts, though with completely generated data. These are original, with no plagiarism detected, but are often identifiable by an AI output detector and skeptical human reviewers.
Abstract evaluation for journals and medical conferences must adapt policy and practice to maintain rigorous scientific standards; we suggest inclusion of AI output detectors in the editorial process and clear disclosure if these technologies are used. The boundaries of ethical and acceptable use of large language models to help scientific writing remain to be determined.
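The AUROC of 0.94 reported above is simply the probability that a randomly chosen generated abstract receives a higher "fake" score from the detector than a randomly chosen original abstract. A minimal rank-based sketch of that computation (the scores below are hypothetical illustrations, not the study's data):

```python
def auroc(gen_scores, orig_scores):
    """Rank-based AUROC: fraction of (generated, original) score pairs
    in which the generated abstract scores higher; ties count as half."""
    wins = 0.0
    for g in gen_scores:
        for o in orig_scores:
            if g > o:
                wins += 1.0
            elif g == o:
                wins += 0.5
    return wins / (len(gen_scores) * len(orig_scores))

# Hypothetical detector outputs (% "fake"), not the study's data:
generated = [99.98, 99.97, 12.73, 99.98]
original = [0.02, 0.09, 0.02, 45.0]
print(auroc(generated, original))  # 0.9375 (15 of 16 pairs correctly ordered)
```

An AUROC of 1.0 would mean every generated abstract outscored every original one; the single inverted pair here (a low-scoring generated abstract versus a high-scoring original) is what pulls the value below 1.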
Triple-negative breast cancer accounted for 12% of breast cancers diagnosed in the United States from 2012 to 2016, with a 5-year survival 8% to 16% lower than hormone receptor–positive disease. However, preventive and screening strategies remain tailored to the demographics of less lethal luminal cancers. This review examines the ethnic, genetic, and modifiable risk factors associated with triple-negative breast cancer, which providers must recognize to address the societal disparities of this deadly disease. Most notable is that triple-negative cancers disproportionately affect African American women and carriers of germline BRCA and PALB2 mutations. Even controlling for treatment delays, stage, and socioeconomic factors, African Americans with triple-negative breast cancer remain nearly twice as likely to die of their disease. To level the playing field, we must integrate genomic predictors of disease and epidemiologic characteristics of molecular breast cancer subtypes to provide personalized risk assessment, screening, and treatment for each patient.
The Cancer Genome Atlas (TCGA) is one of the largest biorepositories of digital histology. Deep learning (DL) models have been trained on TCGA to predict numerous features directly from histology, including survival, gene expression patterns, and driver mutations. However, we demonstrate that these features vary substantially across tissue-submitting sites in TCGA for over 3,000 patients across six cancer subtypes. Additionally, we show that histologic image differences between submitting sites can easily be identified with DL. Site detection remains possible despite commonly used color normalization and augmentation methods, and we quantify the image characteristics constituting this site-specific digital histology signature. We demonstrate that these site-specific signatures lead to biased accuracy for prediction of features including survival, genomic mutations, and tumor stage. Furthermore, ethnicity can also be inferred from site-specific signatures, which must be accounted for to ensure equitable application of DL. These site-specific signatures can lead to overoptimistic estimates of model performance, and we propose a quadratic programming method that abrogates this bias by ensuring models are not trained and validated on samples from the same site.
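The correction the authors propose — never training and validating on slides from the same submitting site — can be illustrated with a simple group-level split. The greedy assignment below is a hypothetical stand-in for the paper's quadratic-programming approach, which additionally balances outcome labels across folds; all slide and site names are invented for illustration:

```python
from collections import defaultdict

def site_preserved_split(slide_to_site, val_fraction=0.25):
    """Assign entire submitting sites to either train or validation,
    so no site contributes slides to both partitions."""
    by_site = defaultdict(list)
    for slide, site in slide_to_site.items():
        by_site[site].append(slide)
    target = val_fraction * len(slide_to_site)
    train, val, val_n = [], [], 0
    # Visit larger sites first so the greedy fill tracks the target size.
    for site, slides in sorted(by_site.items(), key=lambda kv: -len(kv[1])):
        if val_n + len(slides) <= target:
            val.extend(slides)
            val_n += len(slides)
        else:
            train.extend(slides)
    return train, val

# Hypothetical slides tagged with their submitting site:
slides = {"s1": "A", "s2": "A", "s3": "B", "s4": "C", "s5": "C", "s6": "C"}
train, val = site_preserved_split(slides, val_fraction=0.34)
```

A conventional random split would scatter each site's slides across both partitions, letting the model exploit the site signature; the group-level constraint is what removes that shortcut.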
Importance: Postoperative chemoradiation is the standard of care for cancers with positive margins or extracapsular extension, but the benefit of chemotherapy is unclear for patients with other intermediate-risk features. Objective: To evaluate whether machine learning models could identify patients with intermediate-risk head and neck squamous cell carcinoma who would benefit from chemoradiation.
Importance: Given conflicting results regarding the prognosis of erb-b2 receptor tyrosine kinase 2 (ERBB2; formerly HER2 or HER2/neu)–low breast cancer, a large-scale, nationally applicable comparison of ERBB2-low vs ERBB2-negative breast cancer is needed. Objective: To investigate whether ERBB2-low breast cancer is a clinically distinct subtype in terms of epidemiological characteristics, prognosis, and response to neoadjuvant chemotherapy. Design/Participants/Setting: This retrospective cohort study was conducted using the National Cancer Database, including 1 136 016 patients in the US diagnosed with invasive breast cancer from January 1, 2010, to December 31, 2019, who had ERBB2-negative disease and had immunohistochemistry results available. ERBB2-low tumors were classified as having an immunohistochemistry score of 1+, or 2+ with a negative in situ hybridization test. Data were analyzed from November 1, 2021, through November 30, 2022. Exposures: Standard therapy according to routine clinical practice. Main Outcomes and Measures: The primary outcomes were overall survival (OS), reported as adjusted hazard ratios (aHRs), and pathologic complete response, reported as adjusted odds ratios (aORs), for ERBB2-negative vs ERBB2-low breast cancer, controlling for age, sex, race and ethnicity, Charlson-Deyo Comorbidity Index score, treatment facility type, tumor grade, tumor histology, hormone receptor status, and cancer stage. Results: The study identified 1 136 016 patients (mean [SD] age, 62.4 [13.1] years; 99.1% female; 78.6% non-Hispanic White), of whom 392 246 (34.5%) were diagnosed with ERBB2-negative and 743 770 (65.5%) with ERBB2-low breast cancer. The mean (SD) age was 62.1 (13.2) years in the ERBB2-negative group and 62.5 (13.0) years in the ERBB2-low group. Higher estrogen receptor expression was associated with increased rates of ERBB2-low disease (aOR, 1.15 per 10% increase).
Compared with non-Hispanic White patients, of whom 66.1% were diagnosed with ERBB2-low breast cancer, fewer non-Hispanic Black (62.8%) and Hispanic (61.0%) patients had ERBB2-low disease, although in non-Hispanic Black patients this was mediated by differences in rates of triple-negative disease and other confounders. A slightly lower rate of pathologic complete response was seen in patients with ERBB2-low disease vs patients with ERBB2-negative disease on multivariable analysis (aOR, 0.89; 95% CI, 0.86-0.92; P < .001). ERBB2-low status was also associated with small improvements in OS for stage III (aHR, 0.92; 95% CI, 0.89-0.96; P < .001) and stage IV (aHR, 0.91; 95% CI, 0.87-0.96; P < .001) triple-negative breast cancer, although this amounted to only a 2.0% (stage III) and 0.4% (stage IV) increase in 5-year OS. Conclusions and Relevance: This large-scale retrospective cohort analysis found minimal prognostic differences between ERBB2-low and ERBB2-negative breast cancer. These findings suggest that, moving forward, outcomes in ERBB2-low breast cancer will be driven by ERBB2-directed antibody-drug conjugates, rather than intrinsic differences in biological characteristics associated with low-level ERBB2 expression. These findings do not support the classification of ERBB2-low breast cancer as a unique disease entity.
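The gap between a "significant" hazard ratio and a small absolute survival difference follows from the proportional-hazards identity S1(t) = S0(t)^HR. A quick check with a hypothetical baseline (the 50% five-year OS used below is illustrative only, not a figure from the study) shows why an aHR near 0.92 shifts 5-year OS by only a few percentage points:

```python
def survival_under_hr(baseline_survival, hazard_ratio):
    """Proportional-hazards identity: S1(t) = S0(t) ** HR."""
    return baseline_survival ** hazard_ratio

# Hypothetical 5-year OS of 50% for the ERBB2-negative reference group:
s_neg = 0.50
s_low = survival_under_hr(s_neg, 0.92)   # ~0.529 for the ERBB2-low group
print(round(100 * (s_low - s_neg), 1))   # absolute gain of ~2.9 points
```

With a higher (better) baseline survival, the same hazard ratio translates into an even smaller absolute gain, consistent with the modest 2.0% and 0.4% differences reported above.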
Purpose: Pathologic complete response (pCR) to neoadjuvant chemotherapy (NAC) in early breast cancer (EBC) is largely dependent on breast cancer subtype, but no clinical-grade model exists to predict response and guide selection of treatment. A biophysical simulation of response to NAC has the potential to address this unmet need. Methods: We conducted a retrospective evaluation of a biophysical simulation model as a predictor of pCR. Patients who received standard NAC at the University of Chicago for EBC between January 1, 2010 and March 31, 2020 were included. Response was predicted using baseline breast MRI, clinicopathologic features, and treatment regimen by investigators who were blinded to patient outcomes. Results: A total of 144 tumors from 141 patients were included; 59 were triple-negative, 49 HER2-positive, and 36 hormone receptor-positive/HER2-negative. Lymph node disease was present in half of patients, and most were treated with an anthracycline-based regimen (58.3%). Sensitivity and specificity of the biophysical simulation for pCR were 88.0% (95% confidence interval [CI] 75.7-95.5) and 89.4% (95% CI 81.3-94.8), respectively, with robust results regardless of subtype. In patients with predicted pCR, 5-year event-free survival was 98%, versus 79% with predicted residual disease (log-rank p = 0.01; HR 4.57, 95% CI 1.36-15.34). At a median follow-up of 5.4 years, no patients with predicted pCR experienced disease recurrence. Conclusion: A biophysical simulation model accurately predicts pCR and long-term outcomes from baseline MRI and clinical data, and is a promising tool to guide escalation/de-escalation of NAC.
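The sensitivity and specificity quoted above come directly from the 2×2 table of predicted versus observed pCR. The counts below are a hypothetical reconstruction chosen to total 144 tumors and match the quoted point estimates; they are not the study's actual table:

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts: 50 tumors with pCR, 94 without (144 total).
sens, spec = sens_spec(tp=44, fn=6, tn=84, fp=10)
print(round(100 * sens, 1), round(100 * spec, 1))  # 88.0 89.4
```

Here "positive" means achieving pCR, so sensitivity is the fraction of true responders the simulation flagged, and specificity is the fraction of non-responders it correctly ruled out.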
Large language models such as ChatGPT can produce increasingly realistic text, and the accuracy and integrity of using these models in scientific writing are unknown. We gathered fifty research abstracts from five high-impact-factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were flagged by an AI output detector, the 'GPT-2 Output Detector', with median [interquartile range] 'fake' scores (higher meaning more likely to be generated) of 99.98% [12.73%, 99.98%], compared with 0.02% [0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism-detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT but incorrectly identified 14% of original abstracts as generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards. The boundaries of ethical and acceptable use of large language models to help scientific writing are still being discussed, and different journals and conferences are adopting varying policies.