2023
DOI: 10.48550/arxiv.2303.04360
Preprint

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

Abstract: Recent advancements in large language models (LLMs) have led to the development of highly potent models like OpenAI's ChatGPT. These models have exhibited exceptional performance in a variety of tasks, such as question answering, essay composition, and code generation. However, their effectiveness in the healthcare sector remains uncertain. In this study, we seek to investigate the potential of LLMs to aid in clinical text mining by examining their ability to extract structured information from unstructured he…
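The abstract describes prompting an LLM to pull structured information out of free-text clinical notes. As a rough illustration only, and not the authors' exact prompts or pipeline, a zero-shot extraction call might look like the following sketch; the model name, prompt wording, and entity schema are assumptions, and it presumes the OpenAI Python client (v1+).

```python
# Illustrative sketch: zero-shot extraction of structured entities from a
# clinical note with a chat LLM. Prompt wording, entity schema, and model
# name are assumptions, not the paper's exact setup.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract all medical entities from the clinical note below and return "
    "JSON with the keys 'diseases', 'medications', and 'procedures', each "
    "mapping to a list of strings.\n\nNote:\n{note}"
)

def extract_entities(note: str) -> dict:
    """Ask the model for structured entities and parse its JSON reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the paper studies ChatGPT-era LLMs
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
        temperature=0,  # deterministic output is easier to parse
    )
    # A production pipeline would need more robust parsing of the reply text.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    note = "Patient with type 2 diabetes started on metformin after an HbA1c of 8.2%."
    print(extract_entities(note))
```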

Cited by 24 publications (30 citation statements)
References 32 publications (39 reference statements)

Citation statements:
“…Improving LLMs for NER tasks requires at least further fine-tuning but more likely supplying them with domain-specific training data for domains that they are not trained on. In our case, not all biomedical literature is freely shareable, and it is therefore not possible to send these data to external platforms to train such models, a problem that is potentially solvable by generating synthetic data for closed systems. Another issue is the compute needed to train such models, which, even with open LLMs such as LLaMA, require far more resources than BERT-based models.…”
Section: Results (mentioning)
confidence: 99%
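The point about closed systems suggests a workaround: rather than sending restricted biomedical text to an external platform, ask the external LLM to invent labelled sentences and fine-tune a local BERT-size NER model on the synthetic set. A minimal sketch of the generation step follows; the prompt wording, tag set, and model name are assumptions rather than the cited authors' setup, and it again presumes the OpenAI Python client.

```python
# Illustrative sketch of LLM-based synthetic data generation for biomedical NER:
# no real documents leave the system; the API is only asked to invent new
# labelled sentences. Prompt wording, tag set, and model name are assumptions.
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = (
    "Write {n} short, realistic but entirely fictional sentences in the style of "
    "biomedical abstracts. Wrap every disease mention in <DISEASE>...</DISEASE> "
    "and every chemical mention in <CHEMICAL>...</CHEMICAL>. "
    "Return one sentence per line."
)

def generate_synthetic_ner_sentences(n: int = 20) -> list[str]:
    """Request n tagged sentences; downstream code converts tags to BIO labels."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": GEN_PROMPT.format(n=n)}],
        temperature=0.9,  # higher temperature encourages more varied examples
    )
    return [line.strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]

synthetic = generate_synthetic_ner_sentences(20)
# The tagged sentences would then be converted to token-level BIO labels and used
# to fine-tune a local (e.g. BERT-based) NER model.
```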
“…If there is a relation, then the label should be "Yes", otherwise "No". (Tang et al., 2023) HoC: document: <text>; target: The correct category for this document is ? You must choose from the given list of answer categories (introduce what each category is ...)" (Chen et al., 2023) Table 4: The prompts used for different evaluation tasks and datasets.…”
Section: Metrics (mentioning)
confidence: 99%
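The two prompts excerpted above (a Yes/No relation judgment and HoC topic classification over an enumerated category list) can be expressed as simple templates. The wording below paraphrases the excerpt rather than reproducing the cited Table 4 verbatim, and the abbreviated category list is an assumption.

```python
# Paraphrased prompt templates in the spirit of the excerpt above; the exact
# wording and category descriptions in the cited Table 4 differ.

RELATION_PROMPT = (
    "Sentence: {text}\n"
    "Is there a relation between {entity_1} and {entity_2} in the sentence? "
    'If there is a relation, then the label should be "Yes", otherwise "No".'
)

# Hallmarks of Cancer (HoC) labels, abbreviated here; the full prompt would
# also introduce what each category is, as the excerpt notes.
HOC_CATEGORIES = [
    "sustaining proliferative signaling",
    "evading growth suppressors",
    "resisting cell death",
    # ... remaining HoC categories ...
]

HOC_PROMPT = (
    "document: {text}\n"
    "target: The correct category for this document is ? "
    "You must choose from the given list of answer categories: {categories}."
)

print(RELATION_PROMPT.format(text="Aspirin lowers the risk of stroke.",
                             entity_1="Aspirin", entity_2="stroke"))
print(HOC_PROMPT.format(text="The tumour cells evaded apoptosis ...",
                        categories=", ".join(HOC_CATEGORIES)))
```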
“…The research community explored GLLMs for data generation-based data augmentation in various NLP tasks like dialogue generation [410], training smaller LLMs [411], [416], common sense reasoning [412], hate speech detection [413], undesired content detection [414], question answering [415], [425], intent classification [143], relation extraction [155], [422], instruction tuning [417], [418], paraphrase detection [420], tweet intimacy prediction [421], named entity recognition [422], machine translation [424] etc. GLLM-based data generation for data augmentation is explored in multiple domains like general [143], [155], [412], [416]-[418], [420], [424]-[426], social media [409], [413], [414], [421], [423], news [423], scientific literature [155], [420], healthcare [410], [415], [422], dialogue [419], programming [411] etc. Table 19 presents a summary of research works exploring GLLMs for data generation-based data augmentation.…”
Section: Data Generation (mentioning)
confidence: 99%
“…Based on the evaluation on four topic classification datasets, the authors observed that (i) the proposed approach enhances the model performance and (ii) reduces the querying cost of ChatGPT by a large margin. Some of the research works explored GLLMs for data generation-based data augmentation in various information extraction tasks like relation extraction [155], relation classification [422] and named entity recognition [422]. Xu et al. [155] evaluated how effective the GPT-3.5 model is for relation classification.…”
Section: Data Generation (mentioning)
confidence: 99%