Proceedings of the 18th BioNLP Workshop and Shared Task 2019
DOI: 10.18653/v1/w19-5026

Is artificial data useful for biomedical Natural Language Processing algorithms?

Abstract: A major obstacle to the development of Natural Language Processing (NLP) methods in the biomedical domain is data accessibility. This problem can be addressed by generating medical data artificially. Most previous studies have focused on the generation of short clinical text, and evaluation of the data utility has been limited. We propose a generic methodology to guide the generation of clinical text with key phrases. We use the artificial data as additional training data in two key biomedical NLP tasks: text …


Cited by 4 publications (9 citation statements), published 2020–2021. References 25 publications.
“…In the end, the decision about which metric to use in such cases depends on the gain from not missing out on the minority classes, which may cost a small drop in the majority classes (which may still end up with relatively high performance); the system owner should weigh this trade-off. Further, we evaluated the classifier performance on the generated sentences alone (following Wang et al., 2019), without the train set, and found that micro accuracy falls by 17.5% and macro accuracy by 7.9%. This metric represents how well the generated dataset represents the train set.…”
Section: BalaGen: Improving Real-Life SUC Corpora (confidence: 99%)
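The micro/macro distinction in the statement above can be sketched as follows. The cited paper does not spell out its exact metric definitions, so this assumes the common interpretation: micro accuracy is the overall fraction of correct predictions, while macro accuracy is the unweighted mean of per-class accuracy, which gives minority classes equal weight.

```python
from collections import defaultdict

def micro_macro_accuracy(y_true, y_pred):
    """Micro accuracy: overall fraction of correct predictions.
    Macro accuracy: unweighted mean of per-class accuracy (recall),
    so each class counts equally regardless of its frequency."""
    assert len(y_true) == len(y_pred) and y_true
    correct = defaultdict(int)  # per-class correct counts
    total = defaultdict(int)    # per-class instance counts
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro

# Imbalanced toy data: class "a" dominates; one minority-class error.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
micro, macro = micro_macro_accuracy(y_true, y_pred)
# micro = 9/10 = 0.9; macro = mean(8/8, 1/2) = 0.75
```

A single minority-class error barely moves micro accuracy but drops macro accuracy sharply, which is exactly the trade-off the quoted statement says the system owner should weigh.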
“…The generation of synthetic EHR text for use in medical NLP is still at an early stage [3]. Most studies focus on the creation of English EHR text, using hospital discharge summaries from the MIMIC-III database [7,8,13,14]. In addition, a corpus of English Mental Health Records was explored [15].…”
Section: Generating Synthetic EHR Notes (confidence: 99%)
“…In addition, a corpus of English Mental Health Records was explored [15]. Unlike the mixed healthcare data used in this study, these EHR notes have a more consistent, template-like structure and contain medical jargon, lending themselves to the clinical/biomedical downstream tasks found in related work [8, 13–15]. Most of these studies focused on classification downstream tasks.…”
Section: Generating Synthetic EHR Notes (confidence: 99%)