DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks
2020 | Preprint
DOI: 10.48550/arxiv.2011.01549

Cited by 7 publications (12 citation statements) | References 38 publications

“…Language modeling has already been used as an augmentation method to generate labeled and unlabeled examples for NER in DAGA (Ding et al., 2020). However, our taggers outperform the taggers presented on the gold standard by 30 points at size 1000 and 9 points at full size.…”
Section: Introduction (mentioning)
confidence: 93%
“…However, in tagging, paraphrasing using back-translation (Neuraz et al., 2018) does not bring significant improvements. Recent work shows that using language models learned on the training data to generate labeled and unlabeled examples can bring improvements (Ding et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
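The back-translation paraphrasing mentioned in this quote can be sketched as a round trip through a pivot language. This is a hedged illustration, not the setup of Neuraz et al. (2018): `translate` is a hypothetical stand-in for any machine translation system, and the pivot language choice is arbitrary.

```python
# Sketch: paraphrase augmentation via back-translation.
# `translate` is a hypothetical MT function (assumption), e.g. wrapping an API
# or a local model; it takes a sentence and a target language code.

def back_translate(sentence, translate, pivot="fr", source="en"):
    """Round-trip a sentence through a pivot language to obtain a paraphrase."""
    pivoted = translate(sentence, target=pivot)
    return translate(pivoted, target=source)

# For tagging tasks the paraphrase changes wording and word order, so
# token-level labels no longer align with the new sentence, which is one
# plausible reason the quoted work sees no significant gains from this method.
```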
“…Similarly to (Zhou et al., 2019; Ding et al., 2020), we simulate a low-resource setting by randomly sampling tiny subsets of the training data. Since our focus is to measure the contextual learning ability of models, we first selected sentences of the CoNLL training data that contain at least one entity followed or preceded by 3 non-entity words.…”
Section: Low Resource Setting (mentioning)
confidence: 99%
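A minimal sketch of the sentence-selection step described in this quote, assuming BIO-tagged data where "non-entity" means the O tag; function and variable names are illustrative, and the exact filtering used by the cited authors may differ.

```python
# Sketch: simulating the quoted low-resource setting (assumptions: BIO tags,
# "non-entity" = O tag, context window of 3 on either side of an entity token).
import random

def has_entity_with_context(tags, window=3):
    """True if some entity token is directly preceded or followed by `window` O tags."""
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        before = tags[max(0, i - window):i]
        after = tags[i + 1:i + 1 + window]
        if (len(before) == window and all(t == "O" for t in before)) or \
           (len(after) == window and all(t == "O" for t in after)):
            return True
    return False

def sample_low_resource(sentences, size, seed=0):
    """Filter to sentences with enough non-entity context, then take a tiny random subset."""
    eligible = [s for s in sentences if has_entity_with_context([t for _, t in s])]
    random.Random(seed).shuffle(eligible)
    return eligible[:size]

# Usage (hypothetical data): subset = sample_low_resource(conll_train, size=1000)
```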
“…Ding et al. [163] introduced a generative language-model-based augmentation approach using an RNN language model for low-resource tagging tasks. This sentence-level augmentation approach linearized labeled sentences before training the language model, so that it learns the context and distribution of entity words.…”
Section: Data Augmentation For NER (mentioning)
confidence: 99%
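A minimal sketch of the linearization step described above, under the assumption of BIO-style tags: the tag of each labeled word is inserted as a token before that word, O tags are dropped, and a plain language model is then trained on the mixed stream. This follows my reading of the method as quoted; details of the actual DAGA scheme may differ.

```python
# Sketch: sentence linearization for language-model-based augmentation
# (assumed variant of the DAGA scheme; BIO tags, O labels are not emitted).

def linearize(tokens, tags):
    """Insert each non-O tag as a token immediately before the word it labels."""
    out = []
    for word, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)   # tag token precedes its word
        out.append(word)
    return out

def delinearize(sequence):
    """Recover (word, tag) pairs from a generated linearized sequence."""
    pairs, pending = [], "O"
    for tok in sequence:
        if tok.startswith(("B-", "I-")):  # looks like a tag token
            pending = tok
        else:
            pairs.append((tok, pending))
            pending = "O"
    return pairs

# Example:
# linearize(["John", "lives", "in", "London"], ["B-PER", "O", "O", "B-LOC"])
#   -> ["B-PER", "John", "lives", "in", "B-LOC", "London"]
# A language model trained on such sequences can be sampled to generate new
# labeled sentences, which delinearize() maps back to (word, tag) pairs.
```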
“…A cross-domain augmentation approach [165] was explored to leverage data from high-resource domains and apply learned linguistic patterns such as structure, style, and noise to low-resource domains. In this feature-based augmentation approach, a linearized sentence pair [163] from the source and target domains is used as input to the autoencoder model. The model performs "word-by-word" denoising reconstruction followed by detransforming reconstruction.…”
Section: Data Augmentation For NER (mentioning)
confidence: 99%
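One way to read the paired-input construction in this quote is sketched below. The pairing format, separator token, and noise operations are all assumptions made for illustration; the cited work [165] may use a different scheme.

```python
# Sketch: building a noisy paired input for a denoising autoencoder over
# linearized sentences. Separator token and noise operations are assumptions.
import random

def add_word_noise(tokens, drop_p=0.1, max_shift=2, seed=0):
    """Word-level noise: random token dropout plus a light local reordering."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > drop_p]
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

def make_pair_input(src_linearized, tgt_linearized):
    """Concatenate noisy source- and target-domain linearized sentences."""
    return add_word_noise(src_linearized) + ["<sep>"] + add_word_noise(tgt_linearized)
```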