Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.488

DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

Abstract: Data augmentation techniques have been widely used to improve machine learning performance as they enhance the generalization capability of models. In this work, to generate high quality synthetic data for low-resource tagging tasks, we propose a novel augmentation method with language models trained on the linearized labeled sentences. Our method is applicable to both supervised and semi-supervised settings. For the supervised settings, we conduct extensive experiments on named entity recognition (NER), part-of-speech (POS) tagging, …
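The recipe named in the abstract, training a language model on linearized labeled sentences, can be made concrete with a short sketch. The Python snippet below is a minimal, hypothetical illustration of the linearization and de-linearization steps the paper describes for tagging data: each non-"O" tag is inserted before the word it labels and "O" tags are dropped, so an ordinary language model can be trained on, and sampled from, the resulting flat sequences. The function names and example sentence are illustrative, not taken from the authors' code.

```python
# Minimal sketch of DAGA-style sentence linearization, assumed from the
# abstract's description; names here are illustrative, not the authors'.

def linearize(tokens, tags):
    """Flatten a labeled sentence so a language model can be trained on it.

    Each non-"O" tag is inserted immediately before the word it labels;
    "O" tags are dropped to keep the sequences short.
    """
    out = []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(token)
    return out


def delinearize(sequence, label_set):
    """Recover (token, tag) pairs from a generated linearized sequence."""
    pairs, pending = [], "O"
    for item in sequence:
        if item in label_set:
            pending = item  # this tag applies to the next word
        else:
            pairs.append((item, pending))
            pending = "O"
    return pairs


tokens = ["John", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
labels = {"B-PER", "I-PER", "B-LOC", "I-LOC"}

lin = linearize(tokens, tags)
# ['B-PER', 'John', 'lives', 'in', 'B-LOC', 'New', 'I-LOC', 'York']
assert delinearize(lin, labels) == list(zip(tokens, tags))
```

A language model trained on many such sequences can then be sampled to produce new linearized sentences, which de-linearize back into synthetic labeled training data.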

Cited by 73 publications (67 citation statements) · References 42 publications

“…It then generates additional sentences that fit this label. Ding et al. (2020) extend this idea to token-level tasks.…”
Section: Data Augmentation
confidence: 99%
“…Although labeled sequence translation generates high-quality multilingual NER training data, it adds limited variety, since translation does not introduce new entities or contexts. Inspired by DAGA (Ding et al., 2020), we propose a generation-based multilingual data augmentation method to add more diversity to the training data. DAGA is a monolingual data augmentation method designed for sequence labeling tasks, and it has been shown to add significant diversity to the training data.…”
Section: Synthetic Data Generation With Language Models
confidence: 99%
“…For NLP, back translation (Sennrich et al., 2016) is one of the most successful data augmentation approaches: it translates target-language monolingual data into the source language to generate more parallel data for MT model training. Other popular approaches include synonym replacement (Kobayashi, 2018), random deletion/swap/insertion (Kumar et al., 2020), and generation (Ding et al., 2020). Data augmentation has also proven useful in cross-lingual settings (Singh et al., 2020; Riabi et al., 2020; Qin et al., 2020), but most existing methods overlook better utilization of multilingual training data when such resources are available.…”
Section: Related Work
confidence: 99%
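For reference, the random deletion/swap operations mentioned in the quote above are simple token-level transformations. The sketch below is a generic, hypothetical illustration of those two operations, not the cited papers' exact implementations; label handling for tagging tasks is omitted for brevity.

```python
import random

def random_deletion(tokens, p=0.1):
    """Drop each token independently with probability p; never return empty."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def random_swap(tokens, n_swaps=1):
    """Swap n_swaps randomly chosen pairs of positions."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

print(random_deletion("a small labeled training set".split()))
print(random_swap("a small labeled training set".split(), n_swaps=2))
```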
“…Models trained on the augmented datasets are expected to be more robust against the considered noise. … (Bergmanis et al., 2017; Liu et al., 2020a; Ding et al., 2020; Liu et al., 2020b, inter alia), or applying post-processing on the examples generated by pretrained models (Wan et al., 2020; Yoo et al., 2020). In the data augmentation stage, given task-specific constraints, such models generate associated text accordingly.…”
Section: Introduction
confidence: 99%