“…For NLP, back translation (Sennrich et al., 2016) is one of the most successful data augmentation approaches: it translates target-language monolingual data into the source language to generate additional parallel data for MT model training. Other popular approaches include synonym replacement (Kobayashi, 2018), random deletion/swap/insertion (Kumar et al., 2020), and generation (Ding et al., 2020). Data augmentation has also proven useful in cross-lingual settings (Singh et al., 2020; Riabi et al., 2020; Qin et al., 2020), but most existing methods overlook better utilization of multilingual training data when such resources are available.…”
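As an illustration of the word-level operations mentioned above, here is a minimal sketch of random deletion, swap, and insertion over a tokenized sentence. This is not the cited authors' implementation; the function names, probabilities, and the use of in-sentence tokens for insertion (instead of thesaurus-based synonym insertion) are illustrative assumptions.

```python
import random

def random_deletion(tokens, p=0.1):
    # Drop each token independently with probability p;
    # keep at least one token so the example never becomes empty.
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def random_swap(tokens, n=1):
    # Swap n randomly chosen pairs of positions.
    tokens = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_insertion(tokens, n=1):
    # Insert n tokens drawn from the sentence itself at random positions
    # (a stand-in for synonym insertion, which requires a thesaurus or
    # a language model to propose replacements).
    tokens = tokens[:]
    for _ in range(n):
        tokens.insert(random.randrange(len(tokens) + 1), random.choice(tokens))
    return tokens

sentence = "data augmentation improves low resource translation".split()
print(random_deletion(sentence))
print(random_swap(sentence))
print(random_insertion(sentence))
```

Each operation produces a perturbed copy of the input, so the training set can be enlarged cheaply without any external resources.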