Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.343

An Analysis of Simple Data Augmentation for Named Entity Recognition

Abstract: Simple yet effective data augmentation techniques have been proposed for sentence-level and sentence-pair natural language processing tasks. Inspired by these efforts, we design and compare data augmentation for named entity recognition, which is usually modeled as a token-level sequence labeling problem. Through experiments on two data sets from the biomedical and materials science domains (i2b2-2010 and MaSciP), we show that simple augmentation can boost performance for both recurrent and transformer-based m…

Cited by 123 publications (120 citation statements)
References 33 publications
“…However, as already mentioned, creating a GSC is a complex and costly process. A first alternative is data augmentation, which consists of expanding the training set by applying transformations to training instances without changing their labels (Dai and Adel, 2020). Another alternative is to use DNNs to learn a good language representation model from a large corpus of unannotated documents, and to use transfer learning to adapt the pretrained model to downstream tasks.…”
Section: Discussion
confidence: 99%
“…As a consequence, GSCs in the ecological domain are few in number and small in size. To tackle the problem of training data shortage, several techniques have been proposed, including data augmentation (Dai and Adel, 2020) and transfer learning (Giorgi and Bader, 2018; Qiu et al, 2020).…”
Section: Introduction
confidence: 99%
“…In the computer vision community, this is a popular approach: rotating an image, for example, leaves the label of its content unchanged. For text, on the token level, this can be done by replacing words with equivalents, such as synonyms (Wei and Zou, 2019), entities of the same type (Raiman and Miller, 2017; Dai and Adel, 2020) or words that share the same morphology (Gulordava et al, 2018; Vania et al, 2019). Such replacements can also be guided by a language model that takes context into consideration (Fadaee et al, 2017; Kobayashi, 2018).…”
Section: Data Augmentation
confidence: 99%
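The same-type mention replacement described above can be sketched in a few lines. This is a minimal illustration, not the cited papers' implementation: the function name, the BIO tagging scheme, and the `mentions_by_type` inventory (e.g. collected from the training set) are assumptions for the example.

```python
import random

def replace_mentions(tokens, labels, mentions_by_type, rng=None):
    """Label-preserving augmentation for BIO-tagged NER data:
    swap each entity mention for a random mention of the same type.

    mentions_by_type: hypothetical dict mapping an entity type to a
    list of replacement mentions (each a list of tokens)."""
    rng = rng or random.Random()
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]
            # Consume the full mention span: the B- tag plus any I- tags.
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + etype:
                j += 1
            # Swap in a same-type mention; the label sequence stays valid.
            new_mention = rng.choice(mentions_by_type[etype])
            out_tokens.extend(new_mention)
            out_labels.append("B-" + etype)
            out_labels.extend(["I-" + etype] * (len(new_mention) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels
```

Because only mention spans change and their tags are rewritten to match the new length, the sentence-level annotation remains consistent by construction.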
“…Other token-level manipulation methods introduce noise, such as random token shuffling and deletion (Wei and Zou, 2019; Dai and Adel, 2020). Models trained on the augmented datasets are expected to be more robust against the considered noise.…”
Section: Introduction
confidence: 99%
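The noise-based manipulations mentioned here can be sketched as follows. This is a hedged illustration of the general idea, not the exact procedure from the cited papers: the parameter names (`p_delete`, `window`) and the choices to delete only O-tagged tokens and shuffle token-label pairs within fixed windows are assumptions made for the example.

```python
import random

def noisy_copy(tokens, labels, p_delete=0.1, window=3, seed=None):
    """Create a noised copy of a labeled sentence via two perturbations:
    (1) drop non-entity ('O') tokens with probability p_delete, and
    (2) shuffle token-label pairs inside fixed-size windows."""
    rng = random.Random(seed)
    # Deletion: keep every entity token; drop O tokens at random.
    pairs = [(t, l) for t, l in zip(tokens, labels)
             if l != "O" or rng.random() >= p_delete]
    # Local shuffling: permute pairs within each window so noise
    # stays local rather than scrambling the whole sentence.
    out = []
    for i in range(0, len(pairs), window):
        chunk = pairs[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return [t for t, _ in out], [l for _, l in out]
```

Keeping tokens and labels paired during deletion and shuffling ensures the augmented sequence stays aligned, even though BIO tag order may be perturbed, which is the intended training noise.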