Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.343

An Analysis of Simple Data Augmentation for Named Entity Recognition

Abstract: Simple yet effective data augmentation techniques have been proposed for sentence-level and sentence-pair natural language processing tasks. Inspired by these efforts, we design and compare data augmentation for named entity recognition, which is usually modeled as a token-level sequence labeling problem. Through experiments on two data sets from the biomedical and materials science domains (i2b2-2010 and MaSciP), we show that simple augmentation can boost performance for both recurrent and transformer-based m…

Cited by 123 publications (120 citation statements)
References 33 publications
“…However, as already mentioned, creating a GSC is a complex and costly process. A first alternative is data augmentation, which consists of expanding the training set by applying transformations to training instances without changing their labels (Dai and Adel, 2020). Another alternative is to use DNNs to learn a good language representation model from a large corpus of unannotated documents, and to use transfer learning to adapt the pretrained model to downstream tasks.…”
Section: Discussion
confidence: 99%
“…As a consequence, GSCs in the ecological domain are few in number and small in size. To tackle the problem of training data shortage, several techniques have been proposed, including data augmentation (Dai and Adel, 2020) and transfer learning (Giorgi and Bader, 2018; Qiu et al, 2020).…”
Section: Introduction
confidence: 99%
“…In the computer vision community, this is a popular approach: rotating an image, for example, leaves the label of its content unchanged. For text, on the token level, this can be done by replacing words with equivalents, such as synonyms (Wei and Zou, 2019), entities of the same type (Raiman and Miller, 2017; Dai and Adel, 2020) or words that share the same morphology (Gulordava et al, 2018; Vania et al, 2019). Such replacements can also be guided by a language model that takes context into consideration (Fadaee et al, 2017; Kobayashi, 2018).…”
Section: Data Augmentation
confidence: 99%
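The same-type mention replacement described above can be sketched in a few lines. This is a minimal illustration, not the cited papers' implementation: the function name, the BIO tagging scheme, and the `mentions_by_type` inventory (e.g. collected from the training set) are assumptions for the example.

```python
import random

def replace_mentions(tokens, labels, mentions_by_type, rng=None):
    """Label-preserving augmentation for BIO-tagged NER data:
    swap each entity mention for a random mention of the same type.

    mentions_by_type: hypothetical dict mapping an entity type to a
    list of replacement mentions (each a list of tokens)."""
    rng = rng or random.Random()
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]
            # Consume the full mention span: the B- tag plus any I- tags.
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + etype:
                j += 1
            # Swap in a same-type mention; the label sequence stays valid.
            new_mention = rng.choice(mentions_by_type[etype])
            out_tokens.extend(new_mention)
            out_labels.append("B-" + etype)
            out_labels.extend(["I-" + etype] * (len(new_mention) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels
```

Because only mention spans change and their tags are rewritten to match the new length, the sentence-level annotation remains consistent by construction.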
“…Other token-level manipulation methods introduce noise, such as random token shuffling and deletion (Wei and Zou, 2019; Dai and Adel, 2020). Models trained on the augmented datasets are expected to be more robust against the considered noise.…”
Section: Introduction
confidence: 99%
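The noise-based manipulations mentioned here can be sketched as follows. This is a hedged illustration of the general idea, not the exact procedure from the cited papers: the parameter names (`p_delete`, `window`) and the choices to delete only O-tagged tokens and shuffle token-label pairs within fixed windows are assumptions made for the example.

```python
import random

def noisy_copy(tokens, labels, p_delete=0.1, window=3, seed=None):
    """Create a noised copy of a labeled sentence via two perturbations:
    (1) drop non-entity ('O') tokens with probability p_delete, and
    (2) shuffle token-label pairs inside fixed-size windows."""
    rng = random.Random(seed)
    # Deletion: keep every entity token; drop O tokens at random.
    pairs = [(t, l) for t, l in zip(tokens, labels)
             if l != "O" or rng.random() >= p_delete]
    # Local shuffling: permute pairs within each window so noise
    # stays local rather than scrambling the whole sentence.
    out = []
    for i in range(0, len(pairs), window):
        chunk = pairs[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return [t for t, _ in out], [l for _, l in out]
```

Keeping tokens and labels paired during deletion and shuffling ensures the augmented sequence stays aligned, even though BIO tag order may be perturbed, which is the intended training noise.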