Published: 2023 · DOI: 10.3384/nejlt.2000-1533.2023.4725
NL-Augmenter 🦎 → 🐍 A Framework for Task-Sensitive Natural Language Augmentation

Abstract: Data augmentation is an important method for evaluating the robustness of natural language processing (NLP) models and for enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of…
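To make the transformation/filter distinction concrete, here is a minimal, self-contained Python sketch. The base classes and method names below are illustrative stand-ins rather than NL-Augmenter's actual interface; they only mirror the abstract's description: a transformation maps a sentence to one or more perturbed variants, and a filter is a boolean test that carves out a data split.

import random
from typing import List

# Illustrative base classes (hypothetical; not the framework's real interface).
class SentenceTransformation:
    """A transformation maps one sentence to one or more modified variants."""
    def generate(self, sentence: str) -> List[str]:
        raise NotImplementedError

class SentenceFilter:
    """A filter decides whether a sentence belongs to a particular data split."""
    def filter(self, sentence: str) -> bool:
        raise NotImplementedError

class SwapAdjacentChars(SentenceTransformation):
    """Toy character-level transformation: swap one random pair of
    adjacent characters, simulating a typo."""
    def generate(self, sentence: str) -> List[str]:
        chars = list(sentence)
        if len(chars) < 2:
            return [sentence]
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return ["".join(chars)]

class MaxLengthFilter(SentenceFilter):
    """Toy filter: keep only sentences with at most max_words whitespace tokens."""
    def __init__(self, max_words: int = 10):
        self.max_words = max_words
    def filter(self, sentence: str) -> bool:
        return len(sentence.split()) <= self.max_words

if __name__ == "__main__":
    s = "Data augmentation improves model robustness."
    print(SwapAdjacentChars().generate(s))  # e.g. ['Data augmentatoin improves model robustness.']
    print(MaxLengthFilter(5).filter(s))     # True (5 tokens)

In the actual framework, operations additionally declare metadata such as the tasks and languages they target, which is what makes the augmentation task-sensitive.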

Cited by 6 publications (4 citation statements) · References 71 publications
“…Much of their success has been extended beyond language-related tasks; essentially and arguably, any type of data with sequential properties, such as speech or music, does not appear too hard to model in theory given sufficient data and compute power (Srivastava et al., 2023).…”
Section: Discussion
Confidence: 99%
“…On the other hand, LLMs have improved across many tasks, narrowing the socio-technical gap. With greater exposure to data, LLMs have improved on measures of cognition and meaning, as estimates across language benchmarks keep improving (Nguyen et al., 2016; Sakaguchi et al., 2021; Srivastava et al., 2023; Wang et al., 2018; Gehrmann et al., 2022).…”
Section: The Framing Trap
Confidence: 99%
“…Text perturbations are divided into symbol-, word-, and sentence-level perturbations. Our selection of text perturbation levels draws upon the methodology designed in NL Augmenter [84]. We use NLPAug, NL Augmenter, and back-translation via EasyNMT to craft perturbations.…”
Section: Questions Perturbations
Confidence: 99%
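As a rough illustration of those three perturbation levels, the sketch below uses NLPAug for symbol- and word-level noise and EasyNMT for sentence-level back-translation. Library versions vary (for instance, augment() may return a string or a list depending on the NLPAug release), and the specific augmenters and the "opus-mt" model choice are our own assumptions, not the cited paper's exact configuration.

# pip install nlpaug easynmt  (SynonymAug also needs the NLTK WordNet data)
import nlpaug.augmenter.char as nac   # symbol-level perturbations
import nlpaug.augmenter.word as naw   # word-level perturbations
from easynmt import EasyNMT           # sentence-level via back-translation

text = "What is the capital of France?"

# Symbol level: simulate keyboard typos.
char_aug = nac.KeyboardAug()
print(char_aug.augment(text))

# Word level: replace words with WordNet synonyms.
word_aug = naw.SynonymAug(aug_src="wordnet")
print(word_aug.augment(text))

# Sentence level: back-translation (en -> de -> en) with EasyNMT.
model = EasyNMT("opus-mt")
pivot = model.translate(text, source_lang="en", target_lang="de")
print(model.translate(pivot, source_lang="de", target_lang="en"))

Back-translation tends to produce the most fluent paraphrases of the three, while character- and word-level noise more directly probes robustness to typos and lexical variation.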
“…While we regret this limitation, we note that lack of access to the complete pretraining data is a drawback our models share with many other present-day models. Future work may consider increasing the available data via augmentation techniques (Dhole et al., 2021) or mixing in data from a different modality such as code (Muennighoff et al., 2023b,a; …). The mC4-Fi and CC-Fi datasets are both derived from Common Crawl data, but they cover different sets of crawls and apply different selection criteria and text extraction and filtering pipelines.…”
Section: Limitations
Confidence: 99%