Proceedings of the 4th Workshop on NLP for Conversational AI 2022
DOI: 10.18653/v1/2022.nlp4convai-1.5

Data Augmentation for Intent Classification with Off-the-shelf Large Language Models

Abstract: Data augmentation is a widely employed technique to alleviate the problem of data scarcity. In this work, we propose a prompting-based approach to generate labelled training data for intent classification with off-the-shelf language models (LMs) such as GPT-3. An advantage of this method is that no task-specific LM fine-tuning for data generation is required; hence the method requires no hyper-parameter tuning and is applicable even when the available training data is very scarce. We evaluate the proposed method…
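The prompting-based generation the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the prompt template, `build_prompt`, `parse_generations`, and the intent names are assumptions, and the actual LLM call is left out (any completion API would produce the text that `fake_output` stands in for here).

```python
def build_prompt(intent, seed_utterances, k=10):
    """Build a few-shot prompt asking an LLM for k new utterances of one intent."""
    lines = [f"The following sentences are requests with the intent '{intent}':"]
    lines += [f"- {u}" for u in seed_utterances]
    lines.append(f"Write {k} more requests with the same intent:")
    return "\n".join(lines)

def parse_generations(intent, raw_text):
    """Parse line-per-utterance LLM output into (utterance, label) pairs."""
    pairs = []
    for line in raw_text.splitlines():
        line = line.strip().lstrip("-").strip()
        if line:
            pairs.append((line, intent))
    return pairs

# The completion would normally come from an LLM; this stand-in only
# shows the expected shape of the output being parsed.
prompt = build_prompt("transfer_money",
                      ["send $20 to alice", "wire money to my savings"])
fake_output = "- move $50 to bob\n- transfer funds to my checking account"
augmented = parse_generations("transfer_money", fake_output)
```

Because every generated utterance inherits the label of the intent it was prompted with, no fine-tuning or hyper-parameter search is needed; label noise from off-topic generations is the main failure mode this leaves open.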

Cited by 24 publications (11 citation statements)
References 38 publications
“…Previous work has used language models to generate synthetic data to increase the amount of available data using pretrained models (Kumar et al., 2020). Some examples of downstream tasks are text classification, intent classification (Sahu et al., 2022), toxic language detection (Hartvigsen et al., 2022), text mining (Tang et al., 2023), or mathematical reasoning (Liu et al., 2023b), inter alia. Synthetic data is also used to pretrain and distill language models.…”
Section: Natural Language Annotation and Data Generation Using LLMs (mentioning)
confidence: 99%
“…In the field of intent detection, previous work has proposed using data augmentation techniques to generate synthetic training data (Sahu et al., 2022). Sahu et al. (2022) also used PLMs to generate augmented examples, but they require human effort for labeling. This is a challenging task since it is expensive to annotate large amounts of data.…”
Section: Related Work (mentioning)
confidence: 99%
“…Following Sahu et al. (2022), we wanted to see if it is effective to use the available data to train an intent classifier and then use it to relabel the synthetic data. Intuitively, such a method would correct mistakes in the generation process.…”
Section: Data Relabelling (mentioning)
confidence: 99%
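The relabelling step quoted above can be sketched with a toy classifier. Everything here is an assumption for illustration: the bag-of-words centroid model, `train_centroids`, `relabel`, and the example intents stand in for whatever intent classifier the available seed data actually supports.

```python
from collections import Counter

def tokens(text):
    return text.lower().split()

def train_centroids(seed):
    """seed: list of (utterance, label). Return label -> aggregated token counts."""
    centroids = {}
    for utt, label in seed:
        centroids.setdefault(label, Counter()).update(tokens(utt))
    return centroids

def predict(centroids, utt):
    """Pick the label whose centroid shares the most token mass with utt."""
    toks = Counter(tokens(utt))
    return max(centroids, key=lambda lab: sum((centroids[lab] & toks).values()))

def relabel(centroids, synthetic):
    """Replace possibly noisy generated labels with classifier predictions."""
    return [(utt, predict(centroids, utt)) for utt, _ in synthetic]

seed = [("play some jazz", "play_music"),
        ("play a rock song", "play_music"),
        ("what is the weather today", "get_weather"),
        ("weather forecast for tomorrow", "get_weather")]
# One synthetic example carries a wrong generated label; relabelling fixes it.
synthetic = [("play my favourite song", "get_weather"),
             ("is it going to rain today", "get_weather")]
centroids = train_centroids(seed)
cleaned = relabel(centroids, synthetic)
```

The design choice being tested is that a classifier trained only on the small trusted set is still reliable enough to overrule the labels the generator assigned, trading a little classifier bias for less generation noise.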
“…Moreover, data augmentation techniques were utilized in [20] to enhance the robustness of NLP models. Furthermore, [36] employs ChatGPT to generate new data, showcasing another innovative application of data augmentation techniques in NLP.…”
Section: Related Work (mentioning)
confidence: 99%