Findings of the Association for Computational Linguistics: EMNLP 2020 2020
DOI: 10.18653/v1/2020.findings-emnlp.130

Balancing via Generation for Multi-Class Text Classification Improvement

Abstract: Data balancing is a known technique for improving the performance of classification tasks. In this work we define a novel balancing-via-generation framework termed BalaGen. BalaGen consists of a flexible balancing policy coupled with a text generation mechanism. Combined, these two techniques can be used to augment a dataset for more balanced distribution. We evaluate BalaGen on three publicly available semantic utterance classification (SUC) datasets. One of these is a new COVID-19 Q&A dataset published here …
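
As a rough illustration of what a balancing policy might compute before any text is generated, the sketch below counts how many additional examples each class would need to reach a chosen fraction of the majority class size. This is not BalaGen's actual policy, which the abstract does not specify; the function name `generation_targets` and the `target_ratio` parameter are hypothetical, and the text generation step itself is left out.

```python
from collections import Counter


def generation_targets(labels, target_ratio=1.0):
    """Hypothetical balancing policy: number of extra examples per class.

    Each class is topped up to `target_ratio` times the size of the
    largest class. Classes already at or above that size need nothing.
    """
    counts = Counter(labels)
    majority = max(counts.values())
    target = int(majority * target_ratio)
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Toy label distribution: 5 x "a", 2 x "b", 1 x "c".
labels = ["a"] * 5 + ["b"] * 2 + ["c"]
needed = generation_targets(labels)
print(needed)  # {'a': 0, 'b': 3, 'c': 4}
```

In a balancing-via-generation setup, the counts returned here would be passed to whatever generator produces the synthetic utterances for each under-represented class.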

citations
Cited by 5 publications
(2 citation statements)
references
References 42 publications
(26 reference statements)
“…2) MC over-predicts majority classes in both datasets (2s and 3s for MR and 1s and 10s for IMDb) while under-predicting the others (except 2s and 3s in IMDb). These results are in line with the common observation that MC models tend to overfit on the majority classes in imbalanced datasets, which motivates the use of "oversampling" or class balancing (Buda et al, 2018; Chawla et al, 2002; Tepper et al, 2020; Gao et al, 2020). OR, in contrast, provides a better fit for MR (slightly under-predicting for 1s), but significantly under-predicts on IMDb majority classes, displaying a much flatter distribution of predictions.…”
Section: Dataset Benchmarks (supporting)
confidence: 83%
“…We demonstrate our methodology and technologies on two publicly available datasets: CQA, a COVID-19 Questions and Answers chatbot data (Tepper et al 2020) and banking77 a banking related queries chatbot data (Casanueva et al 2020).…”
Section: Discussion (mentioning)
confidence: 99%