Balancing via Generation for Multi-Class Text Classification Improvement

Tepper, Naama; Goldbraich, Esther; Zwerdling, Naama; Kour, George; Tavor, Ateret Anaby; Carmeli, Boaz

doi:10.18653/v1/2020.findings-emnlp.130

Cited by 5 publications

(2 citation statements)

References 42 publications

(26 reference statements)

“…2) MC over-predicts majority classes in both datasets (2s and 3s for MR and 1s and 10s for IMDb) while under-predicting the others (except 2s and 3s in IMDb). These results are in line with the common observation that MC models tend to overfit on the majority classes in imbalanced datasets, which motivates the use of "oversampling" or class balancing (Buda et al, 2018; Chawla et al, 2002; Tepper et al, 2020; Gao et al, 2020). OR, in contrast, provides a better fit for MR (slightly under-predicting for 1s), but significantly under-predicts on IMDb majority classes, displaying a much flatter distribution of predictions.…”

Section: Dataset Benchmarks (supporting)

confidence: 83%

Error-Sensitive Evaluation for Ordinal Target Variables

Chen, Courtland, Faulkner et al. 2021

Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

Product reviews and satisfaction surveys seek customer feedback in the form of ranked scales. In these settings, widely used evaluation metrics including F1 and accuracy ignore the rank in the responses (e.g., 'very likely' is closer to 'likely' than 'not at all'). In this paper, we hypothesize that the order of class values is important for evaluating classifiers on ordinal target variables and should not be disregarded. To test this hypothesis, we compared Multi-class Classification (MC) and Ordinal Regression (OR) by applying OR and MC to benchmark tasks involving ordinal target variables using the same underlying model architecture. Experimental results show that while MC outperformed OR for some datasets in accuracy and F1, OR is significantly better than MC for minimizing the error between prediction and target for all benchmarks, as revealed by error-sensitive metrics, e.g. mean-squared error (MSE) and Spearman correlation. Our findings motivate the need to establish consistent, error-sensitive metrics for evaluating benchmarks with ordinal target variables, and we hope that it stimulates interest in exploring alternative losses for ordinal problems.

“…We demonstrate our methodology and technologies on two publicly available datasets: CQA, a COVID-19 Questions and Answers chatbot data (Tepper et al 2020) and banking77 a banking related queries chatbot data (Casanueva et al 2020).…”

Section: Discussion (mentioning)

confidence: 99%

High-quality Conversational Systems

Ackerman, Anaby-Tavor, Farchi et al. 2022

Preprint

Self Cite

Conversational systems or chatbots are an example of AI-Infused Applications (AIIA). Chatbots are especially important as they are often the first interaction of clients with a business and are the entry point of a business into the AI (Artificial Intelligence) world. The quality of the chatbot is, therefore, key. However, as is the case in general with AIIAs, it is especially challenging to assess and control the quality of chatbot systems. Beyond the inherent statistical nature of these systems, where occasional failure is acceptable, we identify two major challenges. The first is to release an initial system that is of sufficient quality such that humans will interact with it. The second is to maintain the quality, enhance its capabilities, improve it and make necessary adjustments based on changing user requests or drift. These challenges exist because it is impossible to predict the real distribution of user requests and the natural language they will use to express these requests. Moreover, any empirical distribution of requests is likely to change over time. This may be due to periodicity, changing usage, and drift of topics. We provide a methodology and set of technologies to address these challenges and to provide automated assistance through a human-in-the-loop approach. We notice that it is crucial to connect the different phases in the lifecycle of chatbot development and to make sure the chatbot provides its expected business value, for example, by freeing human agents to deal with tasks other than answering human users. Our methodology and technologies apply during chatbot training in the pre-production phase, through to chatbot usage in the field in the post-production phase. They implement the 'test first' paradigm by assisting in agile design, and support continuous integration through actionable insights.
