2019
DOI: 10.48550/arxiv.1904.10635
Preprint

Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings

Cited by 7 publications (15 citation statements)
References 18 publications
“…Learning-based metrics. This type of metric always consists of one or more training models, such as ADEM [24], RUBER [47], PONE [18], and BERT-RUBER [10]. Following previous work [10,23], we select BERT-RUBER as the sole representative learning-based metric in this study given its superior performance.…”
Section: Baseline Metrics (mentioning)
confidence: 99%
“…This type of metric always consists of one or more training models, such as ADEM [24], RUBER [47], PONE [18], and BERT-RUBER [10]. Following previous work [10,23], we select BERT-RUBER as the sole representative learning-based metric in this study given its superior performance. Since the performance of learning-based models could be influenced by the pre-prepared training dataset [23], we train and tune the model based on the specific dataset we use.…”
Section: Baseline Metrics (mentioning)
confidence: 99%