Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue 2019
DOI: 10.18653/v1/w19-5944

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

Abstract: The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time-consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments s…

Cited by 52 publications (55 citation statements)
References 32 publications

“…A well-known reason is that these automatic dialog evaluation metrics rely on modeling the distance between the generated response and a limited number of references available. The fundamental gap between the open-ended nature of the conversations and the limited references (Gupta et al., 2019) is not addressed in methods that are lexical-level based (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005), embedding based (Rus and Lintean, 2012; Forgues et al., 2014), or learning based (Tao et al., 2018; Lowe et al., 2017).…”
Section: Related Work
confidence: 99%
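To make the limited-reference problem concrete, here is a minimal sketch of multi-reference overlap scoring, assuming NLTK is installed; the function name and example strings are illustrative and not drawn from any of the cited papers. NLTK's sentence_bleu already accepts several references at once, so a valid response only has to match the closest one rather than a single gold reply.

```python
# Minimal sketch: scoring a generated response against multiple human
# references. All names and example strings here are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def multi_reference_bleu(response, references):
    """Sentence-level BLEU computed against several references at once.

    sentence_bleu clips n-gram counts against the best-matching reference,
    so adding references can only help a response that is sensible but
    worded differently from any single gold reply.
    """
    hypothesis = response.split()
    refs = [r.split() for r in references]
    smooth = SmoothingFunction().method1  # avoid zero scores on short replies
    return sentence_bleu(refs, hypothesis, smoothing_function=smooth)

references = [
    "sure , what time works for you ?",
    "sounds good , when should we meet ?",
    "okay . where do you want to go ?",
]
print(multi_reference_bleu("sure , when do you want to meet ?", references))
```

Embedding-based and learned metrics can be extended along the same lines, for example by taking the maximum similarity over the reference set instead of the maximum n-gram overlap.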
“…For image captioning, we use English Flickr30k (Young et al., 2014), and the STAIR (Yoshikawa et al., 2017) dataset which provides Japanese captions for COCO images. We also explore the multi-turn dialogue dataset DailyDialog (Li et al., 2017), which contains conversations that cover 10 different daily life topics, and its multi-reference test set (Gupta et al., 2019). Table 4 summarises the statistics about the datasets explored for this experiment.…”
Section: Single Representative Sentence
confidence: 99%
“…They measured the correlation, using a regression-based approach, between systems' responses and a large set of both positive and negative human references. Gupta et al. (2019) extended the test split of DailyDialog (1k dialogues) with multiple references. They compared the results of using single-reference versus multiple-reference data.…”
Section: Dialogue Evaluation
confidence: 99%
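As a rough illustration of the single- versus multi-reference comparison described above, the sketch below correlates metric scores with human ratings under both settings using SciPy; the score lists are placeholders for illustration only, not data or results from Gupta et al. (2019).

```python
# Minimal sketch: comparing how well a metric tracks human judgement when
# scored against one reference versus many. All numbers are placeholders.
from scipy.stats import pearsonr, spearmanr

human_ratings     = [4.0, 2.0, 5.0, 1.0, 3.0]            # human quality ratings per response
single_ref_scores = [0.10, 0.05, 0.20, 0.02, 0.07]        # metric scored against one reference
multi_ref_scores  = [0.35, 0.08, 0.55, 0.03, 0.22]        # metric scored as max over many references

for name, scores in [("single-ref", single_ref_scores), ("multi-ref", multi_ref_scores)]:
    p, _ = pearsonr(human_ratings, scores)
    s, _ = spearmanr(human_ratings, scores)
    print(f"{name}: Pearson r = {p:.2f}, Spearman rho = {s:.2f}")
```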