Álvaro Rodrigo scite author profile

Survey on evaluation methods for dialogue systems

In this paper, we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation, in and of itself, is a crucial part of the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost- and time-intensive. Thus, much work has been put into finding methods which allow a reduction in involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented, conversational, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for the dialogue systems and then present the evaluation methods regarding that class.

Overview of the Answer Validation Exercise 2006

Peñas, Rodrigo, Sama, et al. 2007

The first Answer Validation Exercise (AVE) was launched at the Cross Language Evaluation Forum 2006. This task is aimed at developing systems able to decide whether the answer of a Question Answering system is correct or not. The exercise is described here together with the evaluation methodology and the systems' results. The starting point for the AVE 2006 was the reformulation of Answer Validation as a Recognizing Textual Entailment problem, under the assumption that hypotheses can be automatically generated by instantiating hypothesis patterns with the QA systems' answers. Eleven groups participated with 38 runs in 7 different languages. Systems that reported the use of logic obtained the best results in their respective subtasks.

QA4MRE 2011-2013: Overview of Question Answering for Machine Reading Evaluation

Peñas, Hovy, Forner, et al. 2013

This paper describes the methodology for testing the performance of Machine Reading systems through Question Answering and Reading Comprehension Tests. This was the attempt of the QA4MRE challenge which was run as a Lab at CLEF 2011-2013. The traditional QA task was replaced by a new Machine Reading task, whose intention was to ask questions that required a deep knowledge of individual short texts and in which systems were required to choose one answer, by analysing the corresponding test document in conjunction with background text collections provided by the organization. Four different tasks have been organized during these years: Main Task, Processing Modality and Negation for Machine Reading, Machine Reading of Biomedical Texts about Alzheimer's disease, and Entrance Exams. This paper describes their motivation, their goals, their methodology for preparing the data sets, their background collections, their metrics used for the evaluation, and the lessons learned along these three years.

Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

Deriu, Tuggener, Däniken, et al. 2020

The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations requiring humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are human participants in these conversations). These annotations then allow us to rank chatbots regarding their ability to mimic the conversational behavior of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chatbot can uphold human-like behavior the longest, i.e., Survival Analysis. This metric has the ability to correlate a bot's performance to certain of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chatbots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.

Overview of ResPubliQA 2009: Question Answering Evaluation over European Legislation

Peñas, Forner, Sutcliffe, et al. 2010

This paper describes the first round of ResPubliQA, a Question Answering (QA) evaluation task over European legislation, proposed at the Cross Language Evaluation Forum (CLEF) 2009. The exercise consists of extracting a relevant paragraph of text that completely satisfies the information need expressed by a natural language question. The general goals of this exercise are (i) to study if the current QA technologies tuned for newswire collections and Wikipedia can be adapted to a new domain (law in this case); (ii) to move to a more realistic scenario, considering people close to law as users, and paragraphs as system output; (iii) to compare current QA technologies with pure Information Retrieval (IR) approaches; and (iv) to introduce in QA systems the Answer Validation technologies developed in the past three years. The paper describes the task in more detail, presenting the different types of questions, the methodology for the creation of the test sets and the new evaluation measure, and analyzing the results obtained by systems and the more successful approaches. Eleven groups participated with 28 runs. In addition, we evaluated 16 baseline runs (2 per language) based only on a pure IR approach, for comparison purposes. Considering accuracy, scores were generally higher than in previous QA campaigns.

Overview of the Answer Validation Exercise 2007

Peñas, Rodrigo, Verdejo

A study about the future evaluation of Question-Answering systems

Rodrigo, Peñas 2017, Knowledge-Based Systems

SPARTE, a Test Suite for Recognising Textual Entailment in Spanish

Peñas, Rodrigo, Verdejo 2006
