Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (ACL-IJCNLP '09), 2009
DOI: 10.3115/1667583.1667676
Validating the web-based evaluation of NLG systems

Abstract: The GIVE Challenge is a recent shared task in which NLG systems are evaluated over the Internet. In this paper, we validate this novel NLG evaluation methodology by comparing the Internet-based results with results we collected in a lab experiment. We find that the results delivered by both methods are consistent, but the Internet-based approach offers the statistical power necessary for more fine-grained evaluations and is cheaper to carry out.

Cited by 3 publications (2 citation statements) · References 6 publications
“…Such effects have been found in previous studies; for example, Krahmer and Swerts (2004) found that Dutch and Italian subjects perceived the role of prosodically linked eyebrow movements differently. However, Koller et al (2009) have recently replicated in a lab-based study the results from an online study of a set of natural-language generation systems. This indicates that the results from Internet-based evaluation of generated output can be reliable despite the diverse subject pool; however, future lab-based experiments may be advisable to confirm this for this particular task.…”
Section: Discussion
confidence: 85%
“…Taxonomies [e.g., 1] and frameworks [e.g., 2] have been proposed, often emphasizing the need to distinguish features of user, agent, and task. It is more common now for evaluation of ECAs, or component models such as natural language generation or text-to-speech (TTS) synthesis systems, to consist of both objective and subjective measures [2][3][4][5][6][7]. There are still instances, however, where data are collected in the absence of the manipulation of specific variables (comparison conditions) or without a control condition [e.g.…”
Section: Introduction
confidence: 99%