Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG evaluation data, in particular, its reliability. According to existing scales of IAA interpretation (see, for example, Lommel et al. (2014), Liu et al. (2016), Sedoc et al. (2018) and Amidei et al. (2018a)), most data collected for NLG evaluation fail the reliability test. We confirmed this trend by analysing papers published over the last 10 years in NLG-specific conferences (in total 135 papers that included some sort of human evaluation study). Following Sampson and Babarczy (2008), Lommel et al. (2014), Joshi et al. (2016) and Amidei et al. (2018b), such phenomena can be explained in terms of irreducible human language variability. Using three case studies, we show the limits of considering IAA as the only criterion for checking evaluation reliability. Given human language variability, we propose that for human evaluation of NLG, correlation coefficients and agreement coefficients should be used together to obtain a better assessment of the evaluation data reliability. This is illustrated using the three case studies.
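To make the distinction between agreement and correlation concrete, the following minimal sketch computes both for two annotators. The abstract does not say which coefficients the paper uses; unweighted Cohen's kappa and Spearman's rho are assumed here purely for illustration, and the ratings and function names are hypothetical, not data or code from the paper.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two annotators' labels.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two annotators' label proportions.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation of the rank-transformed ratings."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical ratings: annotator B is consistently one point above A.
annotator_a = [1, 2, 3, 4, 1, 2, 3, 4]
annotator_b = [2, 3, 4, 5, 2, 3, 4, 5]

kappa = cohen_kappa(annotator_a, annotator_b)
rho = spearman_rho(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}, rho = {rho:.3f}")
```

In this toy case the annotators never choose the same label, so kappa is negative, yet they order the items identically, so rho is at its maximum. An agreement coefficient alone would mark the data as unreliable even though the two annotators share a consistent ranking.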
In the last few years Automatic Question Generation (AQG) has attracted increasing interest. In this paper we survey the evaluation methodologies used in AQG. Based on a sample of 37 papers, our research shows that the systems' development has not been accompanied by similar developments in the methodologies used for the systems' evaluation. Indeed, in the papers we examine here, we find a wide variety of both intrinsic and extrinsic evaluation methodologies. Such diverse evaluation practices make it difficult to reliably compare the quality of different generation systems. Our study suggests that, given the rapidly increasing level of research in the area, a common framework is urgently needed to compare the performance of AQG systems and NLG systems more generally.
We define and study quasidialectical systems, which are an extension of Magari’s dialectical systems, designed to make Magari’s formalization of trial and error mathematics more adherent to the real mathematical practice of revision: our proposed extension follows, and in several regards makes more precise, varieties of empiricist positions à la Lakatos. We prove several properties of quasidialectical systems and of the sets that they represent, called quasidialectical sets. In particular, we prove that the quasidialectical sets are $\Delta_2^0$ sets in the arithmetical hierarchy. We distinguish between “loopless” quasidialectical systems and quasidialectical systems “with loops”. The latter represent exactly those coinfinite c.e. sets that are not simple. In a subsequent paper we will show that whereas the dialectical sets are ω-c.e., the quasidialectical sets spread out throughout all classes of the Ershov hierarchy of the $\Delta_2^0$ sets.
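For readers unfamiliar with the classes named above, the standard textbook characterizations can be sketched as follows. These are the usual definitions from computability theory, not statements taken from the abstract; the symbols $g$ and $h$ are notation introduced here for illustration.

```latex
% Limit lemma (Shoenfield): a set A of natural numbers is \Delta_2^0
% exactly when it is the pointwise limit of a computable 0/1-valued approximation.
\[
A \in \Delta_2^0
\iff
\exists\, g\colon \omega\times\omega \to \{0,1\} \text{ computable such that }
\forall x\;\; \lim_{s\to\infty} g(x,s) = A(x).
\]
% A is \omega-c.e. when, in addition, the number of "mind changes" of the
% approximation on each input is bounded by a computable function h.
\[
A \text{ is } \omega\text{-c.e.}
\iff
\exists\, g, h \text{ computable such that }
\forall x\;\; \lim_{s\to\infty} g(x,s) = A(x)
\;\text{ and }\;
\bigl|\{\, s : g(x,s) \neq g(x,s+1) \,\}\bigr| \le h(x).
\]
```

Read this way, the contrast drawn in the abstract is about how tightly the revisions of an approximation can be bounded: the dialectical sets admit a computable bound on mind changes, whereas the quasidialectical sets occur throughout the levels of the Ershov hierarchy.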
Rating and Likert scales are widely used in evaluation experiments to measure the quality of Natural Language Generation (NLG) systems. We review the use of rating and Likert scales for NLG evaluation tasks published in NLG specialized conferences over the last ten years (135 papers in total). Our analysis brings to light a number of deviations from good practice in their use. We conclude with some recommendations about the use of such scales. Our aim is to encourage the appropriate use of evaluation methodologies in the NLG community.
This paper is a continuation of Amidei, Pianigiani, San Mauro, Simi, & Sorbi (2016), where we introduced quasidialectical systems, which are abstract deductive systems designed to provide, in line with Lakatos’ views, a formalization of trial and error mathematics more adherent to the real mathematical practice of revision than Magari’s original dialectical systems. In this paper we prove that the two models of deductive systems (dialectical systems and quasidialectical systems) have in some sense the same information content, in that they represent two classes of sets (the dialectical sets and the quasidialectical sets, respectively), which have the same Turing degrees (namely, the computably enumerable Turing degrees), and the same enumeration degrees (namely, the $\Pi_1^0$ enumeration degrees). Nonetheless, dialectical sets and quasidialectical sets do not coincide. Even restricting our attention to the so-called loopless quasidialectical sets, we show that the quasidialectical sets properly extend the dialectical sets. As both classes consist of $\Delta_2^0$ sets, the extent to which the two classes differ is conveniently measured using the Ershov hierarchy: indeed, the dialectical sets are ω-computably enumerable (close inspection also shows that there are dialectical sets which do not lie in any finite level; and in every finite level n ≥ 2 of the Ershov hierarchy there is a dialectical set which does not lie in the previous level); on the other hand, the quasidialectical sets spread out throughout all classes of the hierarchy (close inspection shows that for every ordinal notation a of a nonzero computable ordinal, there is a quasidialectical set lying in $\Sigma_a^{-1}$, but in none of the preceding levels).