We propose a reference-less metric trained on manual evaluations of system outputs for grammatical error correction (GEC). Previous studies have shown that reference-less metrics are promising; however, existing metrics are not optimized for manual evaluations of system outputs because no dataset of system outputs with manual evaluations exists. In this study, we manually evaluate the outputs of GEC systems in order to optimize the metrics. Experimental results show that the proposed metric improves the correlation with manual evaluation in both system- and sentence-level meta-evaluation. Our dataset and metric will be made publicly available.

2 Related Work

Napoles et al. (2016) pioneered the reference-less GEC metric. They presented a metric based on grammatical error detection tools and linguistic features such as language models, and demonstrated that its performance was close to that of reference-based metrics. Asano et al. (2017) combined three sub-metrics: grammaticality, fluency, and meaning preservation, and outperformed reference-based metrics. They trained a logistic regression model on the GUG dataset (Heilman et al., 2013).
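To make the combination idea concrete, here is a minimal sketch, not the authors' released implementation, of fitting interpolation weights for the three sub-metrics against manual scores. The data values, the 5-point rating scale, and the use of scikit-learn are illustrative assumptions.

```python
# Minimal sketch: combine grammaticality, fluency, and meaning-preservation
# sub-metric scores into one sentence score whose weights are fit against
# manual evaluations (in the spirit of Asano et al., 2017).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: each row holds (grammaticality, fluency, meaning)
# sub-metric scores for one corrected sentence; y holds the manual
# evaluation score assigned to that sentence by human judges.
X = np.array([
    [0.9, 0.8, 0.95],
    [0.4, 0.5, 0.90],
    [0.7, 0.6, 0.30],
])
y = np.array([4.5, 2.0, 1.5])  # e.g., ratings on a 5-point scale

# Fit interpolation weights so the combined score tracks human judgments.
model = LinearRegression().fit(X, y)

def combined_score(grammaticality: float, fluency: float, meaning: float) -> float:
    """Weighted combination of the three sub-metric scores."""
    return model.predict([[grammaticality, fluency, meaning]])[0]

print(combined_score(0.8, 0.7, 0.9))
```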
In this paper, we describe our participation in the WMT 2019 Metrics Shared Task. We propose a method to filter pseudo-references by paraphrasing for the automatic evaluation of machine translation (MT). We use the outputs of off-the-shelf MT systems as pseudo-references, filtered by paraphrasing, in addition to a single human reference (gold reference). We use BERT fine-tuned on a paraphrase corpus to filter pseudo-references by checking their paraphrasability with the gold reference. Experimental results on the WMT 2016 and 2017 datasets show that our method achieves a higher correlation with human evaluation than sentence BLEU (Sent-BLEU) baselines with a single reference and with unfiltered pseudo-references.
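The filtering step can be sketched as follows; this is an illustrative reconstruction, not the submitted system. The checkpoint path is a hypothetical placeholder for a BERT model fine-tuned on a paraphrase corpus, the positive-label index is an assumption flagged in the comments, and sacrebleu stands in for whatever sentence-BLEU implementation was actually used.

```python
# Minimal sketch: keep only pseudo-references that a BERT paraphrase
# classifier judges to be paraphrases of the gold reference, then score
# the hypothesis with sentence BLEU against the surviving references.
import torch
import sacrebleu
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "path/to/bert-finetuned-paraphrase"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def is_paraphrase(gold_ref: str, pseudo_ref: str, threshold: float = 0.5) -> bool:
    """Keep a pseudo-reference only if BERT rates it a paraphrase of the gold reference."""
    inputs = tokenizer(gold_ref, pseudo_ref, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0, 1].item() >= threshold  # assumes label 1 = "paraphrase"

def score(hypothesis: str, gold_ref: str, pseudo_refs: list[str]) -> float:
    """Sentence BLEU against the gold reference plus filtered pseudo-references."""
    refs = [gold_ref] + [r for r in pseudo_refs if is_paraphrase(gold_ref, r)]
    return sacrebleu.sentence_bleu(hypothesis, refs).score
```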
The development of a reliable automatic evaluation metric for grammatical error correction (GEC) is useful for GEC research and development. Since it is difficult to cover all possible reference sentences, previous studies have proposed reference-less metrics. One of them achieved a higher correlation with manual evaluation than reference-based metrics by integrating metrics from the three perspectives of grammaticality, fluency, and meaning preservation. However, the correlation with manual evaluation can be further improved because the individual metrics are not optimized against the corresponding manual evaluations. Therefore, in this study, we propose a method for optimizing each metric. Furthermore, we create a dataset with manual evaluations of system outputs that is well suited for this optimization. Experimental results show that the proposed metric improves the correlation with manual evaluation.
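A minimal sketch of what optimizing a single sub-metric on manual evaluations could look like, assuming one BERT regressor per perspective; the toy sentences, the ratings, and the [0, 1] normalization are invented for illustration, and the other two sub-metrics would be trained the same way on their own annotations.

```python
# Minimal sketch: train a BERT regression head to predict the human
# score for one perspective (here, grammaticality) directly from a
# system output sentence.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=1  # single-output regression head
)

# Hypothetical manually annotated data: system outputs paired with
# human grammaticality ratings normalized to [0, 1].
sentences = ["He go to school yesterday .", "He went to school yesterday ."]
scores = [0.3, 1.0]

enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
targets = torch.tensor(scores).unsqueeze(-1)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.MSELoss()

model.train()
for _ in range(3):  # a few passes over the toy batch
    optimizer.zero_grad()
    preds = model(**enc).logits  # shape: (batch, 1)
    loss = loss_fn(preds, targets)
    loss.backward()
    optimizer.step()
```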