This paper describes our Word-level QE system for WMT 2014 shared task on Spanish -English pair. Compared to WMT 2013, this year's task is different due to the lack of SMT setting information and additional resources.We report how we overcome this challenge to retain most of the important features which performed well last year in our system. Novel features related to the availability of multiple systems output (new point of this year) are also proposed and experimented along with baseline set. The system is optimized by several ways: tuning the classification threshold, combining with WMT 2013 data, and refining using Feature Selection strategy on our development set, before dealing with the test set for submission.
Information about the antecedents of pronouns is considered essential to solve certain translation divergencies, such as those concerning the English pronoun it when translated into gendered languages, e.g. for French into il, elle, or several other options. However, no machine translation system using anaphora resolution has so far been able to outperform a phrase-based statistical MT baseline. We address here one of the reasons for this failure: the imperfection of automatic anaphora resolution algorithms. Using parallel data, we learn probabilistic correlations between target-side pronouns and the gender and number features of their (uncertain) antecedents, as hypothesized by the Stanford Coreference Resolution system on the source side. We embody these correlations into a secondary translation model, which we invoke upon decoding with the Moses statistical phrase-based MT system. This solution outperforms a deterministic pronoun post-editing system, as well as a statistical MT baseline, on automatic and human evaluation metrics.
The Idiap NLP Group has participated in both DiscoMT 2015 sub-tasks: pronounfocused translation and pronoun prediction. The system for the first sub-task combines two knowledge sources: grammatical constraints from the hypothesized coreference links, and candidate translations from an SMT decoder. The system for the second sub-task avoids hypothesizing a coreference link, and uses instead a large set of source-side and target-side features from the noun phrases surrounding the pronoun to train a pronoun predictor.
We implement a fully probabilistic model to combine the hypotheses of a Spanish anaphora resolution system with those of a Spanish-English machine translation system. The probabilities over antecedents are converted into probabilities for the features of translated pronouns, and are integrated with phrase-based MT using an additional translation model for pronouns. The system improves the translation of several Spanish personal and possessive pronouns into English, by solving translation divergencies such as ella → she | it or su → his | her | its | their. On a test set with 2,286 pronouns, a baseline system correctly translates 1,055 of them, while ours improves this by 41. Moreover, with oracle antecedents, possessives are translated with an accuracy of 83%.
This paper proposes to use Word Confidence Estimation (WCE) information to improve MT outputs via N-best list reranking. From the confidence label assigned for each word in the MT hypothesis, we add six scores to the baseline loglinear model in order to re-rank the N-best list. Firstly, the correlation between the WCE-based sentence-level scores and the conventional evaluation scores (BLEU, TER, TERp-A) is investigated. Then, the N-best list re-ranking is evaluated over different WCE system performance levels: from our real and efficient WCE system (ranked 1st during last WMT 2013 Quality Estimation Task) to an oracle WCE (which simulates an interactive scenario where a user simply validates words of a MT hypothesis and the new output will be automatically re-generated). The results suggest that our real WCE system slightly (but significantly) improves the baseline while the oracle one extremely boosts it; and better WCE leads to better MT quality.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.