Survey of data-selection methods in statistical machine translation

Eetemadi, Sauleh; Lewis, William D.; Toutanova, Kristina; Radha, Hayder

doi:10.1007/s10590-015-9176-1

Cited by 43 publications

(15 citation statements)

References 30 publications


“…However, since the size of the data sets that participants must produce in this task is smaller than the number of parallel sentences that are mutual translations, this task is also related to the data selection: selection of a subset of data that maximizes translation quality, avoiding redundancy and matching a given domain (Eetemadi et al, 2015). Instead of the widespread language-model based data selection methods (Axelrod et al, 2011), we replaced words with placeholders in order to not take into account the domain of the text.…”

Section: Related Work (mentioning)

confidence: 99%

Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task

Sánchez-Cartagena, Bañón, Ortiz-Rojas et al. 2018

Proceedings of the Third Conference on Machine Translation: Shared Task Papers


This paper describes Prompsit Language Engineering's submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual translations. A set of hand-crafted hard rules for discarding sentences with evident flaws were applied before the classifier. We explored different strategies for achieving a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-inspired data selection algorithm and n-gram saturation. Our submissions were very competitive in comparison with other participants on the 100 million word training corpus.



“…Among different data selection techniques (Eetemadi et al, 2015), in this work, we focus on three particular methods: Cross Entropy Difference (Section 2.1), TF-IDF Data Selection (Section 2.2), and Feature Decay Algorithms (Section 2.3).…”

Section: Data Selection Methods (mentioning)

confidence: 99%

Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods

Silva, Liu, Poncelas et al. 2018

Proceedings of the Third Conference on Machine Translation: Research Papers


Data selection is a process used in selecting a subset of parallel data for the training of machine translation (MT) systems, so that 1) resources for training might be reduced, 2) trained models could perform better than those trained with the whole corpus, and/or 3) trained models are more tailored to specific domains. It has been shown that for statistical MT (SMT), the use of data selection helps improve the MT performance significantly. In this study, we reviewed three data selection approaches for MT, namely Term Frequency-Inverse Document Frequency, Cross-Entropy Difference and Feature Decay Algorithm, and conducted experiments on Neural Machine Translation (NMT) with the selected data using the three approaches. The results showed that for NMT systems, using data selection also improved the performance, though the gain is not as much as for SMT systems.
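The cross-entropy difference approach named in this abstract can be illustrated with a minimal sketch. It assumes the usual formulation: each candidate sentence is scored by its cross-entropy under an in-domain language model minus its cross-entropy under a general-domain one, and the lowest-scoring sentences are kept. The add-one-smoothed unigram models and the function names (`train_unigram`, `cross_entropy`, `select_by_ced`) are illustrative assumptions, not the cited authors' implementation; published systems typically use higher-order n-gram or neural language models.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build an add-one-smoothed unigram model; returns a log-probability function."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(sentence, logprob):
    """Per-token cross-entropy (in nats) of a whitespace-tokenised sentence."""
    tokens = sentence.split()
    return -sum(logprob(t) for t in tokens) / len(tokens)


def select_by_ced(pool, in_domain, general, k):
    """Return the k pool sentences with the lowest cross-entropy difference.

    Score = H_in(s) - H_general(s); lower means "more like the in-domain data
    and less like the general data". Empty sentences are skipped.
    """
    lp_in = train_unigram(in_domain)
    lp_gen = train_unigram(general)
    candidates = [s for s in pool if s.split()]
    ranked = sorted(
        candidates,
        key=lambda s: cross_entropy(s, lp_in) - cross_entropy(s, lp_gen),
    )
    return ranked[:k]
```

In practice the selection size `k` (or an equivalent score threshold) would be tuned on held-out in-domain data, since the abstract's point is that a well-chosen subset can match or beat training on the whole corpus.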


“…Eetemadi et al [25] offered a complex survey of data selection methods in machine translation. They also describe works focusing on cross-entropy which has become the most commonly used approach in data selection.…”

Section: Related Work (mentioning)

confidence: 99%

Towards the use of entropy as a measure for the reliability of automatic MT evaluation metrics

Munk, Munková, Benko 2018

IFS


The study describes an experiment with different estimations of reliability. Reliability reflects the technical quality of the measurement procedure such as an automatic evaluation of Machine Translation (MT). Reliability is an indicator of accuracy, the reliability of measuring, in our case, measuring the accuracy and error rate of MT output based on automatic metrics (precision, recall, f-measure, Bleu-n, WER, PER, and CDER). The experiment showed metrics (Bleu-4 and WER) that reduce the overall reliability of the automatic evaluation of accuracy and error rate using entropy. Based on the results, we can say that the use of entropy for the estimation of reliability brings more accurate results than conventional estimations of reliability (Cronbach's alpha and correlation). MT evaluation, based on n-grams or edit distance, using entropy could offer a new view on lexicon-based metrics in comparison to commonly used ones.

