Massively multilingual pre-trained language models (MMPLMs) are developed in recent years demonstrating super powers and the preknowledge they acquire for downstream tasks. In this work, we investigate whether MM-PLMs can be applied to zero-shot machine translation (MT) towards entirely new language pairs and new domains. We carry out experimental investigation using Meta-AI's MM-PLMs "wmt21-dense-24-wide-en-X and X-en (WMT21fb)" which were pre-trained on 7 language pairs and 14 translation directions including English to Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese, and opposite direction. We fine-tune these MM-PLMs towards English-Spanish language pair which did not exist at all in their original pre-trained corpora both implicitly and explicitly. We prepare carefully aligned clinical domain data for this fine-tuning, which is different from their original mixed domain knowledge as well. Our experimental result shows that the fine-tuning is very successful using just 250k well-aligned in-domain EN-ES pairs / sentences for three sub-task translation testings: clinical cases, clinical terms and ontology concepts. It achieves very close evaluation scores to another MMPLM NLLB from Meta-AI, which included Spanish as a highresource setting in the pre-training. To the best of our knowledge, this is the first work on using MMPLMs towards real zero-shot NMT successfully for totally unseen languages during pre-training, and also the first in clinical domain for such study.
Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry setting by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform and go into great linguistic detail, raise issues as to inter-rater reliability (IRR) and are not designed to measure quality of worse than premium quality translations. In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and use a scoring model with geometric progression of error penalty points (EPPs) reflecting error severity level to each translation unit. The initial experimental work carried out on English-Russian language pair MT outputs on marketing content type of text from highly technical domain reveals that our evaluation framework is quite effective in reflecting the MT output quality regarding both overall system-level performance and segment-level transparency, and it increases the IRR for error type interpretation. The approach has several key advantages, such as ability to measure and compare less than perfect MT output from different systems, ability to indicate human perception of quality, immediate estimation of the labor effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data is available at https://github.com/lHan87/HOPE.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.