No abstract
Language model adaptation based on Machine Translation (MT) is a recently proposed approach to improve the Automatic Speech Recognition (ASR) of spoken translations that does not suffer from a common problem in approaches based on rescoring i.e. errors made during recognition cannot be recovered by the MT system.In previous work we presented an efficient implementation for MT-based language model adaptation using a word-based translation model. By omitting renormalization and employing weighted updates, the implementation exhibited virtually no adaptation overhead, enabling its use in a real-time setting.In this paper we investigate whether we can improve recognition accuracy without sacrificing the achieved efficiency. More precisely, we investigate the effect of both state-of-the-art phrase-based translation models and named entity probability estimation. We report relative WER reductions of 6.2% over a word-based LM adaptation technique and 25.3% over an unadapted 3-gram baseline on an English-to-Dutch dataset.
To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR-NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that both the Levenshtein distance between the target and the translated source, as well as the cosine distance between sentence embeddings of the source and the target were the two most important features for the task of misalignment detection. Using gold standards for alignment, we demonstrated that our model can increase the quality of alignments in a corpus substantially, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
Semantic knowledge has been adopted recently for SMT preprocessing, decoding and evaluation, in order to be able to compare sentences based on their meaning rather than on mere lexical and syntactic similarity. Little attention has been paid to semantic knowledge in the context of integrating fuzzy matches from a translation memory with SMT. We present work in progress which focuses on semantics-based pretranslation before decoding in SMT. This involves applying fuzzy matching metrics based on lexical semantics and semantic roles, aligning parse trees based on semantic roles, and pretranslating matching source sentence parts using aligned tree nodes.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.