Language model adaptation with additional text generated by machine translation

Nakajima, Hideharu; Yamamoto, Hirotsugu; Watanabe, Takahiro

doi:10.3115/1072228.1072392

Cited by 7 publications

(6 citation statements)

References 7 publications


“…The collection of textual data in a given language (and for a given domain) is also a hot topic that can be addressed using the Web as a corpus (Le et al, 2003; Cai, 2008) or using machine translation systems to port text corpora from one language to another (Nakajima et al, 2002; Jensson, 2008; Suenderman and Liscombe, 2009; Cucu et al, 2012). However, one faces specific problems, when developing language models for some under-resourced languages.…”

Section: Web or Translation-based Text Data Collection (mentioning)

confidence: 99%

Automatic speech recognition for under-resourced languages: A survey

Besacier, Barnard, Karpov, et al. 2014

Speech Communication


Speech processing for under-resourced languages is an active field of research, which has experienced significant progress during the past decade. We propose, in this paper, a survey that focuses on automatic speech recognition (ASR) for these languages. The definition of under-resourced languages and the challenges associated with them are first presented. The main part of the paper is a literature review of the recent (last 8 years) contributions made in ASR for under-resourced languages. Examples of past projects and future trends when dealing with under-resourced languages are also presented. We believe that this paper will be a good starting point for anyone interested in initiating research in (or operational development of) ASR for one or several under-resourced languages. It should be clear, however, that many of the issues and approaches presented here apply to speech technology in general (text-to-speech synthesis for instance).


“…Unsupervised language model domain adaptation using SMT (English to Japanese) text was proposed back in 2002 by Nakajima [12]. This paper only reports language model perplexity results, without investigating the implications on a full ASR system.…”

Section: Related Work on SMT-based Domain Adaptation for ASR (mentioning)

confidence: 99%

“…This issue was recently dealt with for some under-resourced languages such as Thai [7], Amharic [8] and Vietnamese [3]. This is not only true for under-resourced languages, but the collection of textual data in a given language (and for a given domain) is also a hot topic that can be addressed using the Web as a corpus [9,10,11] or using machine translation systems to port text corpora from one language to another [12,13,14].…”

Section: Introduction (mentioning)

confidence: 99%

SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

Cucu, Buzo, Besacier, et al. 2014

Speech Communication


This study investigates the possibility of using statistical machine translation to create domain-specific language resources. We propose a methodology that aims to create a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine translated text, leading to a significant boost in ASR performance. An in-depth analysis, to explain why and how the machine translated text improves the performance of the domain-specific ASR, is also made at the end of this paper. As by-products of this core domain-adaptation methodology, this paper also presents the first large vocabulary continuous speech recognition system for Romanian, and introduces a diacritics restoration module to process the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary.


“…Beyond this, we bootstrap the syntax-based language model using the additional data generated by a syntax-based MT system. To our knowledge, the only previous work addressing this issue is Nakajima et al [2002]. They adapted an n-gram language model with the data generated by a word-based MT system.…”

Section: Related Work (mentioning)

confidence: 99%

Language Modeling for Syntax-Based Machine Translation Using Tree Substitution Grammars

Xiao, Zhu 2011

ACM Transactions on Asian Language Information Processing


The poor grammatical output of Machine Translation (MT) systems calls for syntax-based approaches within language modeling. However, previous studies showed that syntax-based language modeling using (Context-Free) Treebank Grammars was not very helpful in improving BLEU scores for Chinese-English machine translation. In this article we further study this issue in the context of Chinese-English syntax-based Statistical Machine Translation (SMT) where Synchronous Tree Substitution Grammars (STSGs) are utilized to model the translation process. In particular, we develop a Tree Substitution Grammar-based language model for syntax-based MT, and present three methods to efficiently integrate the proposed language model into MT decoding. In addition, we design a simple and effective method to adapt syntax-based language models for MT tasks. We demonstrate that the proposed methods are able to benefit a state-of-the-art syntax-based MT system. On the NIST Chinese-English MT evaluation corpora, we finally achieve an improvement of 0.6 BLEU points over the baseline. ACM Reference Format: Xiao, T., Zhu, J., and Zhu, M. 2011. Language modeling for syntax-based machine translation using tree substitution grammars: A case study on Chinese-English translation. ACM Trans. Asian Lang. Inform.
