6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018) 2018
DOI: 10.21437/sltu.2018-13
Mining Training Data for Language Modeling Across the World's Languages

Cited by 8 publications (10 citation statements)
References 0 publications
“…We therefore use smaller language models over shorter contexts.
- Word language models: For languages using spaces to separate words, we also use a word-based language model trained on a similar corpus as the character language models [4,39], using 3-grams pruned to between 1.25 million and 1.5 million entries.
- Character classes: We add a scoring heuristic which boosts the score of characters from the language's alphabet.…”
Section: Feature Functions: Language Models and Character Classes
confidence: 99%
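The character-class heuristic quoted above can be sketched as a simple feature function that rewards characters drawn from the target language's alphabet. The `boost` weight and the set-membership formulation are assumptions for illustration; the excerpt does not specify how the heuristic is parameterized or combined with the language-model scores.

```python
def character_class_boost(text: str, alphabet: set[str], boost: float = 1.0) -> float:
    """Score boost proportional to how many characters of `text`
    belong to the language's alphabet.

    `boost` is a hypothetical tuning weight, not taken from the paper.
    """
    return boost * sum(1 for ch in text if ch in alphabet)


# Example with a fragment of the Spanish alphabet.
spanish = set("abcdefghijklmnopqrstuvwxyzáéíóúüñ")
print(character_class_boost("niño", spanish))  # 4.0
```

In a real decoder this score would be one feature among several (character LM, word LM, character classes), weighted and summed per hypothesis.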
“…Perhaps the simplest way to make a G2P system is to create a list of all the graphemes in the target language, e.g. from a source such as [3]. Then, for each grapheme, we assign the phoneme that most commonly corresponds to this grapheme in other languages for which we do have human-curated pronunciations or rule-based G2P, restricting ourselves to the phonemes that are known to appear in the target language.…”
Section: Inducing Rule-based G2P Finite-state Transducers
confidence: 99%
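The baseline G2P procedure described in this excerpt — map each grapheme to its most common phoneme across donor languages, restricted to the target language's phoneme inventory — can be sketched as below. The input format (`grapheme → Counter of phonemes`, aggregated over languages with curated pronunciations) is a hypothetical simplification, not the paper's actual data structure.

```python
from collections import Counter


def induce_g2p(grapheme_phoneme_counts: dict[str, Counter],
               target_phonemes: set[str]) -> dict[str, str]:
    """For each grapheme, pick the phoneme it most often maps to in
    donor languages, keeping only phonemes known to occur in the
    target language. Graphemes with no admissible phoneme are skipped.
    """
    mapping = {}
    for grapheme, counts in grapheme_phoneme_counts.items():
        candidates = [(n, p) for p, n in counts.items() if p in target_phonemes]
        if candidates:
            mapping[grapheme] = max(candidates)[1]  # highest count wins
    return mapping


# Toy donor-language statistics (illustrative values only).
counts = {
    "c": Counter({"k": 12, "s": 7, "tʃ": 3}),
    "a": Counter({"a": 20, "ɑ": 5}),
}
print(induce_g2p(counts, target_phonemes={"k", "s", "a"}))
# {'c': 'k', 'a': 'a'}
```

Note how "ɑ" is discarded because it is absent from the target inventory, exactly the restriction the excerpt describes.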
“…Over 7,000 languages are spoken in our world today, out of which nearly 4,000 are known to have a written form [1]. And in fact, text data can easily be found online in well over 2,000 languages [2,3]. However, automatic speech recognition (ASR) systems are available only for around 100 language varieties.…”
Section: Introduction
confidence: 99%
“…With an induced number names grammar and a customized template verbalizer, basic verbalizations of major semiotic classes can be produced without the need for complex custom grammars, paving the way to scaling verbalization to more languages in the future. This research forms part of a wider research effort at Google investigating how language technology can be scaled to more languages quickly [8,9,10,11].…”
Section: Introduction
confidence: 99%
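The combination of an induced number names grammar and a template verbalizer mentioned above can be illustrated with a minimal sketch. Both the `number_names` lookup (standing in for the induced grammar) and the per-class template are hypothetical simplifications; the cited work uses a richer grammar formalism.

```python
def verbalize_measure(amount: str, unit: str, number_names: dict[str, str]) -> str:
    """Expand a MEASURE-class token into words using a number-names
    lookup plus a fixed template. Both inputs are illustrative stand-ins
    for the induced grammar and template verbalizer."""
    template = "{number} {unit}"
    return template.format(number=number_names[amount], unit=unit)


number_names = {"3": "three", "25": "twenty five"}
print(verbalize_measure("25", "kilograms", number_names))
# twenty five kilograms
```

The point of the template approach is that only the number-names grammar is language-specific; the per-class templates stay simple and reusable.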