6th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2018) 2018
DOI: 10.21437/sltu.2018-13
Mining Training Data for Language Modeling Across the World's Languages

Cited by 8 publications (10 citation statements)
References 0 publications
“…We therefore use smaller language models over shorter contexts.
- Word language models: For languages using spaces to separate words, we also use a word-based language model trained on a similar corpus as the character language models [4,39], using 3-grams pruned to between 1.25 million and 1.5 million entries.
- Character classes: We add a scoring heuristic which boosts the score of characters from the language's alphabet.…”
Section: Feature Functions: Language Models and Character Classes
confidence: 99%
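The character-class heuristic quoted above can be sketched as a simple feature function that rewards characters drawn from the target language's alphabet. The `boost` weight and the set-membership formulation are assumptions for illustration; the excerpt does not specify how the heuristic is parameterized or combined with the language-model scores.

```python
def character_class_boost(text: str, alphabet: set[str], boost: float = 1.0) -> float:
    """Score boost proportional to how many characters of `text`
    belong to the language's alphabet.

    `boost` is a hypothetical tuning weight, not taken from the paper.
    """
    return boost * sum(1 for ch in text if ch in alphabet)


# Example with a fragment of the Spanish alphabet.
spanish = set("abcdefghijklmnopqrstuvwxyzáéíóúüñ")
print(character_class_boost("niño", spanish))  # 4.0
```

In a real decoder this score would be one feature among several (character LM, word LM, character classes), weighted and summed per hypothesis.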
“…Perhaps the simplest way to make a G2P system is to create a list of all the graphemes in the target language, e.g. from a source such as [3]. Then, for each grapheme, we assign the phoneme that most commonly corresponds to this grapheme in other languages for which we do have human-curated pronunciations or rule-based G2P, restricting ourselves to the phonemes that are known to appear in the target language.…”
Section: Inducing Rule-based G2P Finite-state Transducers
confidence: 99%
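The baseline G2P procedure described in this excerpt — map each grapheme to its most common phoneme across donor languages, restricted to the target language's phoneme inventory — can be sketched as below. The input format (`grapheme → Counter of phonemes`, aggregated over languages with curated pronunciations) is a hypothetical simplification, not the paper's actual data structure.

```python
from collections import Counter


def induce_g2p(grapheme_phoneme_counts: dict[str, Counter],
               target_phonemes: set[str]) -> dict[str, str]:
    """For each grapheme, pick the phoneme it most often maps to in
    donor languages, keeping only phonemes known to occur in the
    target language. Graphemes with no admissible phoneme are skipped.
    """
    mapping = {}
    for grapheme, counts in grapheme_phoneme_counts.items():
        candidates = [(n, p) for p, n in counts.items() if p in target_phonemes]
        if candidates:
            mapping[grapheme] = max(candidates)[1]  # highest count wins
    return mapping


# Toy donor-language statistics (illustrative values only).
counts = {
    "c": Counter({"k": 12, "s": 7, "tʃ": 3}),
    "a": Counter({"a": 20, "ɑ": 5}),
}
print(induce_g2p(counts, target_phonemes={"k", "s", "a"}))
# {'c': 'k', 'a': 'a'}
```

Note how "ɑ" is discarded because it is absent from the target inventory, exactly the restriction the excerpt describes.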
“…Over 7,000 languages are spoken in our world today, out of which nearly 4,000 are known to have a written form [1]. And in fact, text data can easily be found online in well over 2,000 languages [2,3]. However, automatic speech recognition (ASR) systems are available only for around 100 language varieties.…”
Section: Introduction
confidence: 99%
“…With an induced number names grammar and a customized template verbalizer, basic verbalizations of major semiotic classes can be produced without the need for complex custom grammars, paving the way to scaling verbalization to more languages in the future. This research forms part of a wider research effort at Google investigating how language technology can be scaled to more languages quickly [8,9,10,11].…”
Section: Introduction
confidence: 99%
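The combination of an induced number names grammar and a template verbalizer mentioned above can be illustrated with a minimal sketch. Both the `number_names` lookup (standing in for the induced grammar) and the per-class template are hypothetical simplifications; the cited work uses a richer grammar formalism.

```python
def verbalize_measure(amount: str, unit: str, number_names: dict[str, str]) -> str:
    """Expand a MEASURE-class token into words using a number-names
    lookup plus a fixed template. Both inputs are illustrative stand-ins
    for the induced grammar and template verbalizer."""
    template = "{number} {unit}"
    return template.format(number=number_names[amount], unit=unit)


number_names = {"3": "three", "25": "twenty five"}
print(verbalize_measure("25", "kilograms", number_names))
# twenty five kilograms
```

The point of the template approach is that only the number-names grammar is language-specific; the per-class templates stay simple and reusable.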