2003
DOI: 10.1007/3-540-36456-0_40

A Corpus Balancing Method for Language Model Construction

Abstract: The language model is an important component of any speech recognition system. In this paper, we present a lexical enrichment methodology of corpora focused on the construction of statistical language models. This methodology considers, on one hand, the identification of the set of poorly represented words of a given training corpus, and on the other hand, the enrichment of the given corpus by the repetitive inclusion of selected text fragments containing these words. The first part of the paper des…

Cited by: 6 publications (7 citation statements)
References: 5 publications
“…This combination is based on the pertinence of the translations to the target document collection. This pertinence, as in the previous method, expresses how a given translation fits in (footnote 3: The n-gram model was constructed using the method described in [15].)…”
Section: Methods 2: "Combining Passages From Several Translations" (mentioning)
confidence: 99%
“…In particular, in Mexico there have been some interesting efforts related to the use of the web for the automatic construction of domain-specific ontologies [16], training sets for text classification tasks [6,7], and language models for speech recognition [28]. The following sections give a brief overview of these works.…”
Section: Extracting Information From the Web (mentioning)
confidence: 99%
“…The construction of this corpus is not a simple task since written texts do not represent adequately many phenomena of spontaneous speech. In order to alleviate this problem, [28] proposes the use of web documents as data source. This proposal was based on the fact that many people around the world contribute to create the web, and therefore, that most of its documents comprise informal contents and include many everyday as well as non-grammatical expressions used in spoken language.…”
Section: Tuning Task-specific Language Models Through Web Data (mentioning)
confidence: 99%
“…In addition to Keller and Lapata (this issue) and references therein, Volk (2001) gathers lexical statistics for resolving prepositional phrase attachments, and Villasenor-Pineda et al (2003) "balance" their corpus using Web documents.…”
Section: Some Current Themes (mentioning)
confidence: 99%