2015
DOI: 10.1016/j.csl.2014.10.002

Linguistically-augmented perplexity-based data selection for language models

Abstract: This paper explores the use of linguistic information for the selection of data to train language models. We depart from the state-of-the-art method in perplexity-based data selection and extend it in order to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge as well as the surface forms ((1) naïve selection of the top ranked sentences selected by each method…
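The criterion the abstract builds on is, at its core, Moore-and-Lewis-style cross-entropy difference scoring, applied to a chosen word-level representation (surface forms, lemmas or POS tags). The following is a minimal, illustrative sketch of that idea, assuming pre-tokenized sentences and using a toy add-one-smoothed unigram model in place of the n-gram language models the paper would use; all function names are hypothetical, not the authors' implementation.

# Minimal sketch of perplexity-based data selection (Moore & Lewis style
# cross-entropy difference), generalized to arbitrary word-level units
# (surface forms, lemmas, POS tags). Illustrative only, not the authors' code.
import math
from collections import Counter

def train_unigram(sentences):
    """Add-one-smoothed unigram model over pre-tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    return counts, total, vocab

def cross_entropy(sent, model):
    """Per-token cross-entropy (bits) of a sentence under a unigram model."""
    counts, total, vocab = model
    logp = sum(math.log2((counts[tok] + 1) / (total + vocab)) for tok in sent)
    return -logp / max(len(sent), 1)

def select(in_domain, general, top_k):
    """Rank general-corpus sentences by H_I(s) - H_G(s); lower means more
    in-domain-like. Sentences are lists of units (words, lemmas, or tags)."""
    lm_in = train_unigram(in_domain)
    lm_gen = train_unigram(general)
    scored = sorted(general,
                    key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen))
    return scored[:top_k]

# Toy usage: the "sentences" could equally be sequences of lemmas or POS tags.
in_domain = [["the", "patient", "received", "treatment"]]
general = [["stocks", "fell", "sharply"], ["the", "patient", "recovered"]]
print(select(in_domain, general, top_k=1))

In the paper's setting this ranking would be computed independently for each linguistic view, and the per-view rankings then combined (e.g. by naïvely taking the top-ranked sentences from each method, as the abstract describes).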

Cited by 17 publications (9 citation statements)
References 12 publications
“…The LDA technique has been widely explored to form unsupervised adapted language models [17] and topic-specific language models for inflectional languages [18]. For many languages, the linguistic word-level approach [19], the syntactico-statistical approach [19] and the statistical phrase-level approach [20] have been used to build adapted language models that improve the speech recognition rate. Document retrieval from web content [21, 22]…”
Section: Related Work
confidence: 99%
“…Many linguistically rich languages have adopted the word-level linguistic approach for the generation of a better LM. Toral et al [15] used word-level linguistic units such as lemmas, Named Entity Recognition (NER) categories and Part-of-Speech (POS) tags. In that paper [15], two kinds of LMs are created: one from a domain-specific corpus and one from a random subset of a general corpus, of the same size as the domain-specific corpus.…”
Section: Related Work
confidence: 99%
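The quoted setup presupposes that each sentence is available in several parallel word-level views (surface form, lemma, POS tag, named-entity category). As a hedged illustration of that preprocessing step, and not the authors' actual pipeline, the snippet below derives the streams with spaCy, assuming the en_core_web_sm model is installed (pip install spacy && python -m spacy download en_core_web_sm):

# Hypothetical preprocessing step: map each sentence to the word-level
# linguistic units the paper selects over (lemmas, POS tags, NE categories).
# spaCy is used purely for illustration; the authors' toolchain may differ.
import spacy

nlp = spacy.load("en_core_web_sm")

def linguistic_views(text):
    """Return parallel token streams: surface forms, lemmas, POS tags,
    and tokens with named entities replaced by their category label."""
    doc = nlp(text)
    surface = [t.text for t in doc]
    lemmas = [t.lemma_ for t in doc]
    pos = [t.pos_ for t in doc]
    # Replace each entity token with its category (e.g. PERSON, GPE);
    # keep the surface form elsewhere.
    ner = [t.ent_type_ if t.ent_type_ else t.text for t in doc]
    return surface, lemmas, pos, ner

print(linguistic_views("Alice visited Dublin in 2014."))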
“…In Mansour et al (2011), the cross-entropy score is used for language model filtering, together with a translation model score that estimates the likelihood that a source and a target sentence are translations of each other. Toral et al (2015) introduced linguistic information such as lemmas, named entities and part-of-speech tags into the preprocessing of the data and then ranked the sentences by perplexity.…”
Section: Related Work
confidence: 99%
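For reference, the perplexity-based criterion these statements refer to is commonly written as the cross-entropy difference of Moore and Lewis (2010); a standard formulation (with symbols as defined here, not taken from the paper) is:

% Cross-entropy difference selection score: keep general-corpus
% sentences s with the lowest score, i.e. those that look likely
% under the in-domain LM but ordinary under the general LM.
\[
  \operatorname{score}(s) = H_{I}(s) - H_{G}(s)
\]
% H_I and H_G are the per-word cross-entropies of s under language
% models trained on the in-domain corpus I and the general corpus G.

The paper's extension changes what the two language models are trained on (lemma, NE-category or POS streams instead of surface forms) while keeping this ranking scheme intact.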