2015
DOI: 10.1016/j.csl.2014.10.002

Linguistically-augmented perplexity-based data selection for language models

Abstract: This paper explores the use of linguistic information for the selection of data to train language models. We depart from the state-of-the-art method in perplexity-based data selection and extend it in order to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge as well as the surface forms ((1) naïve selection of the top ranked sentences selected by each method…
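The criterion the abstract builds on is, at its core, Moore-and-Lewis-style cross-entropy difference scoring, applied to a chosen word-level representation (surface forms, lemmas or POS tags). The following is a minimal, illustrative sketch of that idea, assuming pre-tokenized sentences and using a toy add-one-smoothed unigram model in place of the n-gram language models the paper would use; all function names are hypothetical, not the authors' implementation.

# Minimal sketch of perplexity-based data selection (Moore & Lewis style
# cross-entropy difference), generalized to arbitrary word-level units
# (surface forms, lemmas, POS tags). Illustrative only, not the authors' code.
import math
from collections import Counter

def train_unigram(sentences):
    """Add-one-smoothed unigram model over pre-tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    return counts, total, vocab

def cross_entropy(sent, model):
    """Per-token cross-entropy (bits) of a sentence under a unigram model."""
    counts, total, vocab = model
    logp = sum(math.log2((counts[tok] + 1) / (total + vocab)) for tok in sent)
    return -logp / max(len(sent), 1)

def select(in_domain, general, top_k):
    """Rank general-corpus sentences by H_I(s) - H_G(s); lower means more
    in-domain-like. Sentences are lists of units (words, lemmas, or tags)."""
    lm_in = train_unigram(in_domain)
    lm_gen = train_unigram(general)
    scored = sorted(general,
                    key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen))
    return scored[:top_k]

# Toy usage: the "sentences" could equally be sequences of lemmas or POS tags.
in_domain = [["the", "patient", "received", "treatment"]]
general = [["stocks", "fell", "sharply"], ["the", "patient", "recovered"]]
print(select(in_domain, general, top_k=1))

In the paper's setting this ranking would be computed independently for each linguistic view, and the per-view rankings then combined (e.g. by naïvely taking the top-ranked sentences from each method, as the abstract describes).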

Cited by 17 publications (9 citation statements)
References 12 publications
“…The LDA technique has been widely explored to form unsupervised adapted language models [17] and topic-specific language models for inflectional languages [18]. For many languages, the linguistic word-level approach [19], the syntactico-statistical approach [19] and the statistical phrase-level approach [20] have been used to build adapted language models that improve the speech recognition rate. Document retrieval from web content [21, 22]…”
Section: Related Work
confidence: 99%
“…Many linguistically rich languages have adopted the word-level linguistic approach for the generation of a better LM. Toral et al [15] used word-level linguistic units such as lemmas, Named Entity Recognition (NER) categories and Part-of-Speech (POS) tags. In that paper [15], two kinds of LMs are created: one from a domain-specific corpus and one from a random subset of a general corpus, of the same size as the domain-specific corpus.…”
Section: Related Work
confidence: 99%
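The quoted setup presupposes that each sentence is available in several parallel word-level views (surface form, lemma, POS tag, named-entity category). As a hedged illustration of that preprocessing step, and not the authors' actual pipeline, the snippet below derives the streams with spaCy, assuming the en_core_web_sm model is installed (pip install spacy && python -m spacy download en_core_web_sm):

# Hypothetical preprocessing step: map each sentence to the word-level
# linguistic units the paper selects over (lemmas, POS tags, NE categories).
# spaCy is used purely for illustration; the authors' toolchain may differ.
import spacy

nlp = spacy.load("en_core_web_sm")

def linguistic_views(text):
    """Return parallel token streams: surface forms, lemmas, POS tags,
    and tokens with named entities replaced by their category label."""
    doc = nlp(text)
    surface = [t.text for t in doc]
    lemmas = [t.lemma_ for t in doc]
    pos = [t.pos_ for t in doc]
    # Replace each entity token with its category (e.g. PERSON, GPE);
    # keep the surface form elsewhere.
    ner = [t.ent_type_ if t.ent_type_ else t.text for t in doc]
    return surface, lemmas, pos, ner

print(linguistic_views("Alice visited Dublin in 2014."))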
“…In Mansour et al (2011), the cross-entropy score is used for language model filtering, together with a translation model score that estimates the likelihood that a source and a target sentence are translations of each other. Toral et al (2015) introduced linguistic information such as lemmas, named entities and part-of-speech tags into the preprocessing of the data and then ranked the sentences by perplexity.…”
Section: Related Work
confidence: 99%
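For reference, the perplexity-based criterion these statements refer to is commonly written as the cross-entropy difference of Moore and Lewis (2010); a standard formulation (with symbols as defined here, not taken from the paper) is:

% Cross-entropy difference selection score: keep general-corpus
% sentences s with the lowest score, i.e. those that look likely
% under the in-domain LM but ordinary under the general LM.
\[
  \operatorname{score}(s) = H_{I}(s) - H_{G}(s)
\]
% H_I and H_G are the per-word cross-entropies of s under language
% models trained on the in-domain corpus I and the general corpus G.

The paper's extension changes what the two language models are trained on (lemma, NE-category or POS streams instead of surface forms) while keeping this ranking scheme intact.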