2000
DOI: 10.1109/5.880083
Two decades of statistical language modeling: where do we go from here?

Abstract: Statistical Language Models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to integration of linguistic theories with data.

Cited by 514 publications (328 citation statements)
References 58 publications
“…The second module is Chunked-Off Markov Model [3] training the database with corpus sentences in which all the nouns and named entities are replaced with their respective type. This is implemented using the tagging and chunking operations of NLTK.…”
Section: Chunked-Off Markov Model
confidence: 99%
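The "chunked-off" idea in the statement above can be sketched in plain Python. The type map below is entirely hypothetical; in the cited work the noun and entity types come from NLTK's tagging and chunking operations rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical noun/entity types; in the cited approach these would be
# produced by NLTK's POS tagging and named-entity chunking.
TYPE_MAP = {
    "Paris": "<LOCATION>",
    "London": "<LOCATION>",
    "Alice": "<PERSON>",
    "Bob": "<PERSON>",
    "ticket": "<NOUN>",
    "train": "<NOUN>",
}

def chunk_off(sentence):
    """Replace each noun or named entity with its type token."""
    return [TYPE_MAP.get(tok, tok) for tok in sentence.split()]

def train_bigram_counts(sentences):
    """Collect bigram counts over type-replaced (chunked-off) sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        toks = ["<s>"] + chunk_off(s) + ["</s>"]
        for prev, cur in zip(toks, toks[1:]):
            counts[prev][cur] += 1
    return counts

corpus = [
    "Alice booked a ticket to Paris",
    "Bob booked a train to London",
]
counts = train_bigram_counts(corpus)
print(counts["<PERSON>"]["booked"])  # both sentences share this bigram
```

Because both corpus sentences collapse onto the same type sequence, the model pools their counts, which is the point of chunking off: sparse surface forms are traded for denser statistics over types.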
“…The basic idea is that we assume there are k latent common themes in all collections. Each is characterized by a multinomial word distribution (also called a unigram language model [10]). We then assume that a document is a sample of a mixture model with these theme models as components.…”
Section: The General Problem
confidence: 99%
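The mixture-of-themes model in the statement above can be written down directly: each of the k themes is a unigram word distribution, and a document's probability is a weighted sum of the per-theme probabilities. The theme distributions and weights below are invented for illustration; in practice they would be estimated, e.g. by EM.

```python
import math

# Two hypothetical theme models (unigram word distributions) with
# equal mixture weights; real values would be learned from data.
themes = [
    {"sports": 0.5, "game": 0.3, "data": 0.1, "model": 0.1},
    {"sports": 0.1, "game": 0.1, "data": 0.4, "model": 0.4},
]
weights = [0.5, 0.5]

def doc_log_likelihood(doc, themes, weights):
    """log p(doc) under a mixture of unigram language models:
    p(doc) = sum_k w_k * prod_{w in doc} p(w | theme_k)."""
    total = 0.0
    for weight, theme in zip(weights, themes):
        p = weight
        for w in doc:
            p *= theme.get(w, 1e-9)  # tiny floor for unseen words
        total += p
    return math.log(total)

doc = ["data", "model", "model"]
ll = doc_log_likelihood(doc, themes, weights)
print(ll)
```

Note that this sketch assigns a whole document to a mixture of themes at the document level; the second theme dominates the likelihood here because the document's words are probable under it.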
“…Language models (LMs) are essential for automatic speech recognition or statistical machine translation (Rosenfeld, 2000). The performance of LMs strongly depends on quality and quantity of their training data.…”
Section: Introduction
confidence: 99%