We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network-based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6. A combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
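As a quick sanity check of the quoted numbers (a minimal sketch; only the 67.6 baseline and the 35% relative perplexity reduction come from the abstract above, the rest is ordinary arithmetic), the two figures are consistent because perplexity is the exponential of cross-entropy, PPL = 2^H:

    import math

    baseline_ppl = 67.6                       # unpruned Kneser-Ney 5-gram baseline
    improved_ppl = baseline_ppl * (1 - 0.35)  # 35% relative perplexity reduction

    # Cross-entropy in bits is the base-2 log of perplexity (PPL = 2^H).
    baseline_bits = math.log2(baseline_ppl)
    improved_bits = math.log2(improved_ppl)

    relative_bits_reduction = 1 - improved_bits / baseline_bits
    print(f"{baseline_bits:.2f} -> {improved_bits:.2f} bits "
          f"({relative_bits_reduction:.1%} reduction)")  # ~10%, as stated

Running this prints roughly 6.08 -> 5.46 bits, about a 10% reduction in cross-entropy, matching the abstract.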
This paper presents an attempt at using the syntactic structure in natural language for improved language models for speech recognition. The structured language model merges techniques in automatic parsing and language modeling using an original probabilistic parameterization of a shift-reduce parser. A maximum likelihood re-estimation procedure belonging to the class of expectation-maximization algorithms is employed for training the model. Experiments on the Wall Street Journal and Switchboard corpora show improvement in both perplexity and word error rate (via word lattice rescoring) over the standard 3-gram language model.
An important goal at Google is to make spoken access ubiquitously available. Achieving ubiquity requires two things: availability (i.e., built into every possible interaction where speech input or output can make sense) and performance (i.e., works so well that the modality adds no friction to the interaction). This chapter is a case study of the development of Google Search by Voice, a step toward our long-term vision of ubiquitous access. While the integration of speech input into Google search is a significant step toward more ubiquitous access, it has posed many problems in terms of the performance of core speech technologies and the design of effective user interfaces. Work is ongoing and no doubt the problems are far from solved. Nonetheless, we have at the minimum achieved a level of performance showing that usage of voice search is growing rapidly, and that many users do indeed become repeat users.