1995
DOI: 10.1080/09296179508590051
Good-Turing frequency estimation without tears*

Cited by 244 publications (167 citation statements)
References 16 publications
“…From the full version of Table 1 we have N = Σ_r r·N_r = 1320515 and N_1 = 103978. Thus the Turing-Good estimate [8] of the amount of the probability mass missing is N_1/N ≈ 0.079, or 7.9%. This tells us that our estimate of the distribution of login frequencies is reasonably accurate, in that the bulk of the mass has been captured.…”
Section: How Many Login URLs Are There? (mentioning, confidence: 99%)
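The quoted calculation is simply the Good-Turing rule that the probability mass of unseen items is estimated by N_1/N, the number of singletons divided by the total count. A minimal sketch of that arithmetic in Python, assuming the input is a mapping from items (here, login URLs) to observed counts:

from collections import Counter

def missing_mass(counts):
    """Good-Turing estimate of the unseen probability mass, given item -> count."""
    n_r = Counter(counts.values())                # N_r: number of items seen exactly r times
    n_total = sum(r * n for r, n in n_r.items())  # N = sum over r of r * N_r
    return n_r.get(1, 0) / n_total                # N_1 / N

# With the totals quoted above (N_1 = 103978, N = 1320515):
print(103978 / 1320515)  # ≈ 0.0787, i.e. roughly 7.9% of the mass is unseen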
“…The standard means of estimating the probability mass of unseen species in a limited observation is the Good-Turing estimate [8]. From the full version of Table 1 we have N = Σ_r r·N_r = 1320515 and N_1 = 103978.…”
Section: How Many Login URLs Are There? (mentioning, confidence: 99%)
“…It also allows for missing bins and for the fact that the observed numbers are noisy estimates (i.e., subject to measurement error). These calculations are considerably more complicated (see Gale & Sampson, 1995, for an introduction), but can be circumvented by making use of existing software packages.…”
Section: The Good-Turing Algorithm (mentioning, confidence: 99%)
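The quoted passage does not name a particular package, but as one illustration of the "existing software packages" route, NLTK ships a Simple Good-Turing distribution. The class name and signature below reflect our understanding of that library and should be checked against its documentation:

from nltk.probability import FreqDist, SimpleGoodTuringProbDist

def smoothed_dist(tokens, bins=None):
    """Simple Good-Turing probability distribution over the types in `tokens`.

    `bins` is the assumed total number of types (seen plus unseen); if omitted,
    NLTK falls back to a default based on the observed vocabulary.
    """
    fd = FreqDist(tokens)
    # The internal log-linear fit needs a reasonably large sample; on toy data
    # NLTK may warn that it could not find a good fit.
    return SimpleGoodTuringProbDist(fd, bins=bins)

# Usage (illustrative): sgt = smoothed_dist(corpus_tokens)
#                       sgt.prob("the"), sgt.prob("some-unseen-word")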
“…We decided to use the frequencies from the subtitle corpus, because we think it gives a more accurate image of everyday language, which is the language FFL teaching is mainly concerned with. The frequencies were changed into probabilities, and smoothed with the Simple Good-Turing algorithm described by Gale and Sampson (1995). This step is necessary to solve another well-known problem in language models: the appearance in a new text of previously unseen lemmas.…”
Section: The Language Model: Probabilities and Smoothing (mentioning, confidence: 99%)
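For readers who want to see what the Simple Good-Turing smoothing referred to above involves, the core of Gale and Sampson's (1995) procedure can be sketched compactly. The sketch below is a simplification: it applies the log-linear estimate at every frequency and omits the paper's Turing/LGT switch rule and variance test, so it is illustrative rather than a faithful reimplementation.

import math
from collections import Counter

def simple_good_turing(counts):
    """Simplified Simple Good-Turing smoothing (after Gale & Sampson, 1995).

    `counts` maps each observed type to its frequency. Returns (p0, probs):
    p0 is the total probability mass reserved for unseen types, and probs maps
    each seen type to its smoothed probability.
    """
    n_r = Counter(counts.values())          # frequency of frequencies, N_r
    rs = sorted(n_r)                        # observed frequencies r (needs several distinct values)
    n_total = sum(r * n_r[r] for r in rs)   # N = sum over r of r * N_r
    p0 = n_r.get(1, 0) / n_total            # unseen mass = N_1 / N

    # Z_r averages N_r over the gap to neighbouring nonzero r (handles missing bins).
    z = {}
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0
        t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q
        z[r] = n_r[r] / (0.5 * (t - q))

    # Least-squares fit of log Z_r = a + b * log r.
    xs = [math.log(r) for r in rs]
    ys = [math.log(z[r]) for r in rs]
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar

    def s(r):
        return math.exp(a + b * math.log(r))  # smoothed N_r

    # Adjusted counts r* = (r + 1) * S(r + 1) / S(r); renormalise so that the
    # probabilities of seen types sum to 1 - p0.
    r_star = {r: (r + 1) * s(r + 1) / s(r) for r in rs}
    seen_total = sum(r_star[r] * n_r[r] for r in rs)
    probs = {w: (1 - p0) * r_star[c] / seen_total for w, c in counts.items()}
    return p0, probs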