We study n-gram dictionaries and estimate their coverage and entropy based on an English-language web corpus. We propose a method for estimating the coverage of empirically generated dictionaries and an approach to mitigating the disadvantage of low coverage. Building on the ideas of Kolmogorov's combinatorial approach, we estimate the n-gram entropy of the English language and use mathematical extrapolation to approximate the marginal entropy. In addition, we approximate the number of all possible legal n-grams in the English language for large n-gram orders.
We estimate the n-gram entropies of English-language texts, using dictionaries and taking punctuation into account, and find a heuristic method for estimating the marginal entropy. We propose a method for evaluating the coverage of empirically generated dictionaries and an approach to address the disadvantage of low coverage. In addition, we estimate the probability of obtaining a meaningful text by directly iterating through all possible n-grams of the alphabet and conclude that this is feasible only for very short text segments.
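The empirical n-gram entropy described above can be illustrated with a minimal sketch. The function below is a generic Shannon-entropy estimate over the observed n-gram frequency distribution of a text; it is an illustrative implementation, not the exact procedure from the paper (which works over large corpora and applies extrapolation).

```python
from collections import Counter
from math import log2

def ngram_entropy(text: str, n: int) -> float:
    """Empirical Shannon entropy (in bits) of the n-gram
    frequency distribution of `text`, using a sliding window."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy sample text (hypothetical; a real estimate needs a large corpus).
sample = "the quick brown fox jumps over the lazy dog"
h1 = ngram_entropy(sample, 1)  # unigram (per-character) entropy
h2 = ngram_entropy(sample, 2)  # bigram entropy
```

For a fully repetitive string the estimate is zero, and joint (higher-order) n-gram entropy is never below the unigram entropy, which matches the intuition that longer contexts carry at least as much information.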
When studying the cryptographic properties of information-protection algorithms, an important step is the construction of theoretical and experimental models of message sources. This article presents a statistical analysis of the properties of lexical and n-gram models of the Russian language based on a news text corpus. A specialized corpus of recent political news articles was created, reflecting a narrow domain of language use. Dictionaries of tokens and n-grams were compiled, and the coverage of these dictionaries as well as their entropy values were computed. The source text corpus was lemmatized, and the growth of dictionary size was extrapolated as a function of increasing corpus size.
Keywords: n-gram dictionaries, n-gram entropy, meaningful texts.
Bibliography: 15 titles.
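The notion of dictionary coverage used in these abstracts can be sketched as follows: build the set of distinct n-grams from a training corpus, then measure what share of a test text's n-gram occurrences fall inside that set. This is a simplified illustration under the assumption of character-level n-grams; the papers also consider token dictionaries and lemmatization.

```python
def build_ngram_dictionary(corpus: str, n: int) -> set:
    """Compile the set of distinct n-grams observed in the corpus."""
    return {corpus[i:i + n] for i in range(len(corpus) - n + 1)}

def coverage(dictionary: set, text: str, n: int) -> float:
    """Share of the test text's n-gram occurrences that are
    present in the dictionary (a value in [0, 1])."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in dictionary)
    return hits / len(grams)

# Hypothetical toy corpora for illustration only.
train = "news text about politics and more news text"
test = "more text about news"
d2 = build_ngram_dictionary(train, 2)
cov = coverage(d2, test, 2)
```

A dictionary trivially covers its own training corpus completely; the interesting quantity is coverage on held-out text, which falls as n grows and motivates the extrapolation of dictionary growth mentioned above.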
The paper studies a procedure for restoring discrete segments of an unknown source message based on information about the possible variants of each sign. An algorithm is proposed based on compiling dictionaries of appropriate lengths, searching for text sections whose total number of character variants does not exceed a given bound, and then iterating through and eliminating false variants of dictionary values. The statistical properties of short-length text dictionaries are investigated, and extrapolation estimates are made for long-length texts. The main mathematical properties of this algorithm are described. Theoretical studies of the effectiveness of the procedure are carried out within the framework of a probability-theoretic model.
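The core step of the restoration procedure, as described above, can be sketched in a few lines: for a segment whose number of character-variant combinations stays under a bound, enumerate all candidate strings and keep only those found in the dictionary. The names, the bound value, and the toy dictionary below are illustrative assumptions, not the paper's actual parameters.

```python
from itertools import product
from math import prod

def restore_segment(variants, dictionary, bound=10_000):
    """Given a list of per-position candidate character sets, enumerate
    all combinations (if their count does not exceed `bound`) and
    eliminate candidates absent from the dictionary."""
    if prod(len(v) for v in variants) > bound:
        return None  # segment too ambiguous for direct enumeration
    candidates = {''.join(chars) for chars in product(*variants)}
    return candidates & dictionary

# Hypothetical example: each position admits two possible letters.
dictionary = {"cat", "cot", "dog"}
variants = [{"c", "d"}, {"a", "o"}, {"t", "g"}]
result = restore_segment(variants, dictionary)
```

Here 2 × 2 × 2 = 8 candidate strings are enumerated and five false variants are eliminated by the dictionary check, leaving the three dictionary words as surviving readings of the segment.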