We study n-gram dictionaries and estimate their coverage and entropy based on an English-language web corpus. We propose a method for estimating the coverage of empirically generated dictionaries and an approach to mitigating the disadvantage of low coverage. Building on the ideas of Kolmogorov's combinatorial approach, we estimate the n-gram entropy of the English language and use mathematical extrapolation to approximate the marginal entropy. In addition, we approximate the number of all possible legal n-grams in the English language for large n-gram orders.
We estimate the n-gram entropies of English-language texts, using dictionaries and taking punctuation into account, and find a heuristic method for estimating the marginal entropy. We propose a method for evaluating the coverage of empirically generated dictionaries and an approach to address the disadvantage of low coverage. In addition, we estimate the probability of obtaining a meaningful text by directly iterating through all possible n-grams of the alphabet and conclude that this is feasible only for very short text segments.
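The empirical n-gram entropy described above can be illustrated with a minimal sketch. The function below is a generic Shannon-entropy estimate over the observed n-gram frequency distribution of a text; it is an illustrative implementation, not the exact procedure from the paper (which works over large corpora and applies extrapolation).

```python
from collections import Counter
from math import log2

def ngram_entropy(text: str, n: int) -> float:
    """Empirical Shannon entropy (in bits) of the n-gram
    frequency distribution of `text`, using a sliding window."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy sample text (hypothetical; a real estimate needs a large corpus).
sample = "the quick brown fox jumps over the lazy dog"
h1 = ngram_entropy(sample, 1)  # unigram (per-character) entropy
h2 = ngram_entropy(sample, 2)  # bigram entropy
```

For a fully repetitive string the estimate is zero, and joint (higher-order) n-gram entropy is never below the unigram entropy, which matches the intuition that longer contexts carry at least as much information.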
When studying the cryptographic properties of information-protection algorithms, an important step is the construction of theoretical and experimental models of message sources. This article presents a statistical analysis of the properties of lexical and n-gram models of the Russian language based on a news text corpus. A specialized corpus of recent political news articles was created, reflecting a narrow domain of language use. Dictionaries of tokens and n-grams were compiled, and the coverage of these dictionaries as well as their entropy values were computed. The source text corpus was lemmatized, and the growth of dictionary size was extrapolated as a function of increasing corpus size.
Keywords: n-gram dictionaries, n-gram entropy, meaningful texts.
Bibliography: 15 titles.
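The notion of dictionary coverage used in these abstracts can be sketched as follows: build the set of distinct n-grams from a training corpus, then measure what share of a test text's n-gram occurrences fall inside that set. This is a simplified illustration under the assumption of character-level n-grams; the papers also consider token dictionaries and lemmatization.

```python
def build_ngram_dictionary(corpus: str, n: int) -> set:
    """Compile the set of distinct n-grams observed in the corpus."""
    return {corpus[i:i + n] for i in range(len(corpus) - n + 1)}

def coverage(dictionary: set, text: str, n: int) -> float:
    """Share of the test text's n-gram occurrences that are
    present in the dictionary (a value in [0, 1])."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in dictionary)
    return hits / len(grams)

# Hypothetical toy corpora for illustration only.
train = "news text about politics and more news text"
test = "more text about news"
d2 = build_ngram_dictionary(train, 2)
cov = coverage(d2, test, 2)
```

A dictionary trivially covers its own training corpus completely; the interesting quantity is coverage on held-out text, which falls as n grows and motivates the extrapolation of dictionary growth mentioned above.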
The paper studies a procedure for restoring discrete segments of an unknown source message based on information about the possible variants of each sign. An algorithm is proposed based on compiling dictionaries of appropriate lengths, searching for text sections whose total number of character variants does not exceed a given bound, and then iterating through and eliminating false variants of dictionary values. The statistical properties of short-length text dictionaries are investigated, and extrapolation estimates are made for long-length texts. The main mathematical properties of this algorithm are described. Theoretical studies of the effectiveness of the procedure are carried out within the framework of a probability-theoretic model.
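The core step of the restoration procedure, as described above, can be sketched in a few lines: for a segment whose number of character-variant combinations stays under a bound, enumerate all candidate strings and keep only those found in the dictionary. The names, the bound value, and the toy dictionary below are illustrative assumptions, not the paper's actual parameters.

```python
from itertools import product
from math import prod

def restore_segment(variants, dictionary, bound=10_000):
    """Given a list of per-position candidate character sets, enumerate
    all combinations (if their count does not exceed `bound`) and
    eliminate candidates absent from the dictionary."""
    if prod(len(v) for v in variants) > bound:
        return None  # segment too ambiguous for direct enumeration
    candidates = {''.join(chars) for chars in product(*variants)}
    return candidates & dictionary

# Hypothetical example: each position admits two possible letters.
dictionary = {"cat", "cot", "dog"}
variants = [{"c", "d"}, {"a", "o"}, {"t", "g"}]
result = restore_segment(variants, dictionary)
```

Here 2 × 2 × 2 = 8 candidate strings are enumerated and five false variants are eliminated by the dictionary check, leaving the three dictionary words as surviving readings of the segment.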