2016
DOI: 10.1103/PhysRevX.6.021009

Similarity of Symbol Frequency Distributions with Heavy Tails

Abstract: Quantifying the similarity between symbolic sequences is a traditional problem in information theory which requires comparing the frequencies of symbols in different sequences. In numerous modern applications, ranging from DNA and music to texts, the distribution of symbol frequencies is characterized by heavy-tailed distributions (e.g., Zipf's law). The large number of low-frequency symbols in these distributions poses major difficulties to the estimation of the similarity between sequences; e.g., they hinde…
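
The difficulty described in the abstract can be illustrated with a naive plug-in estimate of the Jensen-Shannon divergence between two empirical symbol-frequency distributions. The sketch below is only an illustration, not the paper's estimator; the toy texts, function names, and base-2 normalization are assumptions made for the example.

# Minimal sketch (illustrative, not the paper's method): plug-in
# Jensen-Shannon divergence between two empirical symbol-frequency
# distributions obtained from toy token sequences.
from collections import Counter
import numpy as np

def frequencies(tokens):
    """Empirical symbol-frequency distribution of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def jsd(p, q, base=2.0):
    """Plug-in Jensen-Shannon divergence of two dicts symbol -> probability."""
    support = sorted(set(p) | set(q))
    a = np.array([p.get(s, 0.0) for s in support])
    b = np.array([q.get(s, 0.0) for s in support])
    m = 0.5 * (a + b)
    def kl(x, y):
        mask = x > 0
        return np.sum(x[mask] * np.log(x[mask] / y[mask])) / np.log(base)
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

# Toy texts (hypothetical): with Zipf-like vocabularies most symbols are rare,
# so this naive estimate depends strongly on sample size.
text_a = "the cat sat on the mat and the cat slept".split()
text_b = "the dog sat on the rug and the dog barked".split()
print(jsd(frequencies(text_a), frequencies(text_b)))

For heavy-tailed (Zipf-like) vocabularies, most symbols occur only once or twice, so such a plug-in estimate is strongly sample-size dependent; this finite-size bias is the difficulty the abstract alludes to.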


Citations: cited by 32 publications (74 citation statements).
References: 35 publications.
“…For example, several recent studies engage in establishing information-theoretic and corpus-based methods for linguistic typology, i.e., classifying and comparing languages according to their information encoding potential [10,12–16], and how this potential evolves over time [17–19]. Similar methods have been applied to compare and distinguish non-linguistic sequences from written language [20,21], though it is controversial whether this helps with more fine-grained distinctions between symbolic systems and written language [22,23].…”
Section: Introduction (mentioning)
Confidence: 99%
“…The Jensen-Shannon distance is a quantity that measures the similarity of different series and is useful in the field of complex systems [40,41]. For two networks whose Laplacian-matrix eigenvalue sequences are serialized into densities P(x) and Q(x), respectively, by the Gaussian kernel $g(x,\gamma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\gamma)^2}{2\sigma^2}\right)$, the Jensen-Shannon distance can be expressed as:…”
Section: Multilayer Network of WMTN (mentioning)
Confidence: 99%
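
A minimal sketch of the recipe quoted above, assuming the excerpt's Gaussian-kernel serialization of Laplacian eigenvalue spectra; the bandwidth sigma, the evaluation grid, and the toy eigenvalue sets are illustrative choices, not values from the cited work.

# Sketch: smooth each network's Laplacian eigenvalue spectrum with the
# Gaussian kernel g(x, gamma), normalize on a common grid, then take the
# Jensen-Shannon distance (square root of the divergence) between the two.
import numpy as np

def spectral_density(eigenvalues, grid, sigma=0.1):
    """Serialize an eigenvalue sequence with the Gaussian kernel g(x, gamma)."""
    diffs = grid[:, None] - np.asarray(eigenvalues, dtype=float)[None, :]
    dens = np.exp(-diffs**2 / (2.0 * sigma**2)).sum(axis=1)
    dens /= np.sqrt(2.0 * np.pi * sigma**2) * len(eigenvalues)
    dx = grid[1] - grid[0]
    return dens / (dens.sum() * dx)          # renormalize on the finite grid

def js_distance(p, q, dx):
    """Square root of the Jensen-Shannon divergence of two densities."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Toy spectra: Laplacian eigenvalues of the path graph P3 and the triangle K3.
grid = np.linspace(-1.0, 5.0, 600)
dx = grid[1] - grid[0]
P = spectral_density([0.0, 1.0, 3.0], grid)
Q = spectral_density([0.0, 3.0, 3.0], grid)
print(js_distance(P, Q, dx))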