On the origin of long-range correlations in texts

Altmann, Eduardo G.; Cristadoro, Giampaolo; Esposti, Mirko Degli

doi:10.1073/pnas.1117723109

Cited by 95 publications

(87 citation statements)

References 36 publications

“…When applied to natural language, our results show that the temporal organization of natural languages (with some differences between them) exhibits more complex structure than the sequences constructed by randomizations. These results are also concordant with previous studies, which report the presence of long-range correlations in written texts [33,34]. …”

Section: Multiscale Entropy Analysis Of Texts (supporting)

confidence: 83%

Evaluating the Irregularity of Natural Languages

Hernández-Gómez, Basurto-Flores, Obregón-Quintana, et al. 2017

Entropy

Abstract: In the present work, we quantify the irregularity of different European languages belonging to four linguistic families (Romance, Germanic, Uralic and Slavic) and an artificial language (Esperanto). We modified a well-known method to calculate the approximate and sample entropy of written texts. We find differences in the degree of irregularity between the families, and our method, which is based on the search for regularities in a sequence of symbols, consistently distinguishes between natural and synthetic randomized texts. Moreover, we extended our study to the case where multiple scales are accounted for, such as the multiscale entropy analysis. Our results revealed that real texts have non-trivial structure compared to the ones obtained from randomization procedures.

“…However, real sentences are not random sequences of words, as research on long correlations in physics has been showing for more than a decade, e.g. [61,62]. Second, although such null model predictor has been tested previously on uniformly random trees [28], one cannot assume the predictor will work on real sentences given the substantial statistical differences between uniformly random trees and real syntactic dependency trees [18].…”

Section: Random Linear Arrangement With Some Knowledge About Depen… (mentioning)

confidence: 99%

Scarcity of crossing dependencies: A direct outcome of a specific constraint?

Gómez-Rodríguez, Ferrer-i-Cancho 2017

Phys. Rev. E

Abstract: The structure of a sentence can be represented as a network where vertices are words and edges indicate syntactic dependencies. Interestingly, crossing syntactic dependencies have been observed to be infrequent in human languages. This leads to the question of whether the scarcity of crossings in languages arises from an independent and specific constraint on crossings. We provide statistical evidence suggesting that this is not the case, as the proportion of dependency crossings of sentences from a wide range of languages can be accurately estimated by a simple predictor based on a null hypothesis on the local probability that two dependencies cross given their lengths. The relative error of this predictor never exceeds 5% on average, whereas the error of a baseline predictor assuming a random ordering of the words of a sentence is at least 6 times greater. Our results suggest that the low frequency of crossings in natural languages is neither originated by hidden knowledge of language nor by the undesirability of crossings per se, but as a mere side effect of the principle of dependency length minimization.

“…It has important applications beyond the traditional purview of physics, as well [1][2][3][4][5], including applications to music [4,6], genomics [7,8] and human languages [9][10][11][12].…”

Section: Introduction (mentioning)

confidence: 99%

Critical Behavior in Physics and Probabilistic Formal Languages

Lin, Tegmark 2017

Entropy

Abstract: We show that the mutual information between two symbols, as a function of the number of symbols between the two, decays exponentially in any probabilistic regular grammar, but can decay like a power law for a context-free grammar. This result about formal languages is closely related to a well-known result in classical statistical mechanics that there are no phase transitions in dimensions fewer than two. It is also related to the emergence of power law correlations in turbulence and cosmological inflation through recursive generative processes. We elucidate these physics connections and comment on potential applications of our results to machine learning tasks like training artificial recurrent neural networks. Along the way, we introduce a useful quantity, which we dub the rational mutual information, and discuss generalizations of our claims involving more complicated Bayesian networks.
