Zipf's law has intrigued people for a long time. This distribution models a certain type of statistical regularity observed in a text. George K. Zipf showed that, if a word is characterised by its frequency, then rank and frequency are not independent and approximately satisfy the relationship: rank × frequency ≈ constant. Various explanations have been advanced for this law. In this article, we discuss the Mandelbrot process, which includes two very different approaches. In the first approach, Mandelbrot studies language generation as the transmission of a signal and bases it on information theory, using the concept of entropy. In the second, geometric approach, he draws a parallel with fractal theory, where each word of the text is a sequence of characters framed by two separators, that is, a simple geometric pattern. This leads us to hypothesise that, since the observed statistical regularities have several possible explanations, Zipf's law carries other patterns. To verify this hypothesis, we chose a text, which we modified and degraded in several successive stages. We call T_i the text degraded at step i. We then segmented T_i into words. We found that rank and frequency were not independent and approximately satisfied a relationship, Eq. (1), whose coefficient b_i increases with each step i. We call Eq. (1) the generalized Zipf law. We found statistical regularities in the deconstruction of the text. We notably observed a linear relationship between the entropy H_i and the amount of effort E_i of the various degraded texts T_i. To verify our assumptions, we degraded a text of approximately 200 pages. At each step, we calculated various parameters such as entropy, the amount of effort, and the
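The rank-frequency relationship and the entropy mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure. The tokenizer, the function names, and the power-law form frequency ∝ rank^(-b) fitted by least squares on log-log values are all assumptions made for illustration, since the exact form of Eq. (1) is not preserved here. The amount of effort is left out because the abstract does not define it.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent.

    The regex tokenizer is an assumption; the abstract does not say
    how the texts were segmented into words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def zipf_exponent(counts):
    """Estimate b in the assumed form frequency ~ C / rank**b.

    Uses an ordinary least-squares fit of log(frequency) against
    log(rank) and returns the negated slope.
    """
    if len(counts) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def shannon_entropy(counts):
    """Shannon entropy, in bits, of the empirical word distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat run"
    counts = rank_frequency(sample)
    print("counts by rank:", counts)
    print("fitted exponent b:", zipf_exponent(counts))
    print("entropy (bits):", shannon_entropy(counts))
```

Under these assumptions, running the same three functions on each degraded text T_i would give one exponent and one entropy value per step, which is the kind of per-step comparison the abstract describes.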
Natural language processing raises the problem of ambiguities and of the multiple solutions that follow from them. Experience gained with the morphosyntactic analyser CRISSTAL showed how necessary it was to overcome this issue. The architecture with sequential levels, in which each module corresponds to a linguistic level (preprocessing, morphology, syntax, semantics), has shown its limits. A sequential architecture does not allow a real exchange between the different modules. As a result, the linguistic information needed to reduce ambiguities is not available at the moment it is needed. The need for cooperation between the different modules has led us to envisage a new architecture that stems from the techniques of distributed artificial intelligence.
Keywords: integration environment for linguistic tools, natural language, written French, distributed artificial intelligence, multi-agent systems, law-governed systems, communication protocol.
Proc. of COLING-92, Nantes, Aug. 23-28, 1992