2002
DOI: 10.1209/epl/i2002-00528-3

Keyword detection in natural languages and DNA

Abstract: We show that words in a text present long-range frequency fluctuations due to a strong self-attraction, that is directly related to the relevance of the term to the text considered. The standard deviation of the distance between successive occurrences of a word is an excellent parameter to quantify this self-attraction, and provides us with an effective tool for automatic keyword extraction. DNA sequences also present the same features: “words”, for example codons in the coding part of the sequences, attract b…
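The quantity named in the abstract, the standard deviation of the distance between successive occurrences of a word, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function name `spacing_sigma`, the regex tokenizer, the `min_count` cutoff, and the division by the mean spacing (so that frequent and rare words are comparable) are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from statistics import mean, pstdev


def spacing_sigma(text, min_count=3):
    """Score each word by the spread of its inter-occurrence distances.

    Hypothetical helper: returns {word: std(gaps) / mean(gaps)}.
    Dividing by the mean gap is an assumption of this sketch.
    """
    # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())

    # Record the token index of every occurrence of every word.
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    scores = {}
    for word, pos in positions.items():
        if len(pos) < min_count:
            continue  # too few occurrences to estimate a spread
        # Distances between successive occurrences of the same word.
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        scores[word] = pstdev(gaps) / mean(gaps)
    return scores


if __name__ == "__main__":
    sample = "k k k z z z z z z k"
    ranked = sorted(spacing_sigma(sample).items(), key=lambda kv: -kv[1])
    for word, score in ranked:
        print(word, round(score, 3))
```

Under this measure, a word that recurs at perfectly regular intervals scores 0, while a word whose occurrences bunch together and leave long gaps scores higher; sorting words by the score gives candidate keywords.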

Citations: cited by 107 publications (114 citation statements)
References: 19 publications (11 reference statements)
“…The representative of content words is "WHALE" which is the 23rd ranked word but the most common noun. The main conclusion in [35,36] is confirmed that function words such as THE tend to be evenly scattered whereas content words such as WHALE tends to be clustered, leaving huge gaps between clusters (and the low-rank-number spacings are much larger). When a power-law function is applied to fit the inter-word spacings, the fitting performance is not good.…”
Section: Ranked Inter-word Spacing Distribution (mentioning)
confidence: 89%
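The ranked inter-word spacings described in this statement can be sketched as follows: collect the distances between successive occurrences of one word and sort them so that rank 1 is the largest gap. This is only an illustrative sketch and not the citing paper's code. The function name `ranked_spacings` and the regex tokenizer are assumptions.

```python
import re


def ranked_spacings(text, word):
    """Return the gaps between successive occurrences of `word`,
    sorted in descending order (rank 1 = largest spacing).

    Hypothetical helper; the tokenizer is an assumption of this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    target = word.lower()
    # Token indices at which the word occurs.
    pos = [i for i, token in enumerate(tokens) if token == target]
    # Distances between successive occurrences.
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return sorted(gaps, reverse=True)
```

For a clustered word, the first few ranks hold a small number of very large gaps followed by many short ones; for an evenly scattered word, the ranked list is comparatively flat.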
“…Motivated by level statistics in quantum disorder systems, it has been proposed that distance between successive occurrences of the same word might be related to whether or not that word plays an important role in the text (the so-called "keyword") [35,36]. Similar studies of gap distributions are also common in bioinformatics, such as the in-frame start-to-stop codon distances (which defines "open reading frames") [37,38] or in-frame stop-stop codon distances [39].…”
Section: Ranked Inter-word Spacing Distribution (mentioning)
confidence: 99%
“…It has been widely employed to detect keywords in texts as an alternative to the tf-idf technique [53]. In addition, the intermittency has proven relevant to detect keywords in genetic sequences [32]. A qualitative comparison of words taking distinct values of intermittency is provided in figure 2, which shows the distribution of the words 'Carmylle' (f i = 54) and 'feel' (f i = 54) in the book 'Adventures of Sally', by Pelham Grenville Wodehouse.…”
Section: Intermittency (mentioning)
confidence: 99%
“…These properties are dependent on the parts of speech (POS) of the phrase constituents [2]. Few of the major linguistic patterns for a phrase in English are: [5]. He argues that the standard deviation of the distance between successive occurrences of a word is such a parameter to quantify this self-attraction.…”
Section: Some Linguistic Properties Of Keyphrases (mentioning)
confidence: 99%