2008 Third International Conference on Digital Information Management 2008
DOI: 10.1109/icdim.2008.4746732

A new technique for detecting similar documents based on term co-occurrence and conceptual property of the text

Abstract: The importance of detecting similar documents grows rapidly as the amount of information increases exponentially. This paper presents a new technique for identifying similar documents. It combines statistical properties of documents with Persian linguistic features. The proposed technique is best suited to detecting similar documents in specific fields. The method is built on lexical chains of important words and on the term co-occurrence property of the text. It prevents the irrelevant documents …

Cited by 5 publications (7 citation statements)
References 15 publications (14 reference statements)
“…In this case, the document is represented as a vector of words and the frequency of each word in each document is calculated. If words are chosen as terms, then every word in the vocabulary becomes an independent dimension in a very high dimensional vector space [4,12,13]. The similarity between two documents is determined by a similarity measure such as cosine similarity [4,13] between their corresponding vectors (since cosine has the nice property that it is 1.0 for identical vectors and 0.0 for orthogonal vectors).…”
Section: Related Work (citation type: mentioning)
confidence: 99%
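
The vector space comparison described in the statement above can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term counts as vector weights; the function names `term_frequencies` and `cosine_similarity` are illustrative only and do not come from the cited paper, which may use different weighting or preprocessing.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Build a sparse term-frequency vector (assumption: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    shared_terms = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = term_frequencies("similar documents share terms")
doc_b = term_frequencies("unrelated text uses other words")
score_same = cosine_similarity(doc_a, doc_a)
score_disjoint = cosine_similarity(doc_a, doc_b)
```

Here `score_same` comes out at 1.0 (up to floating-point rounding) and `score_disjoint` at 0.0, since the two texts share no words. The second case also shows the limitation that citing papers raise: documents with related meaning but no common vocabulary score zero under this measure.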
“…It is computationally inefficient due to the sparse sentence vector. The other drawback is that texts with similar meaning do not necessarily share many words [5,12].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Another basic approach to computing document similarity is the co-occurrence method, which has already been applied to Persian in the work of Zamanifar (Zamanifar et al., 2008). This method has three main steps: topic identification, topic interpretation, and similarity measurement.…”
Section: Detection of Similar Documents in IR in Persian (citation type: unclassified)