A new technique for detecting similar documents based on term co-occurrence and conceptual property of the text

Zamanifar, Azadeh; Minaei-Bidgoli, Behrouz; Kashefi, Omid

doi:10.1109/icdim.2008.4746732

Cited by 5 publications

(7 citation statements)

References 15 publications

(14 reference statements)


“…In this case, the document is represented as a vector of words and the frequency of each word in each document is calculated. If words are chosen as terms, then every word in the vocabulary becomes an independent dimension in a very high dimensional vector space [4,12,13]. The similarity between two documents is determined by a similarity measure such as cosine similarity [4,13] between their corresponding vectors (since cosine has the nice property that it is 1.0 for identical vectors and 0.0 for orthogonal vectors).…”

Section: Related Work (mentioning)

confidence: 99%
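
As a minimal sketch of the vector space model described in the statement above, the following Python builds term-frequency vectors and compares them with cosine similarity. The whitespace tokenizer, the sample sentences, and the function names (`term_frequencies`, `cosine_similarity`) are illustrative assumptions, not taken from the cited paper.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercase whitespace-separated tokens (a simplifying assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


# Hypothetical sample documents.
doc_a = term_frequencies("the cat sat on the mat")
doc_b = term_frequencies("a dog ran in the park")
doc_c = term_frequencies("quantum physics lecture")

same = cosine_similarity(doc_a, doc_a)      # identical vectors
partial = cosine_similarity(doc_a, doc_b)   # one shared term ("the")
disjoint = cosine_similarity(doc_a, doc_c)  # no shared terms (orthogonal)
```

Storing each document as a `Counter` keeps only the non-zero dimensions, which avoids materializing one slot per vocabulary word.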

“…It is computationally inefficient due to the sparse sentence vector. The other drawback is that texts with similar meaning do not necessarily share many words [5,12].…”

Section: Related Work (mentioning)

confidence: 99%
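
Both drawbacks named in the statement above can be seen in a small sketch, assuming a toy vocabulary and hypothetical sentences: a sentence vector is mostly zeros once the vocabulary grows, and two sentences with similar meaning can share no words at all. The helper `to_vector` is an illustrative name, not part of the cited work.

```python
def to_vector(tokens, vocabulary):
    """Dense term-count vector with one dimension per vocabulary word."""
    return [tokens.count(word) for word in vocabulary]


# Hypothetical sentences: the first two are near-paraphrases.
sentence_a = "the physician treated the patient".split()
sentence_b = "a doctor cured an ill man".split()
# Unrelated text that only serves to enlarge the vocabulary.
corpus_extra = "markets fell sharply after rates rose again today".split()

vocabulary = sorted(set(sentence_a) | set(sentence_b) | set(corpus_extra))
vec_a = to_vector(sentence_a, vocabulary)
vec_b = to_vector(sentence_b, vocabulary)

# Sparsity: fraction of dimensions that are zero for sentence_a.
zero_fraction_a = vec_a.count(0) / len(vec_a)

# Vocabulary mismatch: no shared words, so the dot product is zero.
shared_words = set(sentence_a) & set(sentence_b)
dot_product = sum(x * y for x, y in zip(vec_a, vec_b))
```

With a realistic vocabulary of tens of thousands of words, the zero fraction approaches 1, which is the inefficiency the statement refers to.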

“…Another important word-based method is Latent Semantic Indexing (LSI) [14] that focuses on the cooccurrence of the term, but it is inefficient both in time and accuracy, because of the lack of considering other relations between terms [12,14]. This results in creating relations between terms which are not real.…”

Section: Related Work (mentioning)

confidence: 99%

“…This results in creating relations between terms which are not real. It also does not take into account the order of the words [12,14].…”

Section: Related Work (mentioning)

confidence: 99%


Optimizing Document Similarity Detection in Persian Information Retrieval

Kashefi¹, Mohseni², Minaei³

2010

JCIT


Most data on the Web is in the form of text or image. Finding desired data on the Web in a timely and cost-effective way is a problem of wide interest. In the last several years, many search engines have been created to help Web users find desired information. In this paper we present a new technique to eliminate the affixes and their effects on recognizing similar Persian documents. Reviewing affixes' rules and exceptions in Persian language, we extracted about 300 common inflectional suffixes and their combinations. We evaluate the effectiveness of eliminating the affixes from Persian texts on document similarity using four major document similarity approaches: Latent Semantic Indexing, Shingling, Vector Space Model, and Co-occurrence. Evaluation results demonstrate improvement in retrieval and detection of similar documents after eliminating affixes.


“…Another basic approach to computing document similarity is the co-occurrence method, which has already been applied to the Persian language in the work of Zamanifar (Zamanifar et al., 2008). This method has three main steps: topic identification, topic interpretation, and similarity measurement.…”

Section: Detection of Similar Documents in Persian IR (unclassified)