Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2006
DOI: 10.3115/1225797.1225804
Document representation and multilevel measures of document similarity

Abstract: We present our work on combining large-scale statistical approaches with local linguistic analysis and graph-based machine learning techniques to compute a combined measure of semantic similarity between terms and documents for application in information extraction, question answering, and summarisation.
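The abstract describes a combined similarity measure built from corpus-level statistics and graph-based machine learning. As a loose illustration only, and not the paper's actual method, the sketch below mixes a cosine similarity computed in an SVD-reduced document space with one step of graph-based propagation over the resulting document graph; the toy matrix, the rank k, and the mixing weight alpha are all assumptions.

# Illustrative sketch only: combines a corpus-level statistical similarity
# (cosine over an SVD-reduced term-document matrix) with one step of
# graph-based smoothing. The matrix, rank, and mixing weight are assumptions,
# not the paper's actual method.
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

# Corpus-level component: cosine similarity between documents in a
# low-rank (SVD-reduced) space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T            # documents in k-dim space
norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
S_stat = (doc_vecs / norms) @ (doc_vecs / norms).T

# Graph-based component: treat the similarities as edge weights and
# propagate similarity one step over the document graph (row-normalised walk).
W = np.clip(S_stat, 0.0, None)
P = W / W.sum(axis=1, keepdims=True)
S_graph = P @ S_stat

# Combined measure: a simple linear mixture of the two components.
alpha = 0.7
S_combined = alpha * S_stat + (1 - alpha) * S_graph
print(np.round(S_combined, 3))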

Cited by 16 publications (8 citation statements). References 8 publications.
“…This version of the algorithm is fast and requires the number of clusters as input. Experiments were performed to split the data into 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 clusters. The quality of the clusters was assessed with the silhouette measure proposed by Rousseeuw [3].…”
Section: Evaluation and Discussion
confidence: 99%
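As a rough illustration of the evaluation procedure quoted above, the sketch below clusters synthetic data into 5 through 50 clusters in steps of 5 and scores each clustering with the silhouette measure; it uses scikit-learn's KMeans rather than the citing paper's own algorithm, and the data is artificial.

# Sketch of the evaluation described above: split data into 5..50 clusters
# and score each clustering with the silhouette measure. Uses scikit-learn
# and synthetic data; the citing paper's own algorithm and data differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=10, n_features=20, random_state=0)

for k in range(5, 55, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k:2d}  silhouette={silhouette_score(X, labels):.3f}")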
“…The weights of the document indices are calculated by multiplying TF and IDF. However, computing the weight of all words across all documents leads to high computational complexity [29], which motivates considerable interest in low-dimensional document representations that overcome this particular issue [30]. The SETS algorithm addresses this problem by reducing the dimensions in the document-term matrix, but it still relies on TF-IDF values to measure a word's eliteness.…”
Section: Document Clustering With Spherical K-means
confidence: 99%
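A minimal sketch of the pipeline discussed in this excerpt, under stated assumptions: TF-IDF weighting, dimensionality reduction of the document-term matrix via truncated SVD, and spherical k-means approximated by running k-means on L2-normalised vectors. This is not the SETS algorithm; the documents and parameters are illustrative.

# Sketch of TF-IDF weighting plus dimensionality reduction, with spherical
# k-means approximated by k-means on L2-normalised vectors. This is not the
# SETS algorithm; documents and parameters are illustrative.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = [
    "document similarity with tf idf weights",
    "semantic similarity between terms and documents",
    "graph based machine learning for question answering",
    "summarisation and information extraction pipelines",
]

tfidf = TfidfVectorizer().fit_transform(docs)        # weight = TF * IDF
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
unit = normalize(reduced)                            # project onto the unit sphere
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit)
print(labels)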
“…For the traditional methods, Lyon et al. [9] proposed a tri-gram, set-theory-based algorithm, a fingerprint-based method that extracts the data fingerprint of each sentence, maps it into a range of values using a hash or MD5 function, and then reports similarity according to the overlap ratio of the hash values or the maximum common subsequence. Matveeva [10] and Hatzivassiloglou et al. [11] presented Vector Space Model (VSM) algorithms that compute similarity using the cosine measure between vectors.…”
Section: Related Work
confidence: 99%
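A minimal sketch of the Vector Space Model comparison mentioned here: documents become TF-IDF vectors and are compared with the cosine measure. The toy documents are assumptions, and scikit-learn stands in for whatever implementation the cited works used.

# Minimal VSM sketch: represent documents as TF-IDF vectors and compare
# them with the cosine measure. Uses scikit-learn; documents are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "vector space model for document similarity",
    "cosine similarity between tf idf document vectors",
    "hash based fingerprints for near duplicate detection",
]

vectors = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(vectors).round(3))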
“…Matveeva [10] and Hatzivassiloglou et al. [11] presented Vector Space Model (VSM) algorithms that compute similarity using the cosine measure between vectors. Yih [9] explored different scoring approaches, beyond the traditional TF-IDF weight, to study the term-weighting function. Broder [12] explored a shingle-based algorithm to define the containment of two documents and used the Jaccard coefficient [13] to represent their similarity.…”
Section: Related Work
confidence: 99%
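A small sketch in the spirit of the shingle-based approach attributed to Broder [12]: build w-word shingles per document, then compare the shingle sets with the Jaccard coefficient and a containment ratio. The shingle width and example texts are illustrative choices, not taken from the cited work.

# Shingle-based resemblance/containment sketch: w-word shingles per document,
# compared with the Jaccard coefficient and a containment ratio.
def shingles(text, w=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(a, b):
    return len(a & b) / len(a) if a else 0.0

s1 = shingles("the quick brown fox jumps over the lazy dog")
s2 = shingles("the quick brown fox leaps over a sleepy dog")
print(f"jaccard={jaccard(s1, s2):.3f}  containment={containment(s1, s2):.3f}")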
“…Very often, the positions of words are ignored when performing document clustering. Words, also known as indexing terms, and their weights in documents are usually used as important parameters to compute the similarity of documents [3]. Documents that contain similar indexing terms with similar frequencies are grouped into the same cluster.…”
Section: Introduction
confidence: 99%
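A toy sketch of the idea in this excerpt: documents represented as weighted indexing terms are grouped into the same cluster when their term-weight vectors are sufficiently similar under the cosine measure. The weights, the threshold, and the greedy grouping rule are illustrative assumptions, not taken from the cited paper.

# Toy sketch: documents as weighted indexing terms; two documents fall into
# the same group when the cosine of their term-weight vectors exceeds a
# threshold. The weights and threshold are illustrative.
import math

docs = {
    "d1": {"cluster": 0.8, "document": 0.5, "term": 0.3},
    "d2": {"cluster": 0.7, "document": 0.6, "weight": 0.2},
    "d3": {"graph": 0.9, "semantic": 0.4},
}

def cosine(u, v):
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Greedy grouping: attach each document to the first group it is close to.
groups, threshold = [], 0.5
for name, vec in docs.items():
    for g in groups:
        if cosine(vec, docs[g[0]]) >= threshold:
            g.append(name)
            break
    else:
        groups.append([name])
print(groups)   # d1 and d2 share indexing terms, d3 stands alone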