Document Clustering based on Topic Maps

Rafi, Muhammad; Shaikh, M. Shahid; Farooq, Amir

doi:10.5120/1640-2204

Cited by 10 publications (13 citation statements)

References 13 publications (8 reference statements)

“…Increased cluster purity clearly establishes the fact that the features extracted from the three representations capture the semantics of the documents. The three approaches FIHC [6], CFWS [10] and TMHC [12] produced F-measure for the data sets (See Table IV). The proposed approach clearly had shown improvement in most of test cases. This is due to the fact that the multiple representations of documents in the collection capture the semantics in a better way, and are able to produce high FMeasure which is an indication of balance precision and recall (See Figure 6).…”

Section: Results (mentioning)

confidence: 99%
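
The statement above describes the F-measure as an indication of balanced precision and recall. As a minimal sketch of that idea, the function below computes the standard weighted harmonic mean of a single precision/recall pair. The name `f_measure` and the `beta` parameter are illustrative assumptions; the cited papers may aggregate the score over classes and clusters in a way this sketch does not cover.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1).

    Illustrative helper, not taken from the cited papers.
    """
    # Define the score as 0 when both inputs are 0 to avoid dividing by zero.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a high value requires precision and recall to be high together, which is why the score is read as a measure of balance.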

“…It also introduced a novel similarity measure based on common features of the two corresponding graphs of the documents. One more recent approach to capture semantic representation of documents in document representation model is introduced in [12] in which the authors proposed a topic maps based representation by using an online tool Wandora for extracting topics from a document. They also reported encouraging results for document clustering based on semantic notions.…”

Section: The Literature Review (mentioning)

confidence: 99%

“…The approach in [6] proposed a frequent item set-based representation of documents for clustering (FIHC), the second is from [10] from where we only compare with frequent word sequences (CFWS), and third and final is from [12] where authors used topic maps based representation of documents. We have implemented the proposed approaches as described in [6,10,12].…”

Section: Comparative Work (mentioning)

confidence: 99%

Exploiting Document Level Semantics in Document Clustering

Rafi, Sharif, Arshad et al. 2016, IJACSA (self-citation)

“…Ahmadi et al [14] proved that topic model based clustering methods generally achieve better results than only applying traditional clustering algorithms like the K-means. LDA has been used in many papers for representation and dimensionality reduction of text documents, as well as for uncovering semantic relations in the text [15]. Ma et al [16] used LDA for document representation and identification of the most significant topics, the K-means++ algorithm was used to define the initial centers of the clusters and the K-means algorithm was used to form the final clusters.…”

Section: Topic Modeling In Document Clustering (mentioning)

confidence: 99%

A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures

2020

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents, thus making the browsing and navigation process easier and quicker, and providing only relevant information to the users' queries by leveraging the structure relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to obtain insights in the leaf clusters' connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups and the results are promising.

“…An alternate approach was taken by Rafi et al [18] who introduced a new document representation model based on the compact topic maps that are present in a document.…”

Section: Cluster Labels (mentioning)

confidence: 99%

Document clustering with evolved search queries

Hirsch, Nuovo. 2017 IEEE Congress on Evolutionary Computation (CEC)

Search queries define a set of documents located in a collection and can be used to rank the documents by assigning each document a score according to their closeness to the query in the multidimensional space of weighted terms. In this paper, we describe a system whereby an island model genetic algorithm (GA) creates individuals which can generate a set of Apache Lucene search queries for the purpose of text document clustering. A cluster is specified by the documents returned by a single query in the set. Each document that is included in only one of the clusters adds to the fitness of the individual and each document that is included in more than one cluster will reduce the fitness. The method can be refined by using the ranking score of each document in the fitness test. The system has a number of advantages; in particular, the final search queries are easily understood and offer a simple explanation of the clusters, meaning that an extra cluster labelling stage is not required. We describe how the GA can be used to build queries and show results for clustering on various data sets and with different query sizes. Results are also compared with clusters built using the widely used k-means algorithm.
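
The fitness rule in the abstract above (documents returned by exactly one query raise the fitness, documents returned by several queries lower it) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `query_set_fitness` and the weights of +1 and -1 are assumptions, and the ranking-score refinement the abstract mentions is left out.

```python
from collections import Counter
from typing import Iterable, Set


def query_set_fitness(clusters: Iterable[Set[str]]) -> int:
    """Score one individual's set of query results (hypothetical +1/-1 weights).

    Each element of `clusters` is the set of document ids returned by one query.
    """
    # Count how many clusters each document id appears in.
    counts = Counter(doc for cluster in clusters for doc in cluster)
    unique = sum(1 for c in counts.values() if c == 1)       # in exactly one cluster
    overlapping = sum(1 for c in counts.values() if c > 1)   # in several clusters
    return unique - overlapping
```

For example, with the three result sets `{"d1", "d2"}`, `{"d2", "d3"}` and `{"d4"}`, three documents are unique and one is shared, so this sketch gives a fitness of 2.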