The aim of document clustering is to produce coherent clusters of similar documents. Clustering algorithms rely on text normalisation techniques to represent and cluster documents. Although most document clustering algorithms perform well in specific knowledge domains, processing cross-domain document repositories is still a challenge. This paper attempts to address this challenge. It investigates the performance of the sk-means clustering algorithm across domains, by comparing the cluster coherence produced with semantic-based and traditional (TF-IDF-based) document representations. The evaluation is conducted on 20 different generic sub-domains of a thousand documents, each randomly selected from the Reuters21578 corpus. The experimental results obtained from the evaluation demonstrate improved coherence of clusters produced by using a semantically enhanced text stemmer (SETS), when compared to the text normalisation obtained with the Porter stemmer. In addition, semantic-based text normalisation is shown to be resistant to noise, which is often introduced in the index aggregation stage, a stage that acquires features to represent documents.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.