A Parallel Hybrid Web Document Clustering Algorithm and its Performance Study

Xu, Shuting; Zhang, Jun

doi:10.1023/b:supe.0000040611.25862.d9

Cited by 31 publications

(17 citation statements)

References 17 publications


“…In agglomerative algorithms, each document is initially assigned to a different cluster. The algorithm then repeatedly merges pairs of clusters until a certain stopping criterion is met [51]. Conversely, divisive algorithms repeatedly divide the whole documents into a certain number of clusters, increasing the number of clusters at each step.…”

Section: Introduction (mentioning)

confidence: 99%

“…Hierarchical clustering algorithms [22,28,38,52] create a hierarchical decomposition of the given dataset which forms a dendrogram, a tree, by splitting the dataset recursively into smaller subsets, representing the documents in a multi-level structure [14,21]. The hierarchical algorithms can be further divided into either agglomerative or divisive algorithms [51]. In agglomerative algorithms, each document is initially assigned to a different cluster.…”

Section: Introduction (mentioning)

confidence: 99%
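
The agglomerative procedure described in the statements above (start with one cluster per document, then repeatedly merge pairs of clusters until a stopping criterion is met) can be sketched as follows. This is a minimal illustration, not the method of the cited papers: the single-link distance, the "stop at a target number of clusters" criterion, and the names `agglomerative`, `single_link`, and `docs` are all assumptions made for the example.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_link(c1, c2, points):
    """Distance between two clusters = distance of their closest members (assumed linkage)."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until only num_clusters remain (assumed stopping rule)."""
    # Each document starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]], points),
        )
        # combinations() guarantees a < b, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical 2-D document vectors, purely for illustration.
docs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
result = agglomerative(docs, 3)
```

A divisive algorithm runs in the opposite direction: it starts from one cluster holding every document and splits clusters step by step.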


Efficient stochastic algorithms for document clustering

Forsati, Mahdavi, Shamsfard, et al. 2013

Information Sciences

110


Clustering has become an increasingly important and highly complicated research area for targeting useful and relevant information in modern application domains such as the World Wide Web. Recent studies have shown that the most commonly used partitioning-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm may generate a local optimal clustering. In this paper, we present novel document clustering algorithms based on the Harmony Search (HS) optimization method. By modeling clustering as an optimization problem, we first propose a pure HS based clustering algorithm that finds near-optimal clusters within a reasonable time. Then, harmony clustering is integrated with the K-means algorithm in three ways to achieve better clustering by combining the explorative power of HS with the refining power of the K-means. Contrary to the localized searching property of K-means algorithm, the proposed algorithms perform a globalized search in the entire solution space. Additionally, the proposed algorithms improve K-means by making it less dependent on the initial parameters such as randomly chosen initial cluster centers, therefore, making it more stable. The behavior of the proposed algorithm is theoretically analyzed by modeling its population variance as a Markov chain. We also conduct an empirical study to determine the impacts of various parameters on the quality of clusters and convergence behavior of the algorithms. In the experiments, we apply the proposed algorithms along with K-means and a Genetic Algorithm (GA) based clustering algorithm on five different document datasets. Experimental results reveal that the proposed algorithms can find better clusters and the quality of clusters is comparable based on F-measure, Entropy, Purity, and Average Distance of Documents to the Cluster Centroid (ADDC).
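
The abstract's remark that K-means may settle in a local optimum, depending on the initial cluster centers, can be seen on a toy dataset. The sketch below is plain Lloyd-style K-means, not the Harmony Search algorithms of the paper; the function names, the four sample points, and both sets of initial centers are assumptions chosen to make the effect visible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(group):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(group) for col in zip(*group))


def kmeans(points, centers, iterations=20):
    """Run Lloyd iterations from the given initial centers; return (centers, SSE)."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


# Four points on the corners of a 4 x 1 rectangle (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Initial centers that split the rectangle lengthwise: K-means stays stuck there.
_, sse_poor = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

# Initial centers that split it across the short side: the better partition.
_, sse_good = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
```

Both runs converge, but to different partitions with different sums of squared errors, which is the sensitivity to initial centers that a global search strategy is meant to reduce.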

“…The k-means algorithm [10] (with its many variants) is a popular clustering method for text and web collections [17,18]. It gained its popularity due to its simplicity and intuition.…”

Section: Clustering (mentioning)

confidence: 99%

Enhancing clustering blog documents by utilizing author/reader comments

Zhang 2007

Proceedings of the 45th Annual Southeast Regional Conference

Self Cite


Blogs are a new form of internet phenomenon and a vast, ever-increasing information resource. Mining blog files for information is a very new research direction in data mining. Blog files are different from standard web files and may need specialized mining strategies. We propose to include the title, body, and comments of the blog pages in clustering datasets from blog documents. In particular, we argue that the author/reader comments of the blog pages may have more discriminating effect in clustering blog documents. We constructed a word-page matrix by downloading blog pages from a well-known website and experimented with a k-means clustering algorithm with different weights assigned to the title, body, and comment parts. Our experimental results show that assigning a larger weight value to the blog comments helps the k-means algorithm produce better clustering solutions. The experimental results confirm our hypothesis that the author/reader comments of the blog files are very useful in discriminating blog files.
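
One way to picture the weighting of title, body, and comment parts is to scale each part's term counts before summing them into a single column of the word-page matrix. The abstract does not give the exact scheme, so the sketch below is an assumption: the function `weighted_term_vector`, the whitespace tokenizer, and the weight values are all hypothetical.

```python
from collections import Counter


def weighted_term_vector(title, body, comments, weights):
    """Combine per-field term counts into one vector, scaling each field by its weight."""
    vector = Counter()
    fields = (("title", title), ("body", body), ("comments", comments))
    for name, text in fields:
        # Naive whitespace tokenization; a real pipeline would also strip
        # punctuation and stop words.
        for term, count in Counter(text.lower().split()).items():
            vector[term] += weights[name] * count
    return dict(vector)


# Hypothetical blog page and weights; comments count twice as much as other parts.
page = weighted_term_vector(
    title="python tips",
    body="tips for writing python code",
    comments="great python advice",
    weights={"title": 1.0, "body": 1.0, "comments": 2.0},
)
```

Vectors built this way for every page can then be passed to any k-means implementation; raising the `comments` weight makes comment vocabulary count for more in the distance computation.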


“…This idea was described in [31] and also resolves the indeterminacy in k-means but is fundamentally different from our proposed k-means steered PDDP variants.…”

Section: An Ordering Of Its Columns Then the Optimal Cut-point For 2 (mentioning)

confidence: 99%

Principal Direction Divisive Partitioning with Kernels and k-Means Steering

Zeimpekis, Gallopoulos 2008

Survey of Text Mining II


Summary. Clustering is a fundamental task in data mining. We propose, implement and evaluate several schemes that combine partitioning and hierarchical algorithms, specifically k-means and Principal Direction Divisive Partitioning (PDDP). Using available theory regarding the solution of the clustering indicator vector problem, we use 2-means to induce partitionings around fixed or varying cut-points. 2-means is applied either on the data or over its projection on a one-dimensional subspace. These techniques are also extended to the case of PDDP(l), a multiway clustering algorithm generalizing PDDP. To handle data that does not lend itself to linear separability, the algebraic framework is established for a kernel variant, KPDDP. Extensive experiments demonstrate the performance of the above methods and suggest that it is advantageous to steer PDDP using k-means. It is also shown that KPDDP can provide results of superior quality than kernel k-means.
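
A single basic PDDP split can be sketched as follows: center the data, find the leading principal direction, project every point onto it, and cut the projections at a fixed cut-point of zero. This is only a minimal sketch of the plain PDDP step under stated assumptions, not the k-means-steered or kernel variants the summary proposes; the power iteration, the all-ones start vector, and the names `principal_direction` and `pddp_split` are choices made for the example.

```python
def principal_direction(centered, iterations=100):
    """Leading principal direction of centered rows, via power iteration (assumed method)."""
    dim = len(centered[0])
    v = [1.0] * dim  # Assumed start vector; it must not be orthogonal to the answer.
    for _ in range(iterations):
        # w = (A^T A) v, computed row by row without forming the matrix.
        w = [0.0] * dim
        for row in centered:
            proj = sum(r * x for r, x in zip(row, v))
            for j in range(dim):
                w[j] += proj * row[j]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break  # All points coincide or v is orthogonal to the data.
        v = [x / norm for x in w]
    return v


def pddp_split(rows):
    """Split rows in two by the sign of their projection on the principal direction."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    v = principal_direction(centered)
    left, right = [], []
    for index, row in enumerate(centered):
        projection = sum(r * x for r, x in zip(row, v))
        # Fixed cut-point at zero, i.e. at the centroid along the principal direction.
        (left if projection <= 0 else right).append(index)
    return left, right


# Hypothetical 2-D data with two well-separated groups.
data = [(0.0, 0.0), (1.0, 0.1), (10.0, 0.0), (11.0, 0.1)]
left, right = pddp_split(data)
```

Applying the split recursively to the resulting parts gives the divisive hierarchy; replacing the fixed zero cut-point with one chosen by 2-means on the projections is the kind of steering the summary refers to.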
