Document Clustering into an Unknown Number of Clusters Using a Genetic Algorithm

Casillas, Arantza; Lena, Mayte T. González de; Martínez, Rosa

doi:10.1007/978-3-540-39398-6_7

Cited by 43 publications

(35 citation statements)

References 6 publications

“…Other work has taken into consideration the use of genetic algorithm for cluster analysis of documents. Casillas et al (2003) presented a genetic algorithm that clusters documents into unidentified quantity of clusters. Premalatha and Natarajan (2009) proposed a method for document clustering based on genetic algorithm with Simultaneous mutation operator and ranked mutation rate.…”

Section: Related Work (mentioning)

confidence: 99%

Clustering Tweets Using Cellular Genetic Algorithm

Adel

2014

Journal of Computer Science

As the popularity of Twitter continues to increase rapidly, it is necessary to analyze the huge amount of data that Twitter users generate. A popular method of tweet analysis is clustering. Because most tweets are textual, this study focuses on clustering tweets based on their textual content similarity. This study presents tweet clustering using a cellular genetic algorithm (cGA). The results obtained by cGA are compared with those obtained by a generational genetic algorithm in terms of average fitness, average execution time and number of generations. The experiments use two sets: one of 1000 tweets and one of 5000 tweets. The results show nearly equal performance for both algorithms in terms of the average fitness of the solution. On the other hand, cGA runs much faster than the generational algorithm. These results demonstrate that the cellular genetic algorithm outperforms the generational genetic algorithm in tweet clustering.

“…This criterion showed the best performance in the experiments by Milligan and Cooper (1985), and was subsequently utilized by some authors for choosing the number of clusters (for example, Casillas et al 2003).…”

Section: Variance Based Approach (mentioning)

confidence: 99%

“…For example, Casillas et al (2003) utilize the Minimum spanning tree which is split into a number of clusters with a genetic algorithm to meet an arbitrary stopping condition. Six different agglomerative algorithms are applied to the same data by Chae et al (2006), and the number of clusters at which these partitions are most similar is selected.…”

Section: Hierarchical Clustering Approaches (mentioning)

confidence: 99%

“…Sizes. First of all, the quantitative parameters of the generated data and cluster structure are specified: the number of entities N, the number of generated clusters K*, and the number of variables M. In most publications, these are kept relatively small: N ranges from about 50 to 200, M is in many cases 2 and, anyway, not greater than 10, and K* is of the order of 3, 4 or 5 (see, for example, Casillas et al 2003, Chae et al 2006, Hand and Krzhanowski 2005, Hardy 1996, Kuncheva and Petrov 2005, McLachlan and Khan 2004, Milligan and Cooper 1985). Larger sizes appear in Feng and Hamerly (2006) (N = 4000, M is up to 16 and K* = 20) and Steinley and Brusco (2007) (N is up to 5000, M = 25, 50 and 125, and K* = 5, 10, 20).…”

Section: Data and Cluster Structure Parameters (mentioning)

confidence: 99%

Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads

2010

The issue of determining "the right number of clusters" in K-Means has attracted considerable interest, especially in recent years. Cluster overlap appears to be the factor most affecting the clustering results. This paper proposes an experimental setting for comparison of different approaches on data generated from Gaussian clusters with controlled parameters of between- and within-cluster spread to model different cluster overlaps. The setting allows for evaluating the centroid recovery on par with conventional evaluation of the cluster recovery. The subjects of our interest are two versions of the "intelligent" K-Means method, iK-Means, that find the right number of clusters by extracting "anomalous patterns" from the data one by one. We compare them with seven other methods, including Hartigan's rule, averaged Silhouette width and Gap statistic, under six different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments, such as that the right K is reproduced best by Hartigan's rule, but not clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experiment setting.

“…The worthiness of Genetic Algorithm based clustering has been realized in various application scenarios like production simulation [41], microarray data analysis [42], clustering small regions in colors feature space [52], image compression problem [34], document clustering [8], text clustering [55], mobile ad hoc networks [50] and gene ontology [44] etc.…”

Section: Clustering Algorithm (mentioning)

confidence: 99%