Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Pinto, David; Benedí, José-Miguel; Rosso, Paolo

doi:10.1007/978-3-540-70939-8_54

Cited by 35 publications

(35 citation statements)

References 20 publications

(28 reference statements)

“…Thus, the EasyAbstracts corpus with scientific abstracts on well differentiated topics can be considered a medium complexity corpus but the CICLing-2002 corpus with narrow domain abstracts is a relatively high complexity corpus. This corpus, generated with abstracts of articles presented at the CICLing 2002 conference, is a well-known short-text corpus that has been recognized in different works [1,5,10,24,26,34,37] as a very difficult corpus. The next three small corpora are subsets of the well-known R8-Test corpus, a subcollection of the Reuters-21578 dataset.…”

Section: Short-text Corpora. Citation type: mentioning

confidence: 99%

An efficient Particle Swarm Optimization approach to cluster short texts

Cagnina

Errecalde

Ingaramo

et al. 2014

Information Sciences

Self Cite

Short texts such as evaluations of commercial products, news, FAQs and scientific abstracts are important resources on the Web due to the constant need of people to use this online information in real life. In this context, the clustering of short texts is a significant analysis task, and a discrete Particle Swarm Optimization (PSO) algorithm named CLUDIPSO has recently shown promising performance on this type of problem. CLUDIPSO obtained high-quality results with small corpora although, with larger corpora, a significant deterioration of performance was observed. This article presents CLUDIPSO⋆, an improved version of CLUDIPSO, which includes a different representation of particles, a more efficient evaluation of the function to be optimized and some modifications in the mutation operator. Experimental results with corpora containing scientific abstracts, news and short legal documents obtained from the Web show that CLUDIPSO⋆ is an effective clustering method for short-text corpora of small and medium size.

“…In this paper we consider the dissimilarity between two texts to be one minus the similarity between them. We used four different corpus-based similarity methods to create the matrix, namely the cosine similarity (CS) measure using tf-idf weights, Latent Semantic Analysis (LSA) using log(tf)-idf weights [6], the Short text Vector Space Model (SVSM) [7], and the Kullback-Leibler distance (KLD) [8]. These measures are used by each of the clustering methods.…”

Section: Clustering Methods. Citation type: mentioning

confidence: 99%
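
The "one minus similarity" construction in the statement above can be sketched in a few lines. This is a minimal illustration for the cosine/tf-idf case only, not the citing authors' implementation: the function names, the plain `log(n / df)` idf variant, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one tf-idf weight dictionary per tokenized document.

    Assumption: idf = log(n / df), so a term that occurs in every
    document receives weight 0.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def dissimilarity_matrix(docs):
    """Dissimilarity = 1 - similarity, for every pair of documents."""
    vecs = tfidf_vectors(docs)
    return [[1.0 - cosine(a, b) for b in vecs] for a in vecs]


# Hypothetical toy corpus, already tokenized.
docs = [
    ["kl", "distance", "clustering"],
    ["kl", "divergence", "clustering"],
    ["particle", "swarm", "optimization"],
]
matrix = dissimilarity_matrix(docs)
```

A matrix built this way can be passed to any clustering method that accepts precomputed pairwise dissimilarities. Note that documents sharing no terms get dissimilarity exactly 1 under this sketch.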

“…Kullback-Leibler Distance (KLD) : KLD is used in [8] to cluster narrow domain abstracts and is based on Kullback-Leibler (KL) divergence which is used to give a value to the difference between two distributions. For two distributions P and Q the KL divergence on a finite set X is shown in (2).…”

Section: Latent Semantic Analysis (LSA). Citation type: mentioning

confidence: 99%
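
For reference, the standard KL divergence of two distributions P and Q on a finite set X is D_KL(P || Q) = sum over x in X of P(x) log(P(x) / Q(x)). The sketch below computes it for term distributions of two short texts. Two choices here are assumptions made for illustration and are not taken from the cited works: additive smoothing with a small `epsilon` so that unseen terms do not produce a division by zero, and symmetrizing by summing both directions (other symmetric variants exist).

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-6):
    """Smoothed term distribution over a shared vocabulary.

    Assumption: additive smoothing with a small epsilon; `vocabulary`
    must contain every token of the document.
    """
    counts = Counter(tokens)
    total = len(tokens) + epsilon * len(vocabulary)
    return {t: (counts[t] + epsilon) / total for t in vocabulary}


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0.0)


def symmetric_kl(p, q):
    """One simple symmetric distance: D_KL(P || Q) + D_KL(Q || P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical toy documents, already tokenized.
doc_a = ["kl", "distance", "kl"]
doc_b = ["kl", "divergence"]
vocabulary = sorted(set(doc_a) | set(doc_b))

p = term_distribution(doc_a, vocabulary)
q = term_distribution(doc_b, vocabulary)
distance = symmetric_kl(p, q)
```

Because the plain divergence is asymmetric, `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, which is why a symmetric form is needed before the value can serve as a distance between documents.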

Clustering Short Text and Its Evaluation

Shrestha

Jacquin

Daille

2012

Computational Linguistics and Intelligent Text Processing

Recently there has been an increase in interest in clustering short text because it could be used in many NLP applications. Depending on the application, a variety of short texts can be defined, mainly in terms of their length (e.g. sentences, paragraphs) and type (e.g. scientific papers, newspapers). Finding a clustering method that is able to cluster short text in general is difficult. In this paper, we cluster 4 different corpora with different types of text of varying length and evaluate them against the gold standard. Based on these clustering experiments, we show how different similarity measures, clustering algorithms, and cluster evaluation methods affect the resulting clusters. We discuss four existing corpus-based similarity methods (Cosine similarity, Latent Semantic Analysis, Short text Vector Space Model, and Kullback-Leibler distance), four well-known clustering methods (Complete Link, Single Link, and Average Link hierarchical clustering, and Spectral clustering), and three evaluation methods (clustering F-measure, adjusted Rand Index, and V). Our experiments show that corpus-based similarity measures do not significantly affect the clusters and that the performance of spectral clustering is better than that of hierarchical clustering. We also show that the values given by the evaluation methods do not always represent the usability of the clusters.

“…With the exception of the CICLing-2002 collection, which has already been used in previous works [11], [12], [13], the remaining two corpora were artificially generated with the goal of obtaining corpora with different levels of complexity with respect to document length and vocabulary overlap. Our intention was that in each corpus the similarity measure would face a different level of complexity in detecting the conceptual proximity between documents.…”

Section: Data Sets. Citation type: mentioning

confidence: 99%

“…The CICLing-2002 corpus with relatively high complexity was also used in our work. This collection is considered to be harder to cluster than the previous corpora since its documents are narrow domain abstracts (see [13] for a more detailed description of the corpus).…”

Section: Data Sets. Citation type: mentioning

confidence: 99%

Proximity Estimation and Hardness of Short-Text Corpora

Errecalde

Ingaramo

Rosso

2008

2008 19th International Conference on Database and Expert Systems Applications

Self Cite

In this work, we investigate the relative hardness of short-text corpora in clustering problems and how this hardness relates to traditional similarity measures. Our approach attempts to establish a connection between the hardness of a corpus and the precision level exhibited by similarity measures, according to the results obtained with different cluster validity measures on the "ideal" clustering of each corpus. Moreover, we also propose a new validity measure, named contiguity error, that allowed us to observe this connection in a consistent way in all the collections considered.
