2007
DOI: 10.1007/978-3-540-70939-8_54

Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Abstract: Clustering short-length texts is a difficult task in itself, but adding the narrow-domain characteristic poses an additional challenge for current clustering methods. We addressed this problem with the use of a new measure of distance between documents which is based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability distributions, we have adapted it in order to obtain a distance value between two documents. We have carried …

Citations: cited by 35 publications (35 citation statements)
References: 20 publications (28 reference statements)
“…Thus, the EasyAbstracts corpus, with scientific abstracts on well-differentiated topics, can be considered a medium-complexity corpus, but the CICLing-2002 corpus, with narrow-domain abstracts, is a relatively high-complexity corpus. This corpus, generated with abstracts of articles presented at the CICLing 2002 conference, is a well-known short-text corpus that has been recognized in different works [1,5,10,24,26,34,37] as a very difficult corpus. The next three small corpora are subsets of the well-known R8-Test corpus, a subcollection of the Reuters-21578 dataset.…”
Section: Short-text Corpora
Citation type: mentioning
confidence: 99%
“…In this paper we consider the dissimilarity between two texts to be one minus the similarity between them. We used four different corpus-based similarity methods to create the matrix, namely the cosine similarity (CS) measure using tf-idf weights, Latent Semantic Analysis (LSA) using log(tf)-idf weights [6], the Short text Vector Space Model (SVSM) [7], and the Kullback-Leibler distance (KLD) [8]. These measures are used by each of the clustering methods.…”
Section: Clustering Methods
Citation type: mentioning
confidence: 99%
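The "one minus the similarity" construction quoted above can be sketched for the cosine/tf-idf case as follows. This is a minimal illustration and not the citing authors' code: the function names, the raw-count tf, and the unsmoothed `log(N / df)` idf are all assumptions made here for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) from a list of tokenized documents.

    Assumption: tf is the raw term count and idf is log(N / df), unsmoothed.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        # A document whose terms all have idf 0 has no direction; treat as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def dissimilarity_matrix(docs):
    """Pairwise dissimilarity, defined as 1 - cosine similarity."""
    vectors = tfidf_vectors(docs)
    return [[1.0 - cosine_similarity(a, b) for b in vectors] for a in vectors]


docs = [
    ["kl", "distance", "text"],
    ["kl", "distance", "text"],
    ["reuters", "news"],
]
matrix = dissimilarity_matrix(docs)
```

The resulting square matrix is the kind of input a distance-based clustering method would consume; swapping in another similarity measure only changes `cosine_similarity`.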
“…Kullback-Leibler Distance (KLD): KLD is used in [8] to cluster narrow-domain abstracts and is based on the Kullback-Leibler (KL) divergence, which is used to give a value to the difference between two distributions. For two distributions P and Q, the KL divergence on a finite set X is shown in (2).…”
Section: Latent Semantic Analysis (LSA)
Citation type: mentioning
confidence: 99%
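The KL divergence mentioned in the statement above has the standard form D(P||Q) = Σ P(x) log(P(x)/Q(x)) over x in X, and a symmetric variant can be obtained by adding both directions. The sketch below applies this to two tokenized documents. The additive smoothing (the `epsilon` parameter) and all function names are assumptions introduced for this example so that no term has zero probability; they are not the smoothing scheme of the cited paper.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary, epsilon=1e-3):
    """Smoothed term distribution of one document over a shared vocabulary.

    Assumption: simple additive smoothing, so every term has probability > 0.
    """
    counts = Counter(tokens)
    total = len(tokens) + epsilon * len(vocabulary)
    return {term: (counts[term] + epsilon) / total for term in vocabulary}


def kl_divergence(p, q):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)


def symmetric_kl_distance(p, q):
    """Symmetric variant: D(P||Q) + D(Q||P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


doc_a = ["clustering", "short", "texts", "clustering"]
doc_b = ["short", "abstracts", "narrow", "domain"]
vocabulary = sorted(set(doc_a) | set(doc_b))

p = term_distribution(doc_a, vocabulary)
q = term_distribution(doc_b, vocabulary)
distance = symmetric_kl_distance(p, q)
```

Without smoothing, any term present in one document and absent from the other would make the divergence undefined, which is why short texts with little vocabulary overlap need some handling of zero probabilities.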
“…With the exception of the CICLing-2002 collection, which has already been used in previous works [11], [12], [13], the remaining two corpora were artificially generated with the goal of obtaining corpora with different levels of complexity with respect to the length of documents and vocabulary overlap. Our intention was that in each corpus the similarity measure faces a different level of complexity in detecting the conceptual proximity between documents.…”
Section: Data Sets
Citation type: mentioning
confidence: 99%
“…The CICLing-2002 corpus, with relatively high complexity, was also used in our work. This collection is considered to be harder to cluster than the previous corpora since its documents are narrow-domain abstracts (see [13] for a more detailed description of the corpus).…”
Section: Data Sets
Citation type: mentioning
confidence: 99%