2010
DOI: 10.1093/comjnl/bxq069

A Self-enriching Methodology for Clustering Narrow Domain Short Texts

Abstract: Clustering narrow domain short texts is considered a complex task because of two intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlap associated with narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods when dealing with corpora of this kind. This methodology allows raw textual data to be enriched by adding co-related terms f…
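The idea of enriching short texts with co-related terms taken from the corpus itself can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that co-relatedness is scored by document-level pointwise mutual information (PMI), and the function names, the `min_pmi` threshold, and the `top_k` cutoff are hypothetical choices made only for this example.

```python
import math
from collections import Counter
from itertools import combinations


def build_related_terms(docs, min_pmi=0.0, top_k=2):
    """Map each term to at most top_k co-related terms.

    Assumption (not from the paper): two terms are co-related when their
    document-level PMI, computed on the corpus itself, exceeds min_pmi.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for tokens in docs:
        vocab = sorted(set(tokens))
        term_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    candidates = {}
    for (a, b), n_ab in pair_df.items():
        pmi = math.log2(n_ab * n_docs / (term_df[a] * term_df[b]))
        if pmi > min_pmi:
            candidates.setdefault(a, []).append((pmi, b))
            candidates.setdefault(b, []).append((pmi, a))

    # Keep the strongest associations first; break ties alphabetically.
    return {
        term: [w for _, w in sorted(scored, key=lambda x: (-x[0], x[1]))[:top_k]]
        for term, scored in candidates.items()
    }


def expand_docs(docs, related):
    """Append to each document the co-related terms of its own terms."""
    return [
        tokens + [w for t in tokens for w in related.get(t, [])]
        for tokens in docs
    ]


# Toy corpus of already tokenized short texts (hypothetical data).
docs = [
    ["short", "text", "cluster"],
    ["short", "text"],
    ["narrow", "domain"],
    ["narrow", "domain", "cluster"],
]
related = build_related_terms(docs)
expanded = expand_docs(docs, related)
```

No external resource is consulted here: both the association scores and the added terms come from the corpus being clustered. The expanded documents can then be passed to any clustering method in place of the raw texts.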

Cited by 30 publications (19 citation statements)
References: 44 publications
“…Our approach is composed of an expansion procedure which is an adaptation of the SelfTerm Expansion Methodology (S-TEM) [32], which is followed by the application of the Latent Dirichlet Allocation model (LDA) [9] that feeds into the prototype/topic based clustering process.…”
Section: Prototype/topic Based Clustering Methodology
mentioning
confidence: 99%
“…These are undesirable characteristics from a clustering perspective, as typically insufficient discriminative information is provided. In order to improve these particular characteristics of weblogs, we employ an enrichment method named the Self-Term Expansion Methodology [32] that does not use external resources, relying only on information included in the corpus itself. We demonstrate that the application of this methodology can improve the quality of topic clusters, and further that the improvement will be more significant where the corpus is composed of well-delimited categories which share a low percentage of vocabulary (i.e., a wide domain corpus).…”
mentioning
confidence: 99%
“…al. have used this technique to cluster documents of a corpus with narrow domain and short texts [20].…”
Section: Proposed Methods
mentioning
confidence: 99%
“…Although the idea of term expansion has been previously studied in literature (Banerjee & Pedersen, 2002) (Pinto, Rosso & Jimenez-Salazar, 2010) we are not aware of works in which it is applied to microblog texts.…”
Section: Clustering the Tweet Dataset
mentioning
confidence: 99%