A Self-enriching Methodology for Clustering Narrow Domain Short Texts

Pinto, David; Rosso, Paolo; Jiménez-Salazar, Héctor

doi:10.1093/comjnl/bxq069

Cited by 30 publications

(19 citation statements)

References 44 publications

“…Our approach is composed of an expansion procedure which is an adaptation of the Self-Term Expansion Methodology (S-TEM) [32], which is followed by the application of the Latent Dirichlet Allocation model (LDA) [9] that feeds into the prototype/topic based clustering process.…”

Section: Prototype/topic Based Clustering Methodology

mentioning

confidence: 99%

“…These are undesirable characteristics from a clustering perspective, as typically insufficient discriminative information is provided. In order to improve these particular characteristics of weblogs, we employ an enrichment method named the Self-Term Expansion Methodology [32] that does not use external resources, relying only on information included in the corpus itself. We demonstrate that the application of this methodology can improve the quality of topic clusters, and further that the improvement will be more significant where the corpus is composed of well-delimited categories which share a low percentage of vocabulary (i.e., a wide domain corpus).…”

mentioning

confidence: 99%

“…In this work, we empirically established a value greater than 2 to be the best threshold. In other experiments we have conducted using more formal texts [32], a threshold of 6 was used;…”

mentioning

confidence: 99%

Prototype/topic based clustering method for weblogs

Perez-Tellez, Cardiff, Rosso et al. 2016, IDA

Abstract. In the last 10 years, the information generated on weblog sites has increased exponentially, resulting in a clear need for intelligent approaches to analyse and organise this massive amount of information. In this work, we present a methodology to cluster weblog posts according to the topics discussed therein, which we derive by text analysis. We have called the methodology Prototype/Topic Based Clustering, an approach which is based on a generative probabilistic model in conjunction with a Self-Term Expansion methodology. The usage of the Self-Term Expansion methodology is to improve the representation of the data and the generative probabilistic model is employed to identify relevant topics discussed in the weblogs. We have modified the generative probabilistic model in order to exploit predefined initialisations of the model and have performed our experiments in narrow and wide domain subsets. The results of our approach have demonstrated a considerable improvement over the pre-defined baseline and alternative state of the art approaches, achieving an improvement of up to 20% in many cases. The experiments were performed on both narrow and wide domain datasets, with the latter showing better improvement. However in both cases, our results outperformed the baseline and state of the art algorithms.

“…al. have used this technique to cluster documents of a corpus with narrow domain and short texts [20].…”

Section: Proposed Methods

mentioning

confidence: 99%

Instance Selection in Text Classification Using the Silhouette Coefficient Measure

Dey, Solorio, Gómez 2011, Advances in Artificial Intelligence

Abstract. The paper proposes the use of the Silhouette Coefficient (SC) as a ranking measure to perform instance selection in text classification. Our selection criterion was to keep instances with mid-range SC values while removing the instances with high and low SC values. We evaluated our hypothesis across three well-known datasets and various machine learning algorithms. The results show that our method helps to achieve the best trade-off between classification accuracy and training time.
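
The selection criterion described in this abstract can be illustrated with a minimal Python sketch. It assumes Euclidean distance, the standard silhouette definition s = (b - a) / max(a, b), and hypothetical cut-offs `low` and `high`; the abstract does not state the distance measure or thresholds the authors used, and the function names are illustrative only.

```python
from math import dist


def silhouette_values(points, labels):
    """Return the silhouette coefficient of every point (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            # Common convention: singleton clusters (or a single cluster) score 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


def select_mid_range(points, labels, low, high):
    """Keep the indices whose silhouette value lies in [low, high].

    `low` and `high` are assumed, user-chosen cut-offs, not values from the paper.
    """
    scores = silhouette_values(points, labels)
    return [i for i, s in enumerate(scores) if low <= s <= high]
```

Calling `select_mid_range(points, labels, low, high)` with class labels as cluster labels returns the indices of the training instances to keep; instances scoring above `high` or below `low` are dropped.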

“…Although the idea of term expansion has been previously studied in literature (Banerjee & Pedersen, 2002) (Pinto, Rosso & Jimenez-Salazar, 2010) we are not aware of works in which it is applied to microblog texts.…”

Section: Clustering the Tweet Dataset

mentioning

confidence: 99%

Disambiguating company names in microblog text using clustering for online reputation management

Pérez-Téllez, Cardiff, Rosso et al. 2015, Rev. signos

Abstract. Twitter is used by millions of users to publish brief messages (tweets) with the purpose of sharing experiences and/or opinions about a product or service. There is a clear need for systems that can mine these messages in order to derive information about the collective thinking of twitterers (e.g. for opinion or sentiment analysis). Tweet analysis is a very important task because comments, opinions, suggestions, complaints etc. can be used for marketing strategies or for determining information on a company's reputation. For this purpose, it is necessary to automatically establish whether a tweet refers to a company or not, when the company name is ambiguous. This task is not a straightforward keyword search process as there may be multiple contexts in which a name can be used. The aim of this study is to present and compare four different approaches which improve the representation of short texts for better performance of the clustering task that determines whether a given tweet refers to a particular company or not. For this purpose, we have used a variety of enriching methodologies based on term expansion via the semantic similarity hidden behind the lexical structure, in order to improve the representation of tweets and as a consequence the performance of the task. We have used two different tweet datasets of company names which contain different levels of ambiguity. The results are promising although they highlight the difficulty of this task.

Key Words: Clustering of tweets, opinion analysis, disambiguation, online reputation management.

Introduction. Twitter, the microblog platform that allows users to publish brief messages of less than 140 characters, is a Web 2.0 application which offers a new mode of user interaction. It has become an important channel through which users can share their experiences or opinions about a product, service or company, and companies are taking advantage of this medium as part of their marketing strategies. It has been estimated by Complete that the use of Twitter increased drastically from 2009 to 2012, reaching up to 45 million unique visitors; however, the increase in 2012 was not as significant as in previous years. In 2012 and 2013 Twitter has been ...

Resumen. Twitter is used by millions of people to publish short messages in order to share experiences and/or opinions about a particular product or service. There is a clear need to create systems capable of analysing these messages in order to derive information about the collective thinking of the people who publish them. The analysis of tweets has become a very important task for large companies, because comments, suggestions and complaints can be used as marketing strategies or to determine the reputation of a given company. Among other tasks, it is necessary to build methods that make it possible to determine, automatically, whether a tweet refers to a company or not, in the case where the name of the co...
