A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

Recupero, Diego Reforgiato

doi:10.1007/s10791-007-9035-7

Cited by 44 publications

(5 citation statements)

References 30 publications

(32 reference statements)


“…Recent work (Hotho et al, 2003; Sedding and Kazakov, 2004; Reforgiato Recupero, 2007) considers not only syntactic information, obtained from the terms present in a document, but also semantic relationships between terms. These approaches are mostly based on WordNet (Fellbaum, 1998), which is a lexical database that groups English words into sets of synonyms, called synsets.…”

Section: Related Work (mentioning)

confidence: 99%

“…Concerning the clustering algorithm several approaches are followed in the literature. In (Hotho et al, 2003; Sedding and Kazakov, 2004; Reforgiato Recupero, 2007) a variant of the K-means, the Bi-Section-K-means is used, stating that this method frequently outperforms the standard K-means. In (Boyack et al, 2011) a more complex partitioning of the document collection is proposed.…”

Section: Related Work (mentioning)

confidence: 99%
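The bisecting variant of K-means mentioned in the statement above can be illustrated with a minimal sketch. It is not the implementation used in any of the cited papers: the function names, the choice to split the cluster with the largest sum of squared errors, the number of trial splits, and the toy points are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, iters=20):
    """Plain 2-means, initialised from two sampled points."""
    c0, c1 = rng.sample(points, 2)
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break  # degenerate split; the caller discards it
        c0, c1 = mean(left), mean(right)
    return left, right


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist.

    Assumption: the split criterion (largest SSE) and the number of trial
    splits are illustrative choices, not taken from the cited papers.
    """
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=sse)
        if len(target) < 2:
            raise ValueError("cannot split a cluster with fewer than 2 points")
        clusters.remove(target)
        best = None
        for _ in range(trials):
            left, right = two_means(target, rng)
            if left and right:
                cost = sse(left) + sse(right)
                if best is None or cost < best[0]:
                    best = (cost, left, right)
        if best is None:
            raise ValueError("could not split cluster further")
        clusters.extend(best[1:])
    return clusters


if __name__ == "__main__":
    # Toy 2-D points standing in for document vectors (hypothetical data).
    docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for cluster in bisecting_kmeans(docs, 2):
        print(sorted(cluster))
```

In practice the points would be document vectors (for example term or synset weights) rather than 2-D tuples, and cosine similarity is a common substitute for Euclidean distance in that setting.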


Unsupervised Organisation of Scientific Documents

Lourenço, Medina, Fred et al. 2011

Proceedings of the International Conference on Knowledge Discovery and Information Retrieval


Unsupervised organisation of documents, and in particular research papers, into meaningful groups is a difficult problem. Using the typical vector-space-model representation (Bag-of-words paradigm), difficulties arise due to its intrinsic high dimensionality, high redundancy of features, and the lack of semantic information. In this work we propose a document representation relying on a statistical feature reduction step, and an enrichment phase based on the introduction of higher abstraction terms, designated as metaterms, derived from text, using as prior knowledge paper topics and keywords. The proposed representation, combined with a clustering ensemble approach, leads to a novel document organization strategy. We evaluate the proposed approach taking as application domain conference papers, topic information being extracted from conference topics or areas. Performance evaluation on data sets from NIPS and INSTICC conferences shows that the proposed approach leads to interesting and encouraging results.


“…The term expansion process consists of replacing terms of a document with a set of co-related terms. This procedure may be carried out in different ways, often by using an external knowledge resource which usually helps in obtaining successful results [46][47][48].…”

Section: Self-term Expansion (mentioning)

confidence: 99%
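The term expansion step described in the statement above, replacing each term with a set of co-related terms, can be sketched as follows. This is only an illustration: `expand_terms` and the `related` table are hypothetical names, and the table stands in for whatever knowledge resource (external, or built from the corpus itself) supplies the co-related terms.

```python
def expand_terms(tokens, related):
    """Replace each token with its set of co-related terms.

    `related` is a hypothetical mapping from a term to a list of co-related
    terms. Tokens with no entry are kept unchanged.
    """
    expanded = []
    for tok in tokens:
        expanded.extend(related.get(tok, [tok]))
    return expanded


if __name__ == "__main__":
    # Hand-made toy resource, for illustration only.
    related = {
        "car": ["car", "automobile", "vehicle"],
        "engine": ["engine", "motor"],
    }
    print(expand_terms(["car", "engine", "noise"], related))
```

Whether a term's own surface form is kept in its expansion is a design choice. Here it is kept by listing it explicitly in the table.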

A Self-enriching Methodology for Clustering Narrow Domain Short Texts

Pinto, Rosso, Jiménez-Salazar 2010

The Computer Journal


Clustering narrow domain short texts is considered to be a complex task because of the intrinsic features of the corpus to be clustered: (i) the low frequencies of vocabulary terms in short texts, and (ii) the high vocabulary overlapping associated to narrow domains. The aim of this paper is to introduce a self-term expansion methodology for improving the performance of clustering methods when dealing with corpora of this kind. This methodology allows raw textual data to be enriched by adding co-related terms from an automatically constructed lexical knowledge resource obtained from the same target data set (and not from an external resource). We also propose a set of supervised and unsupervised text assessment measures for evaluating different corpus features, such as shortness, stylometry and domain broadness. With the help of these measures, we may determine beforehand whether or not to use the methodology proposed in this paper. Finally, we integrate all these assessment measures in a freely available web-based system named Watermarking Corpora On-line System, which may be used by computer scientists in order to evaluate the different features associated with a given textual corpus.


“…To do so, we use the dual document representation (concepts and terms) to create a generative language model for each concept, which bridges the gap between vocabulary terms and concepts. Related work has also used textual representations to represent concepts, see e.g., [1,11], however, there are two important differences. First, we use statistical language modeling techniques to parametrize the concept models, by leveraging the dual representation of the documents.…”

Section: Introduction (mentioning)

confidence: 99%