Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009
DOI: 10.1145/1557019.1557066

Exploiting Wikipedia as external knowledge for document clustering

Abstract: In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, although the core words they use are probably synonyms or semantically associated in other forms. The most common way to solve this problem is to enrich document representation w…
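The failure mode the abstract describes is easy to reproduce. Below is a minimal, self-contained sketch (hypothetical example sentences, raw term counts, no stemming or stop-word removal) showing that two documents on the same topic can have zero bag-of-words cosine similarity when their core words are synonyms rather than shared terms.

```python
from collections import Counter
from math import sqrt

def bow_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two bag-of-words vectors built from raw term counts."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two documents about the same topic, phrased with synonymous core words.
doc1 = "the physician treated the cardiac patient"
doc2 = "a doctor cared for someone with heart disease"
print(bow_cosine(doc1, doc2))  # 0.0: no shared core words, so BoW sees them as unrelated
```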

Cited by 214 publications (118 citation statements)
References 12 publications
“…Hu et al. [12] built a document-concept matrix through exact match and relatedness match, which requires computing the tf-idf value of each term over the whole Wikipedia article collection. Gabrilovich and Markovitch [13], [14], [15] used machine learning techniques to map documents to the most relevant concepts in ODP or Wikipedia by comparing the textual overlap between each document and article.…”
Section: Related Work (mentioning, confidence: 99%)
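The exact-match/relatedness-match step described above hinges on tf-idf values of terms computed over the whole Wikipedia article collection. The sketch below is a loose illustration of that idea, not Hu et al.'s actual procedure: `wiki_articles` is a tiny made-up stand-in for a Wikipedia dump, and a term is simply mapped to the concept whose article gives it the highest tf-idf weight.

```python
import math
from collections import Counter

# Hypothetical stand-in for the Wikipedia article collection: concept title -> article text.
wiki_articles = {
    "Heart": "the heart pumps blood through the circulatory system",
    "Cardiology": "cardiology is the branch of medicine that deals with disorders of the heart",
    "Physician": "a physician is a health professional who practises medicine",
}

def tfidf(term: str, article_text: str, collection: dict) -> float:
    """tf-idf of a term in one article, with idf taken over the whole collection."""
    tokens = article_text.lower().split()
    tf = Counter(tokens)[term] / max(len(tokens), 1)
    df = sum(term in text.lower().split() for text in collection.values())
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

def best_concept(term: str, collection: dict):
    """Map a document term to the concept whose article weights it highest, if any."""
    scores = {title: tfidf(term, text, collection) for title, text in collection.items()}
    title = max(scores, key=scores.get)
    return title if scores[title] > 0 else None

print(best_concept("heart", wiki_articles))       # 'Heart'
print(best_concept("spacecraft", wiki_articles))  # None: term absent from the collection
```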
“…The concepts with the highest relatedness will be used to build the concept vector at the semantic level, i.e., each term will finally be mapped to its most related concept. Based on Rel(t_k, c_i | d_j) and the term's weight w(t_k, d_j), the concept's weight is defined as their weighted sum, as given in Eq. (12).…”
Section: Fig. 1 Overall Architecture Diagram (mentioning, confidence: 99%)
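One plausible reading of the weighted sum referred to as Eq. (12) is sketched below: each term contributes the product of its relatedness to its most related concept and its own term weight. The function name and the values in `term_weights` and `relatedness` are made up for illustration.

```python
from collections import defaultdict

def concept_weights(term_weights: dict, relatedness: dict) -> dict:
    """Build a concept vector for one document d_j: each term t_k adds
    Rel(t_k, c_i | d_j) * w(t_k, d_j) to its single most related concept c_i."""
    weights = defaultdict(float)
    for term, w in term_weights.items():
        candidates = relatedness.get(term)
        if not candidates:
            continue  # term has no mapped Wikipedia concept
        concept, rel = max(candidates.items(), key=lambda kv: kv[1])
        weights[concept] += rel * w
    return dict(weights)

# Illustrative, made-up values.
term_weights = {"physician": 0.7, "cardiac": 0.5}
relatedness = {
    "physician": {"Physician": 0.9, "Medicine": 0.4},
    "cardiac": {"Heart": 0.8, "Cardiology": 0.6},
}
print(concept_weights(term_weights, relatedness))
# {'Physician': 0.63, 'Heart': 0.4} (up to floating-point rounding)
```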
“…An effort to exploit Wikipedia in document clustering using Wikipedia concepts, redirects, and category information can be found in [5]. They enriched text documents and developed two approaches for mapping text documents to Wikipedia concepts.…”
Section: Introduction (mentioning, confidence: 99%)
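The redirect information mentioned in [5] is what resolves synonymous surface forms to a single concept. A minimal sketch of that enrichment idea follows, assuming a hypothetical `redirects` dictionary with single-word surface forms only; real Wikipedia redirects would also require phrase matching for multi-word anchors.

```python
# Hypothetical fragment of Wikipedia's redirect table: surface form -> canonical concept title.
redirects = {
    "mi": "Myocardial infarction",
    "doctor": "Physician",
    "cardiologist": "Cardiology",
}

def enrich_with_concepts(text: str, redirects: dict) -> list:
    """Append canonical concept titles to a document's tokens, so synonymous
    surface forms (resolved through redirects) end up sharing features."""
    tokens = text.lower().split()
    concepts = [redirects[tok] for tok in tokens if tok in redirects]
    return tokens + concepts

print(enrich_with_concepts("the doctor referred the patient to a cardiologist", redirects))
# [..., 'cardiologist', 'Physician', 'Cardiology']
```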
“…In terms of supervised text classification, the performance is determined by the accuracy of pre-classified training samples and the quality of the categorisation. The accuracy of classifiers determines their capability of differentiating the incoming stream of documents; the descriptive and discriminative capacity of categorisation reduces noise in classification, which is caused by sense ambiguities, sparsity, and high dimensionality of the documents [7]. Text classification performance is also affected by the topic coverage of categories.…”
Section: Introduction (mentioning, confidence: 99%)
“…Another world ontology commonly used in text classification is Wikipedia. Wang and Domeniconi [13] and Hu et al. [7] derived background knowledge from Wikipedia to represent documents and attempted to deal with the sparsity and high-dimensionality problems in text classification. Instead of Wikipedia, with its freely contributed entries, our work uses the superior LCSH ontology, which has been under continuous development by knowledge engineers for a hundred years.…”
Section: Introduction (mentioning, confidence: 99%)