Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2009
DOI: 10.1145/1557019.1557066

Exploiting Wikipedia as external knowledge for document clustering

Abstract: In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, although the core words they use are probably synonyms or semantically associated in other forms. The most common way to solve this problem is to enrich document representation w…
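The failure mode the abstract describes is easy to reproduce. Below is a minimal, self-contained sketch (hypothetical example sentences, raw term counts, no stemming or stop-word removal) showing that two documents on the same topic can have zero bag-of-words cosine similarity when their core words are synonyms rather than shared terms.

```python
from collections import Counter
from math import sqrt

def bow_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two bag-of-words vectors built from raw term counts."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two documents about the same topic, phrased with synonymous core words.
doc1 = "the physician treated the cardiac patient"
doc2 = "a doctor cared for someone with heart disease"
print(bow_cosine(doc1, doc2))  # 0.0: no shared core words, so BoW sees them as unrelated
```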

Cited by 214 publications (118 citation statements)
References 12 publications
“…Hu et al. [12] built a document-concept matrix through exact match and relatedness match, which requires computing the tf-idf value of each term over the whole Wikipedia article collection. Gabrilovich and Markovitch [13], [14], [15] used machine learning techniques to map documents to the most relevant concepts in ODP or Wikipedia by comparing the textual overlap between each document and article.…”
Section: Related Work (mentioning, confidence: 99%)
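The exact-match/relatedness-match step described above hinges on tf-idf values of terms computed over the whole Wikipedia article collection. The sketch below is a loose illustration of that idea, not Hu et al.'s actual procedure: `wiki_articles` is a tiny made-up stand-in for a Wikipedia dump, and a term is simply mapped to the concept whose article gives it the highest tf-idf weight.

```python
import math
from collections import Counter

# Hypothetical stand-in for the Wikipedia article collection: concept title -> article text.
wiki_articles = {
    "Heart": "the heart pumps blood through the circulatory system",
    "Cardiology": "cardiology is the branch of medicine that deals with disorders of the heart",
    "Physician": "a physician is a health professional who practises medicine",
}

def tfidf(term: str, article_text: str, collection: dict) -> float:
    """tf-idf of a term in one article, with idf taken over the whole collection."""
    tokens = article_text.lower().split()
    tf = Counter(tokens)[term] / max(len(tokens), 1)
    df = sum(term in text.lower().split() for text in collection.values())
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

def best_concept(term: str, collection: dict):
    """Map a document term to the concept whose article weights it highest, if any."""
    scores = {title: tfidf(term, text, collection) for title, text in collection.items()}
    title = max(scores, key=scores.get)
    return title if scores[title] > 0 else None

print(best_concept("heart", wiki_articles))       # 'Heart'
print(best_concept("spacecraft", wiki_articles))  # None: term absent from the collection
```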
“…The concepts with the highest relatedness will be used to build the concept vector at the semantic level, i.e., each term will finally be mapped to its most related concept. Based on Rel(t_k, c_i | d_j) and the term's weight w(t_k, d_j), the concept's weight is defined as their weighted sum, as given in Eq. (12).…”
Section: Fig. 1 Overall Architecture Diagram (mentioning, confidence: 99%)
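One plausible reading of the weighted sum referred to as Eq. (12) is sketched below: each term contributes the product of its relatedness to its most related concept and its own term weight. The function name and the values in `term_weights` and `relatedness` are made up for illustration.

```python
from collections import defaultdict

def concept_weights(term_weights: dict, relatedness: dict) -> dict:
    """Build a concept vector for one document d_j: each term t_k adds
    Rel(t_k, c_i | d_j) * w(t_k, d_j) to its single most related concept c_i."""
    weights = defaultdict(float)
    for term, w in term_weights.items():
        candidates = relatedness.get(term)
        if not candidates:
            continue  # term has no mapped Wikipedia concept
        concept, rel = max(candidates.items(), key=lambda kv: kv[1])
        weights[concept] += rel * w
    return dict(weights)

# Illustrative, made-up values.
term_weights = {"physician": 0.7, "cardiac": 0.5}
relatedness = {
    "physician": {"Physician": 0.9, "Medicine": 0.4},
    "cardiac": {"Heart": 0.8, "Cardiology": 0.6},
}
print(concept_weights(term_weights, relatedness))
# {'Physician': 0.63, 'Heart': 0.4} (up to floating-point rounding)
```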
“…An effort to exploit Wikipedia in document clustering using Wikipedia concepts, redirects, and category information can be found in [5]. They enriched text documents and developed two approaches for mapping text documents to Wikipedia concepts.…”
Section: Introduction (mentioning, confidence: 99%)
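The redirect information mentioned in [5] is what resolves synonymous surface forms to a single concept. A minimal sketch of that enrichment idea follows, assuming a hypothetical `redirects` dictionary with single-word surface forms only; real Wikipedia redirects would also require phrase matching for multi-word anchors.

```python
# Hypothetical fragment of Wikipedia's redirect table: surface form -> canonical concept title.
redirects = {
    "mi": "Myocardial infarction",
    "doctor": "Physician",
    "cardiologist": "Cardiology",
}

def enrich_with_concepts(text: str, redirects: dict) -> list:
    """Append canonical concept titles to a document's tokens, so synonymous
    surface forms (resolved through redirects) end up sharing features."""
    tokens = text.lower().split()
    concepts = [redirects[tok] for tok in tokens if tok in redirects]
    return tokens + concepts

print(enrich_with_concepts("the doctor referred the patient to a cardiologist", redirects))
# [..., 'cardiologist', 'Physician', 'Cardiology']
```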
“…In terms of supervised text classification, the performance is determined by the accuracy of pre-classified training samples and the quality of the categorisation. The accuracy of classifiers determines their capability of differentiating the incoming stream of documents; the descriptive and discriminative capacity of categorisation reduces noise in classification, which is caused by sense ambiguities, sparsity, and high dimensionality of the documents [7]. Text classification performance is also affected by the topic coverage of categories.…”
Section: Introduction (mentioning, confidence: 99%)
“…Another world ontology commonly used in text classification is Wikipedia. Wang and Domeniconi [13] and Hu et al. [7] derived background knowledge from Wikipedia to represent documents and attempted to deal with the sparsity and high-dimensionality problems in text classification. Instead of Wikipedia, with its freely contributed entries, our work uses the superior LCSH ontology, which has been under continuous development by knowledge engineers for a hundred years.…”
Section: Introduction (mentioning, confidence: 99%)