Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification

Mavroeidis, Dimitrios; Tsatsaronis, George; Vazirgiannis, Michalis; Theobald, Martin; Weikum, Gerhard

doi:10.1007/11564126_21

Cited by 43 publications

(63 citation statements)

References 11 publications


“…Semantic-aware kernels have been proposed by Mavroeidis et al [4] who propose a generalized vector space model with WordNet senses and their hypernyms to improve text classification performance. Bloehdorn et al.…”

Section: Semantics In Text Mining and Information Retrieval (mentioning)

confidence: 99%

“…This latter definition of SR for a pair of terms is the definition of the Omiotis measure that we are using in our case.…”

Section: Semantic Relatedness and The Omiotis Measure (mentioning)

confidence: 99%

“…The application of Word Sense Disambiguation (WSD) techniques [2] during document preprocessing can be helpful; however, this is usually computationally expensive, and the performance of the unsupervised techniques is poor while use of supervised techniques requires large amounts of hand-annotated text documents. The use of external semantic knowledge provided by word thesauri or ontologies to adjust or "smooth" the BOW representation has shown much promise [3,4]. However, the embedding of semantic information is usually computationally expensive.…”

Section: Introduction (mentioning)

confidence: 99%


A Knowledge-Based Semantic Kernel for Text Classification

Nasir, Karim, Tsatsaronis et al. 2011

String Processing and Information Retrieval

Self Cite

Abstract. Typically, in textual document classification the documents are represented in the vector space using the "Bag of Words" (BOW) approach. Despite its ease of use, the BOW representation cannot handle word synonymy and polysemy problems and does not consider semantic relatedness between words. In this paper, we overcome the shortcomings of the BOW approach by embedding a known WordNet-based semantic relatedness measure for pairs of words, namely Omiotis, into a semantic kernel. The suggested measure incorporates the TF-IDF weighting scheme, thus creating a semantic kernel which combines both semantic and statistical information from text. Empirical evaluation with real data sets demonstrates that our approach successfully achieves improved classification accuracy with respect to the standard BOW representation, when Omiotis is embedded in four different classifiers.
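The idea in this abstract, combining TF-IDF weights with a term-to-term relatedness measure inside a kernel, can be illustrated with a generic bilinear form k(x, y) = xᵀ S y. The sketch below is a minimal illustration under stated assumptions, not the formulation used in the cited paper: the function names, the toy vocabulary, and the relatedness matrix `S` are all hypothetical, and the scores in `S` are made up rather than computed with Omiotis. For k to be a valid kernel, `S` is assumed to be symmetric positive semi-definite.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Return one TF-IDF vector (list of floats) per tokenized document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    # Smoothed IDF; this particular formula is an assumption of the sketch.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vectors

def semantic_kernel(x, y, S):
    """k(x, y) = x^T S y, where S[i][j] is the relatedness of terms i and j."""
    return sum(x[i] * S[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

vocab = ["car", "automobile", "banana"]
# Toy, made-up relatedness scores in [0, 1]; NOT actual Omiotis values.
S = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
docs = [["car"], ["automobile"], ["banana"]]
X = tfidf_vectors(docs, vocab)

bow_sim = sum(a * b for a, b in zip(X[0], X[1]))  # plain BOW dot product
sem_sim = semantic_kernel(X[0], X[1], S)          # relatedness-smoothed
unrelated = semantic_kernel(X[0], X[2], S)
```

With these toy inputs the plain BOW dot product between the "car" and "automobile" documents is zero because they share no term, while the smoothed kernel is positive because `S` links the two synonyms. The "banana" document stays at zero similarity under both.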


“…Siolas and d'Alché Buc (2000) pioneered the idea of semantic kernels for text categorization, capitalizing on WordNet (Miller, 1995) to propose continuous word kernels based on the inverse of the path lengths in the tree rather than the common delta word kernel used so far, i.e. exact matching between unigrams. Bloehdorn et al (2006) extended it later to other tree-based similarity measures from WordNet while Mavroeidis et al (2005) exploited its hierarchical structure to define a Generalized Vector Space Model kernel.…”

Section: Introduction (mentioning)

confidence: 99%

Convolutional Sentence Kernel from Word Embeddings for Short Text Categorization

Kim, Rousseau, Vazirgiannis 2015

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Self Cite
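The contrast drawn in the citation statement above, between a delta word kernel (exact matching) and a continuous word kernel based on inverse path length in a tree, can be sketched as follows. This is a rough illustration under stated assumptions: the `PARENT` hierarchy is a hypothetical toy tree rather than WordNet, the function names are invented for the example, and the `1 / (1 + path length)` form is one possible choice so that identical words score 1. It is not claimed to be the exact formula of any cited work, and such a similarity is not guaranteed to be positive semi-definite in general.

```python
# Toy is-a hierarchy (child -> parent); hypothetical, not taken from WordNet.
PARENT = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "entity",
    "banana": "fruit",
    "fruit": "entity",
}

def ancestors(word):
    """Return the chain [word, parent, grandparent, ..., root]."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_length(a, b):
    """Number of edges on the tree path between a and b."""
    depth_b = {node: depth for depth, node in enumerate(ancestors(b))}
    for depth_a, node in enumerate(ancestors(a)):
        if node in depth_b:  # first shared ancestor is the lowest one
            return depth_a + depth_b[node]
    raise ValueError("words are not in the same tree")

def delta_kernel(a, b):
    """Exact-match word kernel: 1 for identical words, 0 otherwise."""
    return 1.0 if a == b else 0.0

def inverse_path_kernel(a, b):
    """Continuous word similarity; the +1 smoothing is an assumption."""
    return 1.0 / (1.0 + path_length(a, b))
```

For example, `delta_kernel("car", "truck")` is 0 even though the two words are siblings in the toy tree, whereas `inverse_path_kernel` gives them a higher score than the more distant pair "car" and "banana".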

“…Whereas we regard document categorization by SVM [30,50,49,4,38,6] a particular implementation of machine learning, an increasingly successful solution to the classical problem of automatic classification, we also envisage information representation by vectors, a standard point of departure for TC by SVM, a limitation of the above attempt, and combine the former with semantic content representation in Hilbert space instead of Euclidean space. In this new approach, instead of term and document vectors, term and document functions are used to represent the semantic content of digital objects, with the advantage that functions, having more parameters than vectors, can host more semantic content in a comprehensive description than vector space based methods.…”

Section: Introduction (mentioning)

confidence: 99%

Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

2012


Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction developing further the methodological approach and demonstrating that text categorisation can be applied to analyse the thematic coverage in digital repositories.
