2006
DOI: 10.1145/1138379.1138380

Automatic expansion of domain-specific lexicons by term categorization

Abstract: We discuss an approach to the automatic expansion of domain-specific lexicons, that is, to the problem of extending, for each category c_i in a predefined set […]. Our approach relies on term categorization, defined as the task of labeling previously unlabeled terms according to a predefined set of domains. We approach this as a supervised learning problem in which term classifiers are built using the initial lexicons as training data. Dually to classic text categorization tasks, in which documents are represented as vectors, […]

Cited by 13 publications (18 citation statements)
References 32 publications
“…We chose to test the system using the dataset described in [1] referred in the following as DS. It is composed by a set of 27048 nouns assigned to one or more classes out of 42 different categories.…”
Section: Results (citation type: mentioning)
confidence: 99%
“…In [1], the authors approach the term categorization problem as the dual of text categorization. They validated the proposed model attempting to automatically replicate the WordNetDomains [2] lexicon (an extension to WordNet in which the synsets have been categorized into a subset of the DDC 1 scheme) by exploiting the Reuters Corpus.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…The resultant feature vectors are then used by a centroid-based classifier using cosine similarity measure to label the words. Avancini, Lavelli, Sebastiani, and Zanoli (2006) take a classification approach to semantic lexicon construction. They cast the problem as a term (meaning both words and phrases) categorization task (dual of the document categorization task), and similar to the bag-of-word model, they represent the terms as bag-of-documents.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
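The excerpt above describes the core idea: terms are represented as bags of documents (the dual of the bag-of-words document representation), and a centroid-based classifier with cosine similarity assigns each unlabeled term to a domain. A minimal sketch of that scheme follows; the toy corpus, the seed lexicons, and all names are invented for illustration and are not taken from the paper:

```python
import math

# Toy corpus: each document is a list of terms. In the bag-of-documents
# representation, a term's vector has one dimension per document.
docs = [
    ["stock", "market", "bond"],
    ["stock", "trade", "profit"],
    ["cell", "protein", "gene"],
    ["gene", "protein", "market"],
]

# Initial domain lexicons used as training data (hypothetical seeds).
seed_lexicons = {"finance": {"stock", "bond"}, "biology": {"cell", "gene"}}

def term_vector(term):
    """Bag-of-documents vector: dimension j = count of `term` in doc j."""
    return [doc.count(term) for doc in docs]

def cosine(u, v):
    """Cosine similarity between two term/centroid vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# One centroid per domain: the mean of its seed terms' vectors.
centroids = {}
for domain, terms in seed_lexicons.items():
    vecs = [term_vector(t) for t in terms]
    centroids[domain] = [sum(col) / len(vecs) for col in zip(*vecs)]

def categorize(term):
    """Label an unlabeled term with the most cosine-similar domain centroid."""
    return max(centroids, key=lambda d: cosine(term_vector(term), centroids[d]))

print(categorize("trade"))    # co-occurs with finance seeds -> finance
print(categorize("protein"))  # co-occurs with biology seeds -> biology
```

Here "trade" is pulled toward the finance centroid because it appears in the same documents as the finance seed terms, which is the intuition behind treating documents, rather than words, as the feature space for terms.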
“…To apply machine learning (ML) to one of the standard DL circulation activities, namely text categorization [48], is part of the cognitive toolbox deployed [18]. In this context, ML is extensively being experimented with in different development areas and scenarios; to name but a few, for extracting image content from figures in scientific documents for categorization [33,34], automatically assessing and characterizing resource quality for educational DL [54,5], assessing the quality of scientific conferences [37], web-based collection development [42], automated document metadata extraction by support vector machines (SVM, [24]), automatic extraction of titles from general documents [27], information architecture [17], to remove duplicate documents [9], for collaborative filtering [59], for the automatic expansion of domain-specific lexicons by term categorization [3], for generating visual thesauri [45], or the semantic markup of documents [13]. As part of this direction of research, ML is being tested for its ability to reproduce parts of collections indexed by widespread classification schemes in a supervised learning setting, such as automatic text categorization using the Dewey Decimal Classification (DDC, [52]), or the Library of Congress Classification (LCC) from Library of Congress Subject Headings (LCSH, [20,43]).…”
Section: Introduction (citation type: mentioning)
confidence: 99%