This paper gives a full presentation of a new word sense disambiguation method introduced in Hristea and Popescu (Fundam Inform 91(3-4):547-562, 2009) and so far tested in the case of adjectives (Hristea and Popescu in Fundam Inform 91(3-4):547-562, 2009) and verbs (Hristea in Int Rev Comput Softw 4(1):58-67, 2009). We extend the method to the case of nouns and draw conclusions regarding its performance across all these parts of speech. The method lies at the border between unsupervised and knowledge-based techniques. It performs unsupervised word sense disambiguation based on an underlying Naïve Bayes model, while using WordNet as a knowledge source for feature selection. The performance of the method is compared to that of previous approaches that rely on completely different feature sets. Test results for all the parts of speech involved show that feature selection using a knowledge source such as WordNet is more effective in disambiguation than local-type features (such as part-of-speech tags).
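The paper itself specifies how the disambiguation vocabulary is built; as a rough illustration only, the sketch below (not the authors' implementation; the choice of WordNet relations and the use of NLTK are assumptions) gathers candidate features for a target word from the glosses, usage examples, and synonyms of its WordNet senses and their immediate neighbors, which is the general flavor of WordNet-based feature selection.

```python
# Hypothetical sketch of WordNet-based feature selection for a target word.
# Assumes NLTK with the WordNet corpus installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def wordnet_features(target, pos=wn.NOUN):
    """Collect content words related to each WordNet sense of `target`."""
    features = set()
    for synset in wn.synsets(target, pos=pos):
        # Words occurring in the gloss (definition) of the sense.
        features.update(w.lower() for w in synset.definition().split())
        # Words from the usage examples attached to the sense.
        for example in synset.examples():
            features.update(w.lower() for w in example.split())
        # Synonyms of the sense, plus lemmas of neighboring synsets.
        for related in [synset] + synset.hypernyms() + synset.hyponyms():
            features.update(l.lower().replace('_', ' ')
                            for l in related.lemma_names())
    features.discard(target)
    return features

print(sorted(wordnet_features('line'))[:20])
```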
WordNet, a lexical database for English that is extensively used by computational linguists, has not previously distinguished hyponyms that are classes from hyponyms that are instances. This note describes an attempt to draw that distinction and proposes a simple way to incorporate the results into future versions of WordNet.
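The class/instance distinction proposed in this note was in fact adopted by later WordNet releases and is exposed through NLTK's WordNet interface; a minimal illustration (assuming NLTK with the WordNet data installed; the exact synsets returned depend on the WordNet version):

```python
# Instance hyponymy as exposed in modern WordNet via NLTK.
from nltk.corpus import wordnet as wn

paris = wn.synset('paris.n.01')    # Paris, the French capital
# 'Paris' is an *instance of* a class, not a subclass of it.
print(paris.instance_hypernyms())  # e.g. [Synset('national_capital.n.01')]

capital = wn.synset('national_capital.n.01')
print(capital.instance_hyponyms()[:5])  # individual capitals: instances
```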
In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widely used approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a brute-force WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. In the second step, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. The resulting configurations are ranked by length, and the sense of each word is chosen by a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other state-of-the-art unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large margin. We also show that our algorithm can yield better performance than the Most Common Sense (MCS) baseline on one data set. Moreover, our algorithm has a very small number of parameters, is robust to parameter tuning, and, unlike other bio-inspired methods, gives a deterministic solution (it involves no random choices).
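As a rough sketch of the second and third steps (assembly by suffix/prefix matching and top-k voting), the code below uses illustrative names and leaves the window-level brute-force scorer abstract; it is a toy rendering of the idea, not the authors' implementation.

```python
# Illustrative sketch of the assembly and voting stages of Shotgun-style WSD.
# A "configuration" is a list of (word_position, sense_id) pairs covering a
# contiguous span; window-level configurations are assumed to come from some
# brute-force scorer (not shown).
from collections import Counter

def merge(left, right, min_overlap=1):
    """Join two configurations when a suffix of `left` equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None  # no overlap: configurations cannot be assembled

def vote(configurations, top_k=5):
    """Choose each word's sense by majority vote over the longest configurations."""
    ranked = sorted(configurations, key=len, reverse=True)[:top_k]
    votes = {}
    for config in ranked:
        for position, sense in config:
            votes.setdefault(position, Counter())[sense] += 1
    return {pos: counter.most_common(1)[0][0] for pos, counter in votes.items()}

# Toy example: two window configurations overlapping on positions 2-3.
a = [(0, 's0'), (1, 's3'), (2, 's1'), (3, 's2')]
b = [(2, 's1'), (3, 's2'), (4, 's0')]
merged = merge(a, b)
print(merged)                 # assembled composite configuration
print(vote([merged, a, b]))   # sense chosen per word position
```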
This paper discusses the importance of the clustering method used in unsupervised word sense disambiguation. It illustrates the fact that a powerful clustering technique can compensate for the lack of external knowledge of all types. It argues that feature selection does not always improve disambiguation results, especially when an advanced, state-of-the-art method is used, exemplified here by spectral clustering. Disambiguation results obtained with spectral clustering for the main parts of speech (nouns, adjectives, verbs) are compared to those of the classical clustering method given by the Naïve Bayes model. For unsupervised word sense disambiguation with an underlying Naïve Bayes model, feature selection performed in two completely different ways is surveyed. The type of feature selection providing the best results (WordNet-based feature selection) is also used in the case of spectral clustering. The conclusion is that spectral clustering without feature selection (but with its own feature weighting) produces superior disambiguation results for all parts of speech.
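To make the clustering step concrete, here is a minimal sketch of sense discrimination by spectral clustering of context vectors, using scikit-learn; the bag-of-words representation, the affinity choice, and all parameters are assumptions for illustration, not the authors' setup.

```python
# Minimal sketch: cluster occurrences of an ambiguous word by spectral clustering.
# Each context (the words around one occurrence) becomes a bag-of-words vector;
# contexts falling in the same cluster are taken to share a sense.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import SpectralClustering

contexts = [
    "deposited the check at the bank before noon",
    "the bank approved the loan application",
    "fishing from the bank of the river",
    "grassy bank sloping down to the stream",
]

X = CountVectorizer().fit_transform(contexts).toarray()

# Two senses assumed for 'bank'; RBF affinity over the raw count vectors.
labels = SpectralClustering(n_clusters=2, affinity='rbf',
                            random_state=0).fit_predict(X)
print(labels)  # occurrences grouped by induced sense
```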
Word sense ambiguity has been identified as a cause of poor precision in information retrieval (IR) systems. Word sense disambiguation and discrimination methods have been defined to help systems choose which documents should be retrieved in relation to an ambiguous query. However, the only approaches that show a genuine benefit from word sense discrimination or disambiguation in IR are generally supervised ones. In this paper we propose a new unsupervised method that uses word sense discrimination in IR. The method we develop is based on spectral clustering and reorders an initially retrieved document list by boosting documents that are semantically similar to the target query. For several TREC ad hoc collections we show that our method is useful for queries containing ambiguous terms. We are interested in improving precision after 5, 10 and 30 retrieved documents (P@5, P@10 and P@30, respectively). We show that precision can be improved by 8% above current state-of-the-art baselines. We also focus on poorly performing queries.
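A minimal sketch of the reranking idea follows. Plain TF-IDF cosine similarity stands in for the paper's spectral-clustering-based semantic similarity, and the blending scheme, function names, and scores are illustrative assumptions, not the paper's method.

```python
# Illustrative reranking: boost retrieved documents semantically close to the query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank(query, docs, initial_scores, alpha=0.5):
    """Blend each document's retrieval score with its similarity to the query."""
    tfidf = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]
    blended = [alpha * s + (1 - alpha) * sim
               for s, sim in zip(initial_scores, sims)]
    order = sorted(range(len(docs)), key=lambda i: blended[i], reverse=True)
    return [docs[i] for i in order]

docs = ["java coffee beans roasting", "java virtual machine tuning",
        "island of java travel guide"]
print(rerank("java programming", docs, initial_scores=[0.9, 0.7, 0.8]))
```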