ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing

Butnaru, Andrei M.; Ionescu, Radu Tudor; Hristea, Florentina

doi:10.18653/v1/e17-1086

Cited by 8 publications

(16 citation statements)

References 30 publications

“…A score is assigned to each sense configuration by computing the semantic relatedness between word senses (steps 16-19), as described by Patwardhan et al [33]. Butnaru et al [25] alternatively employed two measures to compute the semantic relatedness, one is the extended Lesk measure [31], [32] and the other is a simple approach based on deriving sense embeddings from word embeddings [36]. In this paper, we propose a third approach that is based on clustering word vectors with k-means and on eliminating the smaller clusters (which contain outlier words).…”

Section: Methods (mentioning)

confidence: 99%
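
The outlier-removal step mentioned in the statement above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the k-means cluster assignments have already been computed, and the names `drop_small_clusters`, `prune_sense_bags` and `min_size` are hypothetical.

```python
from collections import Counter


def drop_small_clusters(labels, min_size):
    """Return the words whose cluster has at least `min_size` members.

    `labels` maps each word to a cluster id (assumed to come from k-means).
    Clusters below the threshold are treated as outliers and dropped whole.
    """
    sizes = Counter(labels.values())
    return {word for word, cluster in labels.items() if sizes[cluster] >= min_size}


def prune_sense_bags(sense_bags, kept_words):
    """Remove the words of eliminated clusters from every sense bag."""
    return {
        sense: [word for word in bag if word in kept_words]
        for sense, bag in sense_bags.items()
    }


# Toy data: cluster 1 holds a single word, so it is eliminated.
labels = {"river": 0, "water": 0, "shore": 0, "zebra": 1}
kept = drop_small_clusters(labels, min_size=2)
bags = {"bank#1": ["river", "zebra", "water"], "bank#2": ["zebra"]}
pruned = prune_sense_bags(bags, kept)
```

In this toy example a sense bag can end up empty, so a caller would need to decide how to handle senses with no remaining words.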

“…In this paper, we present an improved version of a recently introduced WSD algorithm [25], termed ShotgunWSD, which stems from the Shotgun genome sequencing technique [26], [27]. ShotgunWSD is unsupervised, but it also requires knowledge in the form of WordNet synsets and relations [28], [29].…”

Section: Introduction (mentioning)

confidence: 99%

“…It employs a local WSD algorithm to build the local sense configurations. Butnaru et al [25] alternatively used two methods for this step, namely the extended Lesk measure [31], [32] and an approach based on deriving sense embeddings from word embeddings [34]-[36].…”

Section: Introduction (mentioning)

confidence: 99%

“…We present experiments on SemEval 2007 [37], Senseval-2 [38], Senseval-3 [39], SemEval 2013 [40], SemEval 2015 [41], as well as on the unified data sets [11], in order to compare ShotgunWSD 2.0 with its previous version [25], other state-of-the-art unsupervised and knowledge-based approaches [15], [18]-[23], [42]-[45], as well as the Most Common Sense (MCS) baseline. MCS is considered as one of the strongest baselines in WSD [7], surpassing all unsupervised approaches in the recent SemEval 2015 [41] WSD task.…”

Section: Introduction (mentioning)

confidence: 99%

ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation

Butnaru

Ionescu

2019

IEEE Access

Self Cite

ShotgunWSD is a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm is inspired by the Shotgun sequencing technique, which is a broadly-used whole genome sequencing approach. ShotgunWSD performs WSD at the document level based on three phases. The first phase consists of applying a brute-force WSD algorithm on short context windows selected from the document, in order to generate a short list of likely sense configurations for each window. The second phase consists of assembling the local sense configurations into longer composite configurations by prefix and suffix matching. In the third phase, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a majority voting scheme that considers only the top configurations in which the respective word appears. In this paper, we present an improved version (2.0) of ShotgunWSD which is based on a different approach for computing the relatedness score between two word senses, a step that stays at the core of building better local sense configurations. For each sense, we collect all the words from the corresponding WordNet synset, gloss and related synsets, into a sense bag. We embed the collected words from all the sense bags in the entire document into a vector space using a common word embedding framework. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we consider that clusters with fewer samples (with respect to a given threshold) represent outliers and we eliminate these clusters altogether. Words from the eliminated clusters are also removed from each and every sense bag. Finally, we compute the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense. 
We compare the improved ShotgunWSD algorithm (version 2.0) with its previous version (1.0) as well as several state-of-the-art unsupervised WSD algorithms on six benchmarks: SemEval 2007, Senseval-2, Senseval-3, SemEval 2013, SemEval 2015, and overall (unified). We demonstrate that ShotgunWSD 2.0 yields better performance than ShotgunWSD 1.0 and some other recent unsupervised or knowledge-based approaches. We also performed paired McNemar's significance tests, showing that the improvements of ShotgunWSD 2.0 over ShotgunWSD 1.0 are in most cases statistically significant at a significance level of 0.01.

INDEX TERMS: Word sense disambiguation, shotgun sequencing, word embeddings, outlier removal.
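
The second and third phases described in the abstract (assembling local configurations by prefix and suffix matching, then voting over the longest configurations) can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the published algorithm: the `(start, senses)` representation, the function names, and the `min_overlap` and `top_k` parameters are all hypothetical.

```python
from collections import Counter


def merge_if_overlapping(left, right, min_overlap=1):
    """Merge two configurations when a suffix of `left` matches a prefix of `right`.

    A configuration is (start, senses): `start` is the document position of its
    first word and `senses` is a tuple with one sense label per word.
    Returns the merged configuration, or None when the two do not match.
    """
    l_start, l_senses = left
    r_start, r_senses = right
    overlap = l_start + len(l_senses) - r_start
    # `right` must start inside `left` and extend past its end.
    if r_start <= l_start or overlap < min_overlap or overlap >= len(r_senses):
        return None
    if l_senses[-overlap:] != r_senses[:overlap]:
        return None
    return (l_start, l_senses + r_senses[overlap:])


def vote(configurations, top_k=2):
    """Choose one sense per position by majority vote.

    For each position, only the `top_k` longest configurations that cover
    the position are allowed to vote.
    """
    ranked = sorted(configurations, key=lambda c: len(c[1]), reverse=True)
    votes, covering = {}, Counter()
    for start, senses in ranked:
        for offset, sense in enumerate(senses):
            pos = start + offset
            if covering[pos] < top_k:
                covering[pos] += 1
                votes.setdefault(pos, Counter())[sense] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in votes.items()}


merged = merge_if_overlapping((0, ("a1", "b2", "c1")), (2, ("c1", "d3")))
rejected = merge_if_overlapping((0, ("a1", "b2", "c1")), (2, ("c2", "d3")))
chosen = vote([merged, (1, ("b2", "c1")), (2, ("c2",))], top_k=2)
```

Here the two windows agree on the sense of the shared word at position 2, so they merge into one four-word configuration; the short configuration proposing `c2` is ranked last and does not get a vote at that position.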

“…Based on WordNet, we form the sense bag for a given synset by collecting the words found in the gloss of the synset (examples included) as well as the words found in the glosses of semantically related synsets. The semantic relations are chosen based on the part-of-speech of the target word, as described in (Butnaru et al, 2017). To derive the sense embedding, we embed the collected words in an embedding space and compute the median of the resulted word vectors.…”

Section: Feature Extraction (mentioning)

confidence: 99%
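
The sense-embedding step quoted above can be illustrated with a short sketch. It assumes the median is taken component-wise and that words without a vector are simply skipped; both are assumptions, and `sense_embedding` and `word_vectors` are hypothetical names rather than anything from the cited papers.

```python
from statistics import median


def sense_embedding(sense_bag, word_vectors):
    """Component-wise median of the vectors of the words in a sense bag.

    `word_vectors` maps a word to a list of floats (all the same length).
    Words with no vector are skipped; returns None if nothing remains.
    """
    vectors = [word_vectors[word] for word in sense_bag if word in word_vectors]
    if not vectors:
        return None
    # zip(*vectors) groups the values of each dimension together.
    return [median(component) for component in zip(*vectors)]


# Toy two-dimensional vectors, for illustration only.
toy_vectors = {
    "river": [1.0, 0.0],
    "shore": [3.0, 4.0],
    "water": [2.0, 10.0],
}
embedding = sense_embedding(["river", "shore", "water", "unknown"], toy_vectors)
```

Compared with the mean, the median is less affected by a single word whose vector lies far from the rest of the bag.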

UnibucKernel: A kernel-based learning method for complex word identification

Butnaru

Ionescu

2018

Proceedings of the Thirteenth Workshop on Innovative Use of NLP For Building Educational Applications

Self Cite

In this paper, we present a kernel-based learning approach for the 2018 Complex Word Identification (CWI) Shared Task. Our approach is based on combining multiple low-level features, such as character n-grams, with high-level semantic features that are either automatically learned using word embeddings or extracted from a lexical knowledge base, namely WordNet. After feature extraction, we employ a kernel method for the learning phase. The feature matrix is first transformed into a normalized kernel matrix. For the binary classification task (simple versus complex), we employ Support Vector Machines. For the regression task, in which we have to predict the complexity level of a word (a word is more complex if it is labeled as complex by more annotators), we employ ν-Support Vector Regression. We applied our approach only on the three English
