This study focuses on clustering high-dimensional text data, addressing three weaknesses of K-means: its poor performance on high-dimensional data, the need to specify the number of clusters in advance, and its sensitivity to randomly selected initial centers. We propose a Stacked-Random Projection dimensionality reduction framework and DPC-K-means, an enhanced K-means algorithm based on an improved density peaks clustering algorithm. The improved density peaks algorithm determines the number of clusters and the initial cluster centers for K-means. The proposed algorithm is validated on seven text datasets. Experimental results show that, by correcting these defects of K-means, the algorithm is well suited to clustering text data.
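The density-peaks initialization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: density peaks clustering (high local density rho combined with a large distance delta to any denser point) selects the initial centers, which then seed K-means. The cutoff-distance quantile and the rho*delta selection rule are assumed heuristics, and the toy Gaussian data stands in for projected text vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def dpc_centers(X, dc_quantile=0.2, n_centers=None):
    """Pick initial centers via density peaks: high local density rho
    and large distance delta to the nearest denser point."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dc = np.quantile(d[d > 0], dc_quantile)          # cutoff distance (assumed heuristic)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0   # Gaussian-kernel local density
    delta = np.zeros(len(X))
    for i in range(len(X)):
        denser = np.where(rho > rho[i])[0]           # points with higher density
        delta[i] = d[i].max() if len(denser) == 0 else d[i, denser].min()
    gamma = rho * delta                              # decision score: peaks have both high
    if n_centers is None:                            # if k is not given, threshold gamma
        n_centers = max(int((gamma > gamma.mean() + 2 * gamma.std()).sum()), 1)
    return X[np.argsort(gamma)[::-1][:n_centers]]

rng = np.random.default_rng(0)
# Three well-separated toy clusters in 5 dimensions.
X = np.vstack([rng.normal(m, 0.3, size=(50, 5)) for m in (0.0, 3.0, 6.0)])
centers = dpc_centers(X, n_centers=3)
# Seed K-means with the density-peak centers instead of random initialization.
labels = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit_predict(X)
```

Because the density peaks step supplies both the number of clusters and the initial centers, the K-means run itself needs only a single initialization (`n_init=1`).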
Word sense disambiguation is a basic task in Natural Language Processing that aims to identify the most appropriate sense of an ambiguous word in a given context. In this work, we propose a model that uses a stacked bidirectional Long Short-Term Memory neural network and an attention mechanism to determine the sense of ambiguous words. First, the stacked bidirectional Long Short-Term Memory network produces a deep embedding-based representation of sentences containing ambiguous words. Then, a self-attention mechanism highlights the contextual features of the ambiguous words and constructs the overall semantic representation of each sentence. Finally, the sentence representation is fed to a multilayer perceptron classifier to predict the category of the ambiguous word's sense. The model is tested on the SemEval-2007 Task 17 English lexical sample dataset, with examples of ambiguous words sourced from the Oxford, Cambridge, and Collins dictionaries as additional test data. The effectiveness of the proposed approach is demonstrated through comparison with existing word sense disambiguation approaches. Our experimental results show that the proposed model outperforms other word sense disambiguation methods on the evaluation metrics (Average Accuracy, Micro F1-Score, Kappa, and Matthews Correlation Coefficient) and exhibits strong interpretability.
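The attention-pooling step described above can be illustrated with a toy NumPy sketch: given per-token hidden states (as a stacked BiLSTM would produce), self-attention weights score each token's relevance, and the weighted sum forms the sentence representation fed to the classifier. All weights below are random placeholders, not the paper's trained parameters, and the single scoring vector is a simplified stand-in for the full attention mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(H, w):
    """H: (T, d) token hidden states; w: (d,) learned scoring vector."""
    scores = H @ w                    # one relevance score per token
    alpha = softmax(scores)           # attention weights sum to 1
    return alpha @ H, alpha           # (d,) pooled sentence vector

rng = np.random.default_rng(0)
T, d, n_senses = 8, 16, 4             # tokens, hidden size, candidate senses
H = rng.normal(size=(T, d))           # stand-in for stacked BiLSTM outputs
w = rng.normal(size=d)
sent, alpha = attention_pool(H, w)

# Final step: a linear classification layer over the pooled vector
# (a simplified stand-in for the multilayer perceptron classifier).
W_out = rng.normal(size=(d, n_senses))
probs = softmax(sent @ W_out)
pred = int(np.argmax(probs))          # predicted sense index
```

The attention weights `alpha` are also what gives the model its interpretability: they show which context tokens drove the sense prediction.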
Polysemy is an inherent characteristic of natural language. To make it easier to distinguish between different senses of polysemous words, we propose a method for encoding each distinct sense of a polysemous word with its own vector. The method first uses a two-layer bidirectional long short-term memory neural network and a self-attention mechanism to extract the contextual information of polysemous words. Then, a K-means algorithm, improved by optimizing the density peaks clustering algorithm with cosine similarity, performs word sense induction on this contextual information. Finally, the method constructs the corresponding word sense embeddings for the polysemous words. Experimental results demonstrate that the proposed method produces better word sense induction than Euclidean distance, Pearson correlation, or KL-divergence, and more accurate word sense embeddings than mean shift, DBSCAN, spectral clustering, and agglomerative clustering.
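The sense-induction idea described above can be sketched minimally: context vectors of a polysemous word are compared by cosine similarity, grouped, and each group's centroid serves as that sense's embedding. The greedy threshold grouping below is an illustrative stand-in for the paper's cosine-based density-peaks/K-means procedure, and the similarity threshold is an assumed value.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def induce_senses(contexts, threshold=0.7):
    """Greedily assign each context vector to the most cosine-similar
    existing sense centroid, or open a new sense if none is similar enough."""
    senses = []                                        # lists of context vectors
    for v in contexts:
        sims = [cosine_sim(v, np.mean(s, axis=0)) for s in senses]
        if sims and max(sims) >= threshold:
            senses[int(np.argmax(sims))].append(v)
        else:
            senses.append([v])
    return [np.mean(s, axis=0) for s in senses]        # one embedding per sense

rng = np.random.default_rng(1)
# Two artificial "senses": context vectors near two distinct random directions.
base_a, base_b = rng.normal(size=50), rng.normal(size=50)
contexts = [base_a + rng.normal(scale=0.1, size=50) for _ in range(10)]
contexts += [base_b + rng.normal(scale=0.1, size=50) for _ in range(10)]
embeddings = induce_senses(contexts)                   # two sense embeddings
```

Cosine similarity is the key choice here: for high-dimensional context representations, direction carries the semantic signal, so two contexts of the same sense score near 1 while contexts of different senses score near 0.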
To address the long training times incurred when classifying high-dimensional data, we propose a parallel classification model based on random projection and Bagging-support vector machine (SVM). The model first uses random projection to project the input data into a low-dimensional space. Then, the Bagging method constructs multiple training subsets, and an SVM is trained on each subset in parallel, producing several sub-classifiers. Finally, the sub-classifiers vote to determine the category of each test sample. The model is verified on two standard datasets. Experimental results show that it significantly improves training speed on high-dimensional data with little loss of classification accuracy.
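The pipeline above can be sketched with scikit-learn stand-ins: Gaussian random projection reduces dimensionality, then a bagging ensemble of linear SVMs is trained on bootstrap subsets in parallel, with majority voting at prediction time. The toy two-class dataset and all parameter values are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per, d = 200, 500                               # high-dimensional toy data
X = np.vstack([rng.normal(0.0, 1.0, size=(n_per, d)),   # class 0
               rng.normal(0.6, 1.0, size=(n_per, d))])  # class 1, shifted mean
y = np.array([0] * n_per + [1] * n_per)
idx = rng.permutation(2 * n_per)                  # shuffle before splitting
X, y = X[idx], y[idx]

# Step 1: project 500-dimensional inputs down to 50 dimensions.
proj = GaussianRandomProjection(n_components=50, random_state=0)
X_low = proj.fit_transform(X)

# Step 2: Bagging trains each SVM on a bootstrap subset; n_jobs=-1 runs the
# sub-classifiers in parallel. Step 3: predict() aggregates their votes.
model = BaggingClassifier(SVC(kernel="linear"), n_estimators=5,
                          n_jobs=-1, random_state=0)
model.fit(X_low[:300], y[:300])
acc = model.score(X_low[300:], y[300:])           # held-out accuracy
```

The speedup comes from two sources that compound: each SVM sees 50 features instead of 500, and the five sub-classifiers train concurrently instead of one large SVM training sequentially.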