Summary
Short text similarity plays an important role in natural language processing (NLP) and has been applied in many fields. Because short texts lack sufficient context, measuring their similarity is difficult. Using semantic similarity to calculate textual similarity has attracted attention from academia and industry and has achieved better results. In this survey, we conduct a comprehensive and systematic analysis of semantic similarity. We first propose three categories of semantic similarity: corpus‐based, knowledge‐based, and deep learning (DL)‐based. We analyze the pros and cons of representative and novel algorithms in each category. Our analysis also covers the applications of these similarity measurement methods in other areas of NLP. We then evaluate state‐of‐the‐art DL methods on four common datasets; the results show that DL‐based methods better address the challenges of short text similarity, such as sparsity and complexity. In particular, the bidirectional encoder representations from transformers (BERT) model can fully exploit the scarce information in short texts together with semantic information, and it obtains higher accuracy and F1 values. Finally, we put forward some future directions.
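The sparsity problem named in this abstract can be illustrated with a simple lexical baseline. The sketch below is not one of the surveyed methods; it is an assumed word-count cosine similarity (the function name `cosine_similarity` is hypothetical), shown only to make clear why short texts with the same meaning but no shared words defeat purely surface-level measures.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that both texts contain.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Two paraphrases with no words in common score zero under this baseline.
print(cosine_similarity("cheap flights", "low cost airfare"))
```

Because the two example phrases share no tokens, the baseline reports no similarity at all, which is the gap that semantic similarity measures are meant to close.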
Named entity recognition (NER) for cyber security aims to identify and classify cyber security terms in a large volume of heterogeneous, multisource cyber security texts. In machine learning, deep neural networks automatically learn text features from large datasets, but this data-driven approach usually struggles with rare entities. Gasmi et al. proposed a deep learning method for named entity recognition in the field of cyber security and achieved good results, reaching an F1 value of 82.8%, but it has difficulty accurately identifying rare entities and complex words in the text. To cope with this challenge, this paper proposes a new model that combines data-driven deep learning methods with knowledge-driven dictionary methods, building dictionary features to assist in rare entity recognition. In addition, on top of the data-driven deep learning model, an attention mechanism is adopted to enrich the local features of the text, better model the context, and improve the recognition of complex entities. Experimental results show that our method outperforms the baseline model and is more effective in identifying cyber security entities: precision, recall, and F1 value reach 90.19%, 86.60%, and 88.36%, respectively.
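The abstract does not say how its dictionary features are built. One common way to do it, sketched here as an assumption, is to tag each token by greedy longest match against an entity dictionary and pass the tags to the neural model as extra input. The function name `dictionary_features`, the `max_len` parameter, and the sample dictionary entries are all hypothetical.

```python
def dictionary_features(tokens, dictionary, max_len=4):
    """Tag tokens with B-/I-<type> when they fall inside a dictionary phrase.

    `dictionary` maps lowercase phrases to entity types. Matching is greedy,
    left to right, preferring the longest phrase of up to `max_len` tokens.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(lowered[i:i + n])
            if phrase in dictionary:
                label = dictionary[phrase]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical dictionary entries, for illustration only.
entity_dictionary = {"zeus": "MALWARE", "internet explorer": "SOFTWARE"}
sentence = ["The", "Zeus", "trojan", "exploits", "Internet", "Explorer"]
print(dictionary_features(sentence, entity_dictionary))
```

A rare entity that the network has seldom seen in training still receives a non-`O` tag whenever the dictionary contains it, which is the intuition behind combining the two approaches.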
Tool fault diagnosis in numerical control (NC) machines plays a significant role in ensuring manufacturing quality. However, current methods of tool fault diagnosis lack accuracy. Therefore, this paper proposes a fault diagnosis method based on stationary subspace analysis (SSA) and the least squares support vector machine (LS-SVM) that uses only a single sensor. First, the dimensionality of the vibration signal observed by the single sensor was expanded by the phase space reconstruction technique. SSA was then used to extract stationary and non-stationary sources from the resulting multi-dimensional signals, without requiring the source signals to be independent and without prior information about them. Subsequently, 10 dimensionless parameters in the time-frequency domain were calculated for the non-stationary sources to generate samples for training the LS-SVM. Finally, the vibration signals measured from tools of unknown state were separated by SSA, and their non-stationary sources served as test samples for the trained LS-SVM. The experimental validation demonstrated that the proposed method has better diagnosis accuracy than three previous methods: LS-SVM alone, principal component analysis with LS-SVM, and SSA with linear discriminant analysis.
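The phase space reconstruction step mentioned in this abstract is commonly done by time-delay embedding, which turns a one-dimensional signal into a sequence of multi-dimensional vectors. The sketch below assumes that standard construction. The function name `delay_embed` and the embedding dimension and delay in the example are hypothetical, since the abstract does not give the values the authors used.

```python
def delay_embed(signal, dim, delay):
    """Time-delay embedding of a 1-D signal.

    Returns vectors [x[i], x[i + delay], ..., x[i + (dim - 1) * delay]]
    for every start index i at which the whole vector fits in the signal.
    """
    if dim < 1 or delay < 1:
        raise ValueError("dim and delay must be positive integers")
    n_vectors = len(signal) - (dim - 1) * delay
    if n_vectors <= 0:
        raise ValueError("signal too short for the requested dim and delay")
    return [
        [signal[i + k * delay] for k in range(dim)]
        for i in range(n_vectors)
    ]


# A toy single-sensor signal expanded to three dimensions with delay 2.
samples = [0, 1, 2, 3, 4, 5]
for vector in delay_embed(samples, dim=3, delay=2):
    print(vector)
```

Each output vector can be treated as one multi-channel observation, which is what allows a source separation method such as SSA to be applied to data from a single sensor.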
Tweets offer only a limited number of features and contain noisy text, which makes extracting valuable information from Twitter, and hot topic detection in particular, a challenging task. In this paper, a novel four-stage framework is proposed to improve the performance of topic detection. Data preprocessing is the first stage. Deep learning is then exploited to enrich short text information via image understanding. Next, improved latent Dirichlet allocation is used to optimize the image effective word pairs, which improves the accuracy of the extracted topic words. Finally, short text and images are integrated for topic detection, in which the corresponding topics are mined based on fuzzy matching of topic words. Extensive experiments show that the proposed framework significantly improves the performance of topic detection and outperforms the selected baseline methods.
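The abstract does not specify how the fuzzy matching of topic words is carried out. As a rough illustration only, the sketch below pairs topic words from text with topic words from images using the string similarity ratio from Python's standard `difflib` module. The function name `fuzzy_match`, the 0.8 threshold, and the sample words are assumptions, not details from the paper.

```python
from difflib import SequenceMatcher


def fuzzy_match(text_words, image_words, threshold=0.8):
    """Pair each text topic word with its most similar image topic word.

    A pair is kept only when the similarity ratio reaches `threshold`.
    """
    matches = []
    for text_word in text_words:
        best_word, best_score = None, 0.0
        for image_word in image_words:
            score = SequenceMatcher(
                None, text_word.lower(), image_word.lower()
            ).ratio()
            if score > best_score:
                best_word, best_score = image_word, score
        if best_word is not None and best_score >= threshold:
            matches.append((text_word, best_word))
    return matches


# Hypothetical topic words extracted from tweet text and from images.
print(fuzzy_match(["football", "election"], ["footballs", "cat"]))
```

Character-level matching of this kind tolerates spelling variants and plural forms, which are frequent in noisy tweets, but it does not capture synonyms.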