This work presents a new alignment word-space approach for measuring the similarity between two text snippets. The approach combines two similarity measurement methods: alignment-based and vector space-based. The vector space-based method depends on a semantic net that represents the meaning of words as vectors. These vectors are lemmatized to enrich the search space. The alignment-based method generates an alignment word space matrix (AWSM) for the text snippets from the generated semantic word spaces. Finally, the degree of semantic similarity between the sentences is measured using a set of proposed alignment rules. Four experiments on two different datasets were carried out to evaluate the performance of the proposed approach. The experimental results showed that lemmatizing both the input text and the vector model improves performance. The degree of correctness of the results reaches 0.7212, which ranks among the two best published results for Arabic semantic similarity.
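The alignment step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the alignment rules, so the binary matrix cells, the greedy one-to-one matching, and the Dice-style score are all assumptions. The names `build_awsm`, `alignment_similarity`, and the toy `spaces` dictionary are hypothetical, and English tokens stand in for Arabic lemmas.

```python
from typing import Dict, List, Set

WordSpaces = Dict[str, Set[str]]


def related(x: str, y: str, spaces: WordSpaces) -> bool:
    """Assumed rule: two lemmas match if equal or if one lies in the other's word space."""
    return x == y or y in spaces.get(x, set()) or x in spaces.get(y, set())


def build_awsm(a: List[str], b: List[str], spaces: WordSpaces) -> List[List[int]]:
    """Build a binary alignment matrix: cell [i][j] is 1 when a[i] and b[j] are related."""
    return [[1 if related(x, y, spaces) else 0 for y in b] for x in a]


def alignment_similarity(a: List[str], b: List[str], spaces: WordSpaces) -> float:
    """Greedy one-to-one alignment over the matrix, scored as 2 * matches / (|a| + |b|).

    Both the greedy matching and the scoring formula are illustrative assumptions.
    """
    if not a or not b:
        return 0.0
    matrix = build_awsm(a, b, spaces)
    used_columns: Set[int] = set()
    matches = 0
    for row in matrix:
        for j, cell in enumerate(row):
            if cell and j not in used_columns:
                used_columns.add(j)
                matches += 1
                break
    return 2 * matches / (len(a) + len(b))


# Toy, hand-written word space (hypothetical data, not from any semantic net).
spaces = {"car": {"automobile", "vehicle"}}
score = alignment_similarity(
    ["man", "drive", "car"], ["man", "drive", "automobile"], spaces
)
```

In this toy case every lemma in the first snippet finds a partner in the second, so the score is 1.0. A real system would derive the word spaces from the lemmatized vector model rather than from a hand-written dictionary.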
Textual similarity is one of the most important aspects of information retrieval. This paper presents several techniques for semantic textual similarity, along with the factors that influence them. Two hybrid approaches for measuring the degree of similarity between two Arabic text snippets are presented. The first approach combines word-based and vector-based similarity methods to construct a semantic word space for each word of the input text. These words are represented in their lemma forms to capture all semantically related words. The semantic word spaces are then used to find the best matching between the words of the two input texts, from which the degree of similarity between the snippets is computed. The second approach combines semantic and syntactic methods. Its core is the basic Levenshtein distance, modified to measure the edit cost at the token level rather than the character level. The semantic word spaces are added to this approach so that semantic features complement the syntactic ones. Additional techniques are embedded to overcome weaknesses of the syntactic approach, such as its sensitivity to word order. The Pearson correlation coefficient is used to measure the degree of correctness of the two proposed approaches against two benchmark datasets. The experiments achieved correlations of 0.7212 and 0.7589 for the two proposed approaches on two different datasets.
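The token-level Levenshtein idea can be sketched as follows. This is a hedged illustration, not the paper's algorithm. It assumes unit costs for insertion and deletion, a substitution cost of zero when two tokens are equal or share a semantic word space, and a normalization by the longer token count. The names `token_edit_distance` and `token_similarity` and the toy `spaces` dictionary are hypothetical, and the word-order handling mentioned in the abstract is not modeled.

```python
from typing import Dict, List, Set

WordSpaces = Dict[str, Set[str]]


def related(x: str, y: str, spaces: WordSpaces) -> bool:
    """Assumed rule: tokens match if equal or if one lies in the other's word space."""
    return x == y or y in spaces.get(x, set()) or x in spaces.get(y, set())


def token_edit_distance(a: List[str], b: List[str], spaces: WordSpaces) -> int:
    """Levenshtein distance over tokens instead of characters.

    Substituting semantically related tokens costs 0 (an assumption);
    every other insertion, deletion, or substitution costs 1.
    """
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            substitution = 0 if related(x, y, spaces) else 1
            current.append(
                min(
                    previous[j] + 1,                 # delete x
                    current[j - 1] + 1,              # insert y
                    previous[j - 1] + substitution,  # substitute x with y
                )
            )
        previous = current
    return previous[-1]


def token_similarity(a: List[str], b: List[str], spaces: WordSpaces) -> float:
    """Map the distance to [0, 1] by dividing by the longer token count (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - token_edit_distance(a, b, spaces) / longest


# Toy, hand-written word space (hypothetical data).
spaces = {"boy": {"child"}}
first = ["the", "boy", "eats", "apple"]
second = ["the", "child", "eats"]
similarity = token_similarity(first, second, spaces)
```

With the word space in place, only the deletion of "apple" is charged, giving a distance of 1 and a similarity of 0.75. Without it, the "boy"/"child" substitution is also charged and the similarity drops to 0.5.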