Text categorization or classification (TC) is concerned with placing text documents into their proper categories according to their contents. Given the many applications of TC and the large volume of text documents uploaded to the Internet daily, the need for an automated method stems from the difficulty and tedium of performing this process manually. The usefulness of TC is manifested in many fields and needs. For instance, the ability to automatically classify an article or an email into its right class (Arts, Economics, Politics, Sports, etc.) would be appreciated by individual users as well as companies. This paper is concerned with the TC of Arabic articles. It compares the five best-known algorithms for TC and studies the effect of different Arabic stemmers (light and root-based) on the effectiveness of these classifiers. Furthermore, a comparison between two data mining software tools (Weka and RapidMiner) is presented. The results illustrate the good accuracy provided by the SVM classifier, especially when used with the light10 stemmer. This outcome can serve in the future as a baseline against which other, as yet unexplored, classifiers and Arabic stemmers can be compared.
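As an illustration only, the following is a minimal sketch of the kind of pipeline compared in the paper: light-stemmed Arabic tokens fed into an SVM classifier. It assumes scikit-learn is available; the light_stem function is a hypothetical placeholder standing in for a real light stemmer such as light10, whose actual rules are not reproduced here.

    # Sketch: stem Arabic tokens, vectorize with TF-IDF, train a linear SVM.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline

    def light_stem(token):
        # Placeholder: a real light stemmer strips common Arabic prefixes
        # and suffixes; only a few prefixes are shown for illustration.
        prefixes = ("ال", "وال", "بال", "كال", "فال")
        for p in prefixes:
            if token.startswith(p) and len(token) > len(p) + 2:
                return token[len(p):]
        return token

    def analyzer(doc):
        # Tokenize on whitespace, then stem each token.
        return [light_stem(t) for t in doc.split()]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=analyzer)),
        ("svm", LinearSVC()),
    ])
    # Usage: clf.fit(train_texts, train_labels); clf.predict(test_texts)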
We have put together a corpus of 242 abstracts of Arabic documents using the Proceedings of the Saudi Arabian National Computer Conferences as a source. All of these abstracts concern computer science and information systems. We also designed and built, from scratch, an automatic information retrieval system to handle Arabic data. The system was implemented in C using the GCC compiler and runs on IBM PCs and compatible microcomputers. We implemented both automatic and manual indexing techniques for this corpus. A long series of experiments using recall and precision measures demonstrated that automatic indexing is at least as effective as manual indexing, and more effective in some cases. Since automatic indexing is both cheaper and faster, our results suggest that we can achieve wider coverage of the literature at less cost and produce results as good as those of manual indexing. We also compared retrieval results using words as index terms versus stems and roots, and confirmed the finding obtained by Al-Kharashi and Abu-Salem with smaller corpora: root indexing is more effective than word indexing. © 1997 John Wiley & Sons, Inc.
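The recall and precision measures used in these experiments are the standard ones; a minimal sketch of how they are computed for a single query follows (the system described above was written in C, but Python is used here, as in the other sketches, for brevity).

    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    def precision_recall(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of 5 retrieved documents are relevant, out of 4 relevant overall.
    p, r = precision_recall([1, 2, 3, 4, 5], [2, 3, 5, 9])
    print(p, r)  # 0.6 0.75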
Most research on Arabic root extraction focuses on removing affixes from Arabic words. This process adds processing overhead and may remove non-affix letters, which leads to the extraction of incorrect roots. This paper proposes a new approach to this issue by introducing a new algorithm for extracting the roots of Arabic words. The proposed algorithm, called the Word Substring Stemming (WSS) algorithm, does not remove affixes during the extraction process. Instead, it produces the set of all substrings of an Arabic word and uses the Arabic roots file, the Arabic patterns file, and a concrete set of rules to extract correct roots from those substrings. Experiments have shown that the proposed approach is competitive, with an accuracy of 83.9%. Furthermore, its accuracy can be improved further in the sense that, for about 9.9% of the tested words, the WSS algorithm retrieves two candidates (in most cases) for the correct root.
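To make the substring-based idea concrete, here is a minimal sketch of the enumeration-and-lookup step, assuming the roots file has been loaded as a set of strings. The patterns file and the paper's rule set are not reproduced here, so this illustrates only the core idea of matching substrings against known roots rather than stripping affixes.

    # Enumerate candidate substrings of an Arabic word and keep the known roots.
    def substrings(word, min_len=3, max_len=4):
        # Arabic roots are mostly triliteral or quadriliteral, hence 3..4.
        return {word[i:i + n]
                for n in range(min_len, max_len + 1)
                for i in range(len(word) - n + 1)}

    def candidate_roots(word, known_roots):
        # Intersect the word's substrings with the roots file entries.
        return substrings(word) & set(known_roots)

    # Toy example: the root كتب occurs contiguously inside يكتبون.
    print(candidate_roots("يكتبون", {"كتب", "درس"}))  # {'كتب'}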
The performance of spatial queries depends mainly on the underlying index structure used to handle them. The R-tree, a well-known spatial index structure, suffers largely from high overlap and high coverage, resulting mainly from the splitting of overflowed nodes. Assigning the remaining entries to the underfull node in order to meet the R-tree minimum fill constraint (the remaining-entries problem) may induce high overlap or high coverage: because this assignment ignores the geometric features of the remaining entries, it can cause a severely non-optimized expansion of that node. This paper presents a solution to this problem. The proposed splitting algorithm distributes rectangles as follows: (1) assign to the first node the m entries nearest to the first seed; (2) assign to the second node the m entries nearest to the second seed; (3) assign the remaining entries one by one to the nearest seed. Several experiments on real as well as synthetic data show that the proposed splitting algorithm outperforms the efficient version of the original R-tree in terms of query performance.
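A minimal sketch of the proposed three-step distribution follows. It assumes entries are axis-aligned rectangles given as (xmin, ymin, xmax, ymax) tuples, that the two seeds have already been chosen (e.g., by the usual pick-seeds step), and that "nearest" means center-to-center distance; the paper's exact distance definition may differ.

    # Distribute overflowed entries between two nodes per the three steps above.
    def center(r):
        return ((r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0)

    def dist2(a, b):
        # Squared Euclidean distance; monotone, so fine for nearest comparisons.
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def split(entries, seed1, seed2, m):
        # (1) m entries nearest the first seed go to the first node.
        by_s1 = sorted(entries, key=lambda r: dist2(center(r), center(seed1)))
        node1 = by_s1[:m]
        rest = [r for r in entries if r not in node1]
        # (2) m of the remaining entries nearest the second seed go to the second node.
        by_s2 = sorted(rest, key=lambda r: dist2(center(r), center(seed2)))
        node2 = by_s2[:m]
        remaining = [r for r in rest if r not in node2]
        # (3) each remaining entry goes to the node whose seed is nearer.
        for r in remaining:
            if dist2(center(r), center(seed1)) <= dist2(center(r), center(seed2)):
                node1.append(r)
            else:
                node2.append(r)
        return node1, node2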
Root extraction is one of the most important topics in information retrieval (IR), natural language processing (NLP), text summarization, and many other fields. In the last two decades, several algorithms have been proposed to extract Arabic roots. Most of these algorithms deal with triliteral roots only, and some with fixed-length words only. In this study, a novel approach to extracting roots from Arabic words using bigrams is proposed. Two measures are used: a dissimilarity measure, the "Manhattan distance," and Dice's measure of similarity. The proposed algorithm is tested on the Holy Qur'an and on a corpus of 242 abstracts from the Proceedings of the Saudi Arabian National Computer Conferences. The two files cover a wide range of data: the Holy Qur'an contains most of the ancient Arabic words, while the other file contains some modern Arabic words and some words borrowed from foreign languages, in addition to original Arabic words. The results of this study show that combining N-grams with the Dice measure gives better results than using the Manhattan distance measure.
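For concreteness, a minimal sketch of the two measures over letter bigrams follows. Dice's coefficient is the standard 2|A∩B| / (|A| + |B|); the Manhattan distance is taken here as the sum of absolute differences of bigram counts, which is one common reading of the measure and may differ in detail from the paper's formulation.

    # Compare a word and a candidate root by their letter-bigram profiles.
    from collections import Counter

    def bigrams(word):
        return [word[i:i + 2] for i in range(len(word) - 1)]

    def dice(w1, w2):
        # Dice similarity on bigram sets: higher means more similar.
        a, b = set(bigrams(w1)), set(bigrams(w2))
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

    def manhattan(w1, w2):
        # Manhattan (L1) distance on bigram counts: lower means more similar.
        a, b = Counter(bigrams(w1)), Counter(bigrams(w2))
        return sum(abs(a[g] - b[g]) for g in set(a) | set(b))

A word is then matched to the root whose bigram profile scores best (highest Dice value, or lowest Manhattan distance).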
Kidney tumor (KT) is one of the diseases affecting our society and is the seventh most common tumor in both men and women worldwide. Early detection of KT has significant benefits in reducing death rates, enabling preventive measures that reduce its effects, and overcoming the tumor. Compared to the tedious and time-consuming traditional diagnosis, automatic deep learning (DL) detection algorithms can save diagnosis time, improve test accuracy, reduce costs, and reduce the radiologist's workload. In this paper, we present detection models for diagnosing the presence of KTs in computed tomography (CT) scans. For detecting and classifying KT, we propose 2D CNN models: three models concern KT detection, namely a 2D convolutional neural network with six layers (CNN-6), a ResNet50 with 50 layers, and a VGG16 with 16 layers; the last model, a 2D convolutional neural network with four layers (CNN-4), is for KT classification. In addition, a novel dataset was collected from the King Abdullah University Hospital (KAUH) consisting of 8,400 images of 120 adult patients who underwent CT scans for suspected kidney masses. The dataset was divided into 80% for the training set and 20% for the testing set. The accuracy of the detection models CNN-6, ResNet50, and VGG16 reached 97%, 96%, and 60%, respectively, while the accuracy of the CNN-4 classification model reached 92%. Our novel models achieved promising results; they enhance the diagnosis of patient conditions with high accuracy, reduce the radiologists' workload, and provide them with a tool that can automatically assess the condition of the kidneys, reducing the risk of misdiagnosis. Furthermore, improving the quality of healthcare service and early detection can change the disease's course and preserve the patient's life.
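A minimal sketch of a small 2D CNN detector in the spirit of CNN-6 follows, using Keras. The layer widths, the assumed 224x224 single-channel input, and the training settings are illustrative assumptions, not the paper's configuration.

    # Small 2D CNN for binary tumor-present / tumor-absent detection on CT slices.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(224, 224, 1)),          # grayscale CT slice (assumed size)
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),     # probability of tumor presence
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    # Usage: model.fit(train_images, train_labels, validation_split=0.2, epochs=...)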