Arabic natural language processing (ANLP) consists of developing techniques and tools that can utilize and analyze the Arabic language in both written and spoken contexts. ANLP makes an important contribution to many existing developed systems. It provides Arabic and non-Arabic speakers with helpful and convenient tools that can be used in different domains. Modern ANLP tools are developed using machine learning (ML) techniques. ML algorithms are widely used in NLP because of their high accuracy rate regardless of the robustness of the data that is used and because of the ease with which they can be implemented. On the other hand, the methodology of ANLP applications based on ML involves several distinct phases. It is, therefore, crucial to recognize and understand these phases in detail as well as the most widely used ML algorithms. This survey discusses this concept in detail, shows the involvement of ML techniques in developing such tools, and identifies well-known techniques used in ANLP. Moreover, this survey discusses the characteristics and complexity of the Arabic language in addition to the importance and needs of ANLP.INDEX TERMS Arabic natural language processing, classification, feature selection, machine learning.
Diabetes is one of the most common diseases worldwide. Many Machine Learning (ML) techniques have been utilized in predicting diabetes in the last couple of years. The increasing complexity of this problem has inspired researchers to explore the robust set of Deep Learning (DL) algorithms. The highest accuracy achieved so far was 95.1% by a combined model CNN-LSTM. Even though numerous ML algorithms were used in solving this problem, there are a set of classifiers that are rarely used or even not used at all in this problem, so it is of interest to determine the performance of these classifiers in predicting diabetes. Moreover, there is no recent survey that has reviewed and compared the performance of all the proposed ML and DL techniques in addition to combined models. This article surveyed all the ML and DL techniques-based diabetes predictions published in the last six years. In addition, one study was developed that aimed to implement those rarely and not used ML classifiers on the Pima Indian Dataset to analyze their performance. The classifiers obtained an accuracy of 68%–74%. The recommendation is to use these classifiers in diabetes prediction and enhance them by developing combined models.
Exploratory Projection Pursuit (EPP) methods have been developed thirty years ago in the context of exploratory analysis of large data sets. These methods consist in looking for low-dimensional projections that reveal some interesting structure existing in the data set but not visible in high dimension. Each projection is associated with a real valued index which optima correspond to valuable projections. Several EPP indices have been proposed in the statistics literature but the main problem lies in their optimization. In the present paper, we propose to apply Genetic Algorithms (GA) and recent Particle Swarm Optimization (PSO) algorithm to the optimization of several projection pursuit indices. We explain how the EPP methods can be implemented in order to become an efficient and powerful tool for the statistician. We illustrate our proposal on several simulated and real data sets.
Breast Cancer is a common disease among females. Early detection of the Breast Cancer aids in an easier efficient treatment. The application of Machine Learning algorithms can help in the diagnosis of this disease. There are three main problems related to Breast Cancer. The existing
works focused only on one problem. In addition, the resulted accuracy still needs improvement. This research paper aims to identify the Breast Cancer diagnosis, predict the recurrence of the disease, and predict the survivability of its patients. This is achieved by using the Feedforward Neural
Network (FFN) on the SEER (Surveillance, Epidemiology, and End Results) dataset by using different attributes and preprocessing of data for each problem. The obtained FFN classification accuracy resulted in 99.8% for the Breast Cancer diagnosis, 88.1% for the Breast Cancer recurrence, and
97.3% for the survivability.
In the recent years, Arabic Natural Language Processing, including Text summarization, Text simplification, Text Categorization and other Natural Language-related disciplines, are attracting more researchers. Appropriate resources for Arabic Text Categorization are becoming a big necessity for the development of this research. The few existing corpora are not ready for use, they require preprocessing and filtering operations. In addition, most of them are not organized based on standard classification methods which makes unbalanced classes and thus reduced the classification accuracy. This paper proposes a New Arabic Dataset (NADA) for Text Categorization purpose. This corpus is composed of two existing corpora OSAC and DAA. The new corpus is preprocessed and filtered using the recent state of the art methods. It is also organized based on Dewey decimal classification scheme and Synthetic Minority Over-Sampling Technique. The experiment results show that NADA is an efficient dataset ready for use in Arabic Text Categorization.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.