Sentiment analysis of Twitter data has recently become one of the most active research disciplines; it combines data mining technologies with natural language processing techniques. A sentiment analysis system aims to evaluate texts posted on social platforms in order to determine whether they express positive, negative, or neutral feelings about a given domain. The high dimensionality of the feature vector is one of the most common problems in Arabic sentiment analysis. The main contribution of this paper is to address the dimensionality problem through a comparative study of two feature selection algorithms, Information Gain (IG) and Chi-Square, in order to choose the one that better improves classification accuracy. The proposed Arabic Jordanian sentiment analysis model consists of four steps. First, a preprocessing step is applied to the database, comprising removal of non-Arabic symbols, tokenization, Arabic stop-word removal, and stemming. Second, the TF-IDF algorithm is used as a feature extraction method to represent each text as a feature vector. Third, IG and Chi-Square are applied as feature selection methods to obtain the best subset of features and reduce their total number. Finally, several algorithms (SVM, DT, and KNN) are used in the classification step to classify the opinions people share on Twitter into two classes, positive and negative. Several experiments were performed on Jordanian dialectal tweets from the AJGT database. The experimental results show that: 1) the Information Gain algorithm outperformed the Chi-Square algorithm in the feature selection step, reducing the number of features from 1170 to 713 and increasing classifier accuracy by 10%; and 2) the SVM classifier achieved the best classification performance among all classifiers tested, reaching the highest accuracy of 85% when combined with the IG algorithm.
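The TF-IDF, feature selection, and SVM steps described above can be sketched with scikit-learn. This is a minimal illustration on a tiny English toy corpus, not the paper's AJGT pipeline: the documents, labels, and `k` value are hypothetical stand-ins, and `mutual_info_classif` is used as scikit-learn's closest analogue of Information Gain.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in corpus; the paper uses preprocessed Arabic tweets (AJGT).
docs = ["good service fast", "bad slow service",
        "great fast reply", "terrible bad reply"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

def build_pipeline(score_func, k):
    # TF-IDF vectors -> keep the k highest-scoring features -> SVM.
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("select", SelectKBest(score_func=score_func, k=k)),
        ("clf", SVC(kernel="linear")),
    ])

# Compare the two selection criteria, as the paper does.
for name, score in [("chi2", chi2), ("info-gain", mutual_info_classif)]:
    model = build_pipeline(score, k=3).fit(docs, labels)
    print(name, model.score(docs, labels))
```

In a real experiment the selected vocabulary sizes (1170 vs. 713 in the paper) would be controlled through `k`, and accuracy would be measured on held-out tweets rather than the training set.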
The application of big data in health care is a fast-growing field, with many discoveries and methodologies published in the last five years. Big data refers to datasets that are not only large but also high in variety and velocity, which makes them difficult to handle with traditional tools and techniques. Medical data, obtained from Electronic Health Records (EHRs) or from patients themselves, is among the fastest-growing kinds of data. Due to this rapid growth, suitable tools and techniques are needed to handle these datasets and to extract value and knowledge from them, in order to improve the quality of patient care and reduce healthcare costs. Such value can be provided by big data analytics, the application of advanced analytics techniques to big data. This paper presents an overview of big data content, sources, technologies, tools, and challenges in health care, and identifies strategies to overcome those challenges.
Heart disease has become one of the major health problems affecting people's lives worldwide, and deaths due to heart disease are increasing day by day. Heart disease prediction systems therefore play an important role in prevention, assisting doctors in making the right decisions to diagnose heart disease easily. Existing prediction systems suffer from the high dimensionality of the selected features, which increases prediction time and decreases prediction accuracy because of the many redundant or irrelevant features. This paper aims to address the dimensionality problem by proposing a new hybrid model for heart disease prediction based on the Naïve Bayes method and machine learning classifiers. We propose a heart disease prediction model (NB-SKDR) based on the Naïve Bayes (NB) algorithm and several machine learning techniques: Support Vector Machine, K-Nearest Neighbors, Decision Tree, and Random Forest. The prediction model consists of three main phases: preprocessing, feature selection, and classification. Its main objective is to improve the performance of the prediction system by finding the best subset of features. The approach uses the Naïve Bayes technique, based on Bayes' theorem, to select the best subset of features for the subsequent classification phase; this handles the high dimensionality problem by discarding unnecessary features and keeping only the important ones, in an attempt to improve the efficiency and accuracy of the classifiers. By determining the dependency between sets of attributes, the method reduces the number of features from 13 to 6: age, gender, blood pressure, fasting blood sugar, cholesterol, and exercise-induced angina.
Dependent attributes are attributes that depend on one another in determining the value of the class attribute. The dependency between attributes is measured by conditional probability, which can easily be computed using Bayes' theorem. In the classification phase, the proposed system applies several classification algorithms, namely Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), to predict whether or not a patient has heart disease. The model is trained and evaluated on the Cleveland Heart Disease database, which contains 13 features and 303 samples. Different algorithms use different rules and produce different representations of knowledge, so the algorithms used to build our model were selected based on their performance. We applied and compared the four classification algorithms (DT, SVM, RF, and KNN) to identify the one best suited to achieving high accuracy in heart disease prediction. After combining the Naïve Bayes method with each of these classifiers, the performance of the combined algorithms was evaluated with several metrics (specificity, sensitivity, and accuracy). The experimental results show that, of these four classification models, the combination of the Naïve Bayes feature selection approach with the RBF-kernel SVM classifier predicts heart disease with the highest accuracy, 98%. Finally, the proposed approach was compared with two other systems that use different feature selection techniques: one based on the Genetic Algorithm (GA) and one based on Principal Component Analysis (PCA). The comparison showed that the Naïve Bayes selection approach of the proposed system outperforms both the GA and PCA approaches in terms of prediction accuracy.
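The two phases described above, Naïve Bayes-driven feature selection followed by an RBF-kernel SVM, can be sketched as follows. This is only an illustrative interpretation: the scoring rule (cross-validated accuracy of a one-feature Gaussian Naïve Bayes model as a proxy for the Bayes-theorem dependency score), the stand-in dataset, and the value `k = 6` are assumptions, not the paper's exact NB-SKDR procedure or the Cleveland data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer  # stand-in; the paper uses Cleveland
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Phase 1 (feature selection): score each feature by how well a
# single-feature Naive Bayes model predicts the class, a simple proxy
# for the class-conditional dependency computed via Bayes' theorem.
scores = [cross_val_score(GaussianNB(), X[:, [j]], y, cv=5).mean()
          for j in range(X.shape[1])]

k = 6  # the paper keeps 6 of the 13 Cleveland features
top = np.argsort(scores)[-k:]

# Phase 2 (classification): RBF-kernel SVM on the selected features.
svm = SVC(kernel="rbf", gamma="scale")
acc = cross_val_score(svm, X[:, top], y, cv=5).mean()
print(f"RBF-SVM accuracy on {k} NB-selected features: {acc:.3f}")
```

Swapping `SVC` for `DecisionTreeClassifier`, `RandomForestClassifier`, or `KNeighborsClassifier` reproduces the four-way comparison the paper reports.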
Medical dataset classification has become one of the major problems in data mining research. Every database contains a given number of features, but some of these features can be redundant or even harmful, disrupting the classification process; this is known as the high dimensionality problem. Dimensionality reduction during data preprocessing is critical for improving the performance of machine learning algorithms, and feature subset selection in particular yields a significant improvement in classification accuracy. In this paper, we propose a new hybrid feature selection approach, a Genetic Algorithm (GA) assisted by K-Nearest Neighbors (KNN), to deal with high dimensionality in biomedical data classification. The method first combines GA and KNN to find the optimal subset of features, using the classification accuracy of the KNN classifier as the fitness function of the GA. After the best subset of features is selected, a Support Vector Machine (SVM) is used as the classifier. The proposed method was evaluated on five medical datasets from the UCI Machine Learning Repository. The suggested technique performs admirably on these databases, achieving higher classification accuracy while using fewer features.
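A GA-wrapper feature selection of this kind can be sketched compactly. The sketch below is a minimal, assumption-laden illustration: a tiny GA (truncation selection, uniform crossover, bit-flip mutation) over binary feature masks, with KNN cross-validated accuracy as the fitness, followed by an SVM on the winning subset. The population size, generation count, mutation rate, and stand-in dataset are all illustrative choices, not the paper's settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer  # UCI-style stand-in dataset
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_feat = X.shape[1]

def fitness(mask):
    # KNN cross-validated accuracy on the selected columns is the GA fitness.
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()

# Tiny GA over binary feature masks.
pop = rng.random((12, n_feat)) < 0.5          # random initial population
for gen in range(5):
    fits = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(fits)[::-1][:6]]  # truncation: keep the best half
    children = []
    for _ in range(6):
        a, b = parents[rng.integers(6, size=2)]
        child = np.where(rng.random(n_feat) < 0.5, a, b)  # uniform crossover
        flip = rng.random(n_feat) < 0.05                  # bit-flip mutation
        children.append(child ^ flip)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]

# Final classifier: SVM on the GA/KNN-selected feature subset.
acc = cross_val_score(SVC(), X[:, best], y, cv=3).mean()
print(f"SVM accuracy with {best.sum()} GA/KNN-selected features: {acc:.3f}")
```

The key design point the paper describes is the wrapper structure: the KNN classifier scores candidate subsets inside the GA loop, while the SVM is trained only once, on the final subset.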