Training a machine-learning model on records with many attributes is often problematic. As the number of features grows, the model becomes increasingly susceptible to overfitting, since not all features are informative; some merely add noise to the data. Dimensionality reduction techniques are used to overcome this problem. This paper presents a detailed comparative study of nine dimensionality reduction methods: missing-values ratio, low-variance filter, high-correlation filter, random forest, principal component analysis, linear discriminant analysis, backward feature elimination, forward feature construction, and rough set theory. The effects of these methods on both training and testing performance were compared on two datasets and three models: an Artificial Neural Network (ANN), a Support Vector Machine (SVM), and a Random Forest Classifier (RFC). The results show that the RFC model achieved dimensionality reduction while limiting overfitting, and it showed general improvements in both accuracy and efficiency over the compared approaches. Overall, the results reveal that dimensionality reduction can reduce overfitting while keeping performance close to, or better than, that obtained with the original feature set.
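As a concrete illustration of the filter-style methods in this comparison, the sketch below applies a low-variance filter using scikit-learn's VarianceThreshold; the threshold value and toy data are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the low-variance filter, one of the nine methods
# compared in the paper. Threshold and data are illustrative only.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.0, 2.1, 1.0],
    [0.0, 1.9, 0.0],
    [0.0, 2.0, 1.0],
    [0.0, 2.2, 0.0],
])  # first column is constant, so it carries no information

selector = VarianceThreshold(threshold=0.01)  # drop features with variance below 0.01
X_reduced = selector.fit_transform(X)
print(selector.get_support())  # mask of retained features: [False  True  True]
print(X_reduced.shape)         # (4, 2): the constant column was removed
```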
Real-world data analysis and processing with data mining techniques often face observations that contain missing values; their presence is a central challenge in mining datasets. Missing values should be imputed to improve the accuracy and performance of data mining methods. Existing techniques that impute missing values with the k-nearest neighbors algorithm face the challenge of determining an appropriate k, while other imputation techniques rely on hard clustering algorithms, which often describe poorly separated records, as in the case of missing data, inadequately. In general, imputation based on similar records is more accurate than imputation based on all records in the dataset, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing data imputation methods. First, a hybrid method called KI is proposed, which combines the k-nearest neighbors (kNN) and iterative imputation algorithms. The best set of nearest neighbors for each incomplete record is found through record similarity using kNN, with a suitable k estimated automatically to improve that similarity. Iterative imputation then fills the missing values of the incomplete records using the global correlation structure among the selected records. Second, an enhanced hybrid method called FCKI, an extension of KI, is proposed. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation to impute missing data. Fuzzy c-means is chosen because records can belong to multiple clusters at the same time, which can further improve similarity; FCKI searches within a cluster, rather than the whole dataset, for the best k nearest neighbors, applying two levels of similarity to achieve higher imputation accuracy. The performance of the proposed techniques is assessed on fifteen datasets of different sizes, with varying missing ratios and three types of missing data (MCAR, MAR, and MNAR) generated in this work. The proposed techniques are compared with other missing data imputation methods using three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
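The paper's KI and FCKI pipelines are not reproduced here; the sketch below only illustrates their two building blocks, kNN-based and iterative imputation, using scikit-learn's KNNImputer and IterativeImputer. The fixed k and random data are assumptions: KI estimates k automatically, and FCKI additionally restricts the neighbor search to a fuzzy c-means cluster.

```python
# Illustrative sketch of KI's two building blocks (not the paper's code):
# kNN imputation exploits similarity among records, and iterative
# imputation exploits the global correlation structure among features.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% values missing (MCAR-style)

# Step 1: fill each gap from the k most similar records (k fixed here).
knn_filled = KNNImputer(n_neighbors=5).fit_transform(X)

# Step 2: model each feature from the others and refine iteratively.
iter_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```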
The success of any e-learning system depends on retrieving learning content relevant to the requirements of the learner (user). This motivates adaptive e-learning systems that provide learning materials suited to each learner's requirements and level of understanding. This chapter proposes a system for personalized semantic search and recommendation of learning content in Web-based e-learning systems. Semantic, personalized search of learning content is based on expanding the query keywords using the semantic relations and reasoning mechanisms in the ontology, while personalized recommendation of learning objects is based on a learner-profile ontology that guides which learning content a learner should study. For the Arab world to achieve learning-for-all goals and meet learners' requirements, learning systems must become more inclusive, incorporate personalization services, and offer semantic learning content. The authors' proposed system is efficient, more effective, and more learner-friendly in the Arab sector because it responds to each learner's individual needs with timely and precise adaptation of learning materials.
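To make the query-expansion idea concrete, the toy sketch below expands keywords through a hypothetical relation map; this stands in for the chapter's ontology, where a real system would traverse semantic relations (synonyms, narrower terms) in an OWL/RDF ontology via a reasoner.

```python
# Toy illustration of ontology-based query expansion. The relation map
# is a hypothetical stand-in for the chapter's e-learning ontology.
ONTOLOGY = {
    "sorting": {"synonyms": ["ordering"], "narrower": ["quicksort", "mergesort"]},
    "recursion": {"synonyms": ["self-reference"], "narrower": ["tail recursion"]},
}

def expand_query(keywords):
    """Expand each keyword with its semantically related terms."""
    expanded = []
    for kw in keywords:
        expanded.append(kw)
        relations = ONTOLOGY.get(kw.lower(), {})
        expanded.extend(relations.get("synonyms", []))
        expanded.extend(relations.get("narrower", []))
    return expanded

print(expand_query(["sorting"]))  # ['sorting', 'ordering', 'quicksort', 'mergesort']
```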
Nowadays, an unprecedented number of users interact through social media platforms and, with the explosion of online communication, generate a massive amount of content. Because user-generated content is unregulated, however, it may contain offensive material such as fake news, insults, and harassment. Identifying fake news and rumors and their dissemination on social media has become a critical requirement, as they adversely affect users, businesses, enterprises, and even political regimes and governments. Prior state-of-the-art work has focused on English-language news and feature-based algorithms. This paper proposes a model architecture to detect fake news in the Arabic language using only textual features. Both machine learning and deep learning algorithms were used; the deep learning models are based on convolutional neural networks (CNN), long short-term memory (LSTM), bidirectional LSTM (BiLSTM), CNN+LSTM, and CNN+BiLSTM. Three datasets containing the textual content of Arabic news articles were used in the experiments, one of them consisting of real-life data. The results indicate that the BiLSTM model outperforms the other models in accuracy under both simple data-split and recursive training modes.
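The sketch below shows a minimal Keras BiLSTM classifier of the kind this paper evaluates; the vocabulary size, sequence length, and hyperparameters are illustrative assumptions, not the paper's settings, and tokenization of the Arabic text is assumed to happen upstream.

```python
# Minimal sketch of a BiLSTM binary classifier for fake-news detection.
# All sizes and hyperparameters are illustrative, not the paper's values.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20_000  # assumed tokenizer vocabulary size
MAX_LEN = 200        # assumed padded sequence length

model = tf.keras.Sequential([
    layers.Input(shape=(MAX_LEN,)),          # integer-encoded token IDs
    layers.Embedding(VOCAB_SIZE, 128),       # learned word embeddings
    layers.Bidirectional(layers.LSTM(64)),   # reads the text in both directions
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),   # fake (1) vs. real (0)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```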
Biometric identification is a strong candidate technology for trusted user authentication with minimal constraints on the security of the access point. However, most biometric identification techniques require special hardware, which complicates the access point and makes it costly. Keystroke recognition is a biometric identification technique that relies on the user's behavior while typing on the keyboard; it is more secure and requires no additional hardware at the access point. This paper presents a behavioral biometric authentication method that identifies the user through Keystroke Static Authentication (KSA) and describes an authentication system demonstrating the ability of the keystroke technique to authenticate the user against a template profile stored in the database. An algorithm based on dynamic keystroke analysis is also presented, synthesized, simulated, and implemented on a Field Programmable Gate Array (FPGA). The proposed algorithm was tested on 25 individuals, achieving a False Rejection Rate (FRR) of about 4% and a False Acceptance Rate (FAR) of about 0%, with the same sample text used for all individuals. Two methods are used to implement the proposed approach: method one (a hardware-based sorter) and method two (a software-based sorter), which achieved execution times on the FPGA board of about 50.653 ns and 9.650 ns, respectively. Method two achieved a lower execution time than some published results; since it combines a small execution time with low area utilization, it is the preferred method for the FPGA implementation.
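The paper's system is an FPGA design, so the Python sketch below is only a conceptual illustration of the timing features keystroke dynamics typically rely on (dwell and flight times); the event format and timestamps are assumptions, and the comparison against a stored template profile is omitted.

```python
# Conceptual sketch of keystroke-dynamics feature extraction. The
# paper's implementation is in FPGA hardware; this only illustrates
# the timing features such a system could match against a template.
def keystroke_features(events):
    """events: list of (key, press_time_ms, release_time_ms), in typing order."""
    dwell = [rel - press for _, press, rel in events]   # how long each key is held
    flight = [events[i + 1][1] - events[i][2]           # release-to-next-press gaps
              for i in range(len(events) - 1)]
    return dwell, flight

# Example: a user typing "abc" with assumed timestamps in milliseconds.
sample = [("a", 0, 90), ("b", 150, 230), ("c", 300, 395)]
print(keystroke_features(sample))  # ([90, 80, 95], [60, 70])
```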