Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six widely-used models in the medical field, including support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), adaboost (AB), and naïve Bayes (NB) on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyze the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depend on how much a dataset represents the original distribution rather than its size. Moreover, we found that the most robust model for limited medical data is AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a robust machine learning model to limited dataset does not necessary imply that it provides the best performance compared to other models.
With the increasing demand for information security and security regulations all over the world, biometric recognition technology has been widely used in our everyday life. In this regard, multimodal biometrics technology has gained interest and became popular due to its ability to overcome a number of significant limitations of unimodal biometric systems. In this paper, a new multimodal biometric human identification system is proposed, which is based on a deep learning algorithm for recognizing humans using biometric modalities of iris, face, and finger vein. The structure of the system is based on convolutional neural networks (CNNs) which extract features and classify images by softmax classifier. To develop the system, three CNN models were combined; one for iris, one for face, and one for finger vein. In order to build the CNN model, the famous pertained model VGG-16 was used, the Adam optimization method was applied and categorical cross-entropy was used as a loss function. Some techniques to avoid overfitting were applied, such as image augmentation and dropout techniques. For fusing the CNN models, different fusion approaches were employed to explore the influence of fusion approaches on recognition performance, therefore, feature and score level fusion approaches were applied. The performance of the proposed system was empirically evaluated by conducting several experiments on the SDUMLA-HMT dataset, which is a multimodal biometrics dataset. The obtained results demonstrated that using three biometric traits in biometric identification systems obtained better results than using two or one biometric traits. The results also showed that our approach comfortably outperformed other state-of-the-art methods by achieving an accuracy of 99.39%, with a feature level fusion approach and an accuracy of 100% with different methods of score level fusion.
Fake news detection (FND) involves predicting the likelihood that a particular news article (news report, editorial, expose, etc.) is intentionally deceptive. Arabic FND started to receive more attention in the last decade, and many detection approaches demonstrated some ability to detect fake news on multiple datasets. However, most existing approaches do not consider recent advances in natural language processing, i.e., the use of neural networks and transformers. This paper presents a comprehensive comparative study of neural network and transformer-based language models used for Arabic FND. We examine the use of neural networks and transformer-based language models for Arabic FND and show their performance compared to each other. We also conduct an extensive analysis of the possible reasons for the difference in performance results obtained by different approaches. The results demonstrate that transformer-based models outperform the neural network-based solutions, which led to an increase in the F1 score from 0.83 (best neural network-based model, GRU) to 0.95 (best transformer-based model, QARiB), and it boosted the accuracy by 16% compared to the best in neural network-based solutions. Finally, we highlight the main gaps in Arabic FND research and suggest future research directions.
Recent advances in information technology have induced an explosive growth of data, creating a new era of big data. Unfortunately, traditional machine-learning algorithms cannot cope with the new characteristics of big data. In this paper, we address the problem of breast cancer prediction in the big data context. We considered two varieties of data, namely, gene expression (GE) and DNA methylation (DM). The objective of this paper is to scale up the machine-learning algorithms that are used for classification by applying each dataset separately and jointly. For this purpose, we chose Apache Spark as a platform. In this paper, we selected three different classification algorithms, namely, support vector machine (SVM), decision tree, and random forest, to create nine models that help in predicting breast cancer. We conducted a comprehensive comparative study using three scenarios with the GE, DM, and GE and DM combined, in order to show which of the three types of data would produce the best result in terms of accuracy and error rate. Moreover, we performed an experimental comparison between two platforms (Spark and Weka) in order to show their behavior when dealing with large sets of data. The experimental results showed that the scaled SVM classifier in the Spark environment outperforms the other classifiers, as it achieved the highest accuracy and the lowest error rate with the GE dataset.
Background The massive scale of social media platforms requires an automatic solution for detecting hate speech. These automatic solutions will help reduce the need for manual analysis of content. Most previous literature has cast the hate speech detection problem as a supervised text classification task using classical machine learning methods or, more recently, deep learning methods. However, work investigating this problem in Arabic cyberspace is still limited compared to the published work on English text. Objective This study aims to identify hate speech related to the COVID-19 pandemic posted by Twitter users in the Arab region and to discover the main issues discussed in tweets containing hate speech. Methods We used the ArCOV-19 dataset, an ongoing collection of Arabic tweets related to COVID-19, starting from January 27, 2020. Tweets were analyzed for hate speech using a pretrained convolutional neural network (CNN) model; each tweet was given a score between 0 and 1, with 1 being the most hateful text. We also used nonnegative matrix factorization to discover the main issues and topics discussed in hate tweets. Results The analysis of hate speech in Twitter data in the Arab region identified that the number of non–hate tweets greatly exceeded the number of hate tweets, where the percentage of hate tweets among COVID-19 related tweets was 3.2% (11,743/547,554). The analysis also revealed that the majority of hate tweets (8385/11,743, 71.4%) contained a low level of hate based on the score provided by the CNN. This study identified Saudi Arabia as the Arab country from which the most COVID-19 hate tweets originated during the pandemic. Furthermore, we showed that the largest number of hate tweets appeared during the time period of March 1-30, 2020, representing 51.9% of all hate tweets (6095/11,743). Contrary to what was anticipated, in the Arab region, it was found that the spread of COVID-19–related hate speech on Twitter was weakly related with the dissemination of the pandemic based on the Pearson correlation coefficient (r=0.1982, P=.50). The study also identified the commonly discussed topics in hate tweets during the pandemic. Analysis of the 7 extracted topics showed that 6 of the 7 identified topics were related to hate speech against China and Iran. Arab users also discussed topics related to political conflicts in the Arab region during the COVID-19 pandemic. Conclusions The COVID-19 pandemic poses serious public health challenges to nations worldwide. During the COVID-19 pandemic, frequent use of social media can contribute to the spread of hate speech. Hate speech on the web can have a negative impact on society, and hate speech may have a direct correlation with real hate crimes, which increases the threat associated with being targeted by hate speech and abusive language. This study is the first to analyze hate speech in the context of Arabic COVID-19–related tweets in the Arab region.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.