Earlier detection of individuals at the highest risk of developing diabetes is crucial to avoid the disease's prevalence and progression. Therefore, we aim to build a data-driven predictive application for screening subjects at a high risk of developing Type 2 Diabetes mellitus (T2DM) in the western region of Saudi Arabia. In this context, we designed and implemented a questionnairebased cross-sectional study using conventional diabetes risk factors for studying the prevalence and the association between the outcomes and exposure (s). We used the Chi-Squared test and binary logistic regression to analyze and screen the most significant diabetes risk factor for T2DM risk prediction. Synthetic Minority Over-sampling Technique (SMOTE), a class-balancer, was used to balance the cross-sectional data. We used the balanced class data to screen the best performing classification algorithm to classify patients at high risk of diabetes with a higher F1 Score. The best performing classifier's hyper-parameters were further tuned using 10-fold cross-validation for achieving an improved F1 Score.Additionally, we validated our proposed model with the existing models built using the National Health and Nutrition Examination Survey (NHANES) dataset and Pima Indian Diabetes (PID) dataset. The results of the Chi-squared test and binary logistic regression showed that the exposures, namely Smoking, Healthy diet, Blood-Pressure (BP), Body Mass Index (BMI), Gender, and Region, contributed significantly (p < 0.05) to the prediction of the Response variable (subjects at high risk of diabetes). The tuned two-class Decision Forest (DF) model showed better performance with an average F1score of 0.8453 ± 0.0268. Moreover, the DF based model adapted reasonably well in different diabetes dataset. An Application Programming Interface (API) of the tuned DF model was implemented and deployed as a web service at https://type2-diabetes-riskpredictor.herokuapp.com, and the implementation codes are available at https://github.com/SAH-ML/T2DM-Risk-Predictor.
The fact that ensemble methods enhance the prediction performance. Therefore, we focused on developing a weighted ensemble method using a novel combination of Cerebrospinal Fluid (CSF) protein biomarkers to predict AD's earlier stages with greater accuracy than the stateof-the-art CSF protein biomarkers. In this regard, two feature selection methods, namely the Recursive Feature Elimination (RFE) and L1 regularization method were used to screen the most important subset of features for building a classification model using the Mild Cognitive Impairment (MCI) dataset. A novel combination of three biomarkers, namely Cystatin C, Matrix metalloproteinases (MMP10), and tau protein, was screened using the linear Support Vector Machine (SVM) and Logistic Regression (LR) classifier based RFE method. Two-tailed unpaired t-test analysis at a 5 % significance level showed a significant difference between the mean levels of Cystatin C, MMP10, and tau protein between cognitive normal and cognitively impaired groups. An ensemble model using a weighted average of two best performing classifiers (LR and Linear SVM) was created using a novel subset of three most informative features. Our ensemble model's weighted average results performed significantly better than LR and Linear SVM base classifiers' performance. The Receiver Operating Characteristic Curve (ROC_AUC) and Area under Precision-Recall values (AUPR) of our proposed model were observed to be 0.9799 ± 0.055 0.9108 ± 0.015, respectively. The performance of our proposed weighted averaged ensemble model built using a novel combination of CSF protein biomarkers was significantly better (p < 0.001) than models generated using different combinations of CSF protein biomarkers obtained from recent studies. An ensemble-learning based application was implemented and deployed at Heroku at https://appsalzheimer.herokuapp.com.
The high dimensionality and sparsity of the microarray gene expression data make it challenging to analyze and screen the optimal subset of genes as predictors of breast cancer (BC). The authors in the present study propose a novel hybrid Feature Selection (FS) sequential framework involving minimum Redundancy-Maximum Relevance (mRMR), a two-tailed unpaired t-test, and meta-heuristics to screen the most optimal set of gene biomarkers as predictors for BC. The proposed framework identified a set of three most optimal gene biomarkers, namely, MAPK 1, APOBEC3B, and ENAH. In addition, the state-of-the-art supervised Machine Learning (ML) algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Net (NN), Naïve Bayes (NB), Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), and Logistic Regression (LR) were used to test the predictive capability of the selected gene biomarkers and select the most effective breast cancer diagnostic model with higher values of performance matrices. Our study found that the XGBoost-based model was the superior performer with an accuracy of 0.976 ± 0.027, an F1-Score of 0.974 ± 0.030, and an AUC value of 0.961 ± 0.035 when tested on an independent test dataset. The screened gene biomarkers-based classification system efficiently detects primary breast tumors from normal breast samples.
Machine Translation, Information Retrieval and Knowledge Acquisition are the three main applications of Word Sense Disambiguation (WSD). The sense of a target word can be identified from a dictionary using a 'bag of words', i.e. neighbours of the target word. A target word has the same spelling of the word but with a different meaning, i.e. chair, light etc. In WSD, the key input sources are sentences and target words. But, instead of providing a target word, this should automatically be detected. If a sentence has more than one target word, then the filtration process will require further processing. In this study, the proposed framework, consisting of buzz words and query words has been developed to detect target words using the WordNet dictionary. Buzz words are defined as a 'bag-ofwords' using POS-Tags, and query words are those words having multiple meanings. The proposed framework will endeavor to find the sense of the detected target word using its gloss and with examples containing buzz words. This is a semi-supervised approach because 266 words of multiple meanings have been labelled from various sources and used based on an unsupervised approach to detect the target word and sense (meaning). After experimenting on a dataset consisting of 300 hotel reviews, 100 % of the target words for each sentence were detected with 84 % related to the sense of each sentence or phrase.
The increase in coronavirus disease 2019 (COVID-19) infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has placed pressure on healthcare services worldwide. Therefore, it is crucial to identify critical factors for the assessment of the severity of COVID-19 infection and the optimization of an individual treatment strategy. In this regard, the present study leverages a dataset of blood samples from 485 COVID-19 individuals in the region of Wuhan, China to identify essential blood biomarkers that predict the mortality of COVID-19 individuals. For this purpose, a hybrid of filter, statistical, and heuristic-based feature selection approach was used to select the best subset of informative features. As a result, minimum redundancy maximum relevance (mRMR), a two-tailed unpaired t-test, and whale optimization algorithm (WOA) were eventually selected as the three most informative blood biomarkers: International normalized ratio (INR), platelet large cell ratio (P-LCR), and D-dimer. In addition, various machine learning (ML) algorithms (random forest (RF), support vector machine (SVM), extreme gradient boosting (EGB), naïve Bayes (NB), logistic regression (LR), and k-nearest neighbor (KNN)) were trained. The performance of the trained models was compared to determine the model that assist in predicting the mortality of COVID-19 individuals with higher accuracy, F1 score, and area under the curve (AUC) values. In this paper, the best performing RF-based model built using the three most informative blood parameters predicts the mortality of COVID-19 individuals with an accuracy of 0.96 ± 0.062, F1 score of 0.96 ± 0.099, and AUC value of 0.98 ± 0.024, respectively on the independent test data. Furthermore, the performance of our proposed RF-based model in terms of accuracy, F1 score, and AUC was significantly better than the known blood biomarkers-based ML models built using the Pre_Surv_COVID_19 data. Therefore, the present study provides a novel hybrid approach to screen the most informative blood biomarkers to develop an RF-based model, which accurately and reliably predicts in-hospital mortality of confirmed COVID-19 individuals, during surge periods. An application based on our proposed model was implemented and deployed at Heroku.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.