Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Aldrich, Chris

doi:10.3390/min10050420

Cited by 42 publications

(15 citation statements)

References 32 publications


“…Furthermore, the Naive Bayes and Random Forest classifiers were selected to measure the importance of the features. Random Forest creates a forest of trees, and per tree measures a candidate feature’s ability to optimally split the instances into two classes using the Gini impurity [55]. Naive Bayes calculates the probability of each feature in order to evaluate their performance at predicting the output variable.…”

Section: Materials and Methods (mentioning)

confidence: 99%

Data-Driven Machine-Learning Methods for Diabetes Risk Prediction

Δρίτσας

Trigka

2022

Sensors


Diabetes mellitus is a chronic condition characterized by a disturbance in the metabolism of carbohydrates, fats and proteins. The most characteristic disorder in all forms of diabetes is hyperglycemia, i.e., elevated blood sugar levels. The modern way of life has significantly increased the incidence of diabetes. Therefore, early diagnosis of the disease is a necessity. Machine Learning (ML) has gained great popularity among healthcare providers and physicians due to its high potential in developing efficient tools for risk prediction, prognosis, treatment and the management of various conditions. In this study, a supervised learning methodology is described that aims to create risk prediction tools with high efficiency for type 2 diabetes occurrence. A features analysis is conducted to evaluate their importance and explore their association with diabetes. These features are the most common symptoms that often develop slowly with diabetes, and they are utilized to train and test several ML models. Various ML models are evaluated in terms of the Precision, Recall, F-Measure, Accuracy and AUC metrics and compared under 10-fold cross-validation and data splitting. Both validation methods highlighted Random Forest and K-NN as the best performing models in comparison to the other models.
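The citing statement above says that Random Forest scores a candidate feature by how well it splits instances according to the Gini impurity. As a minimal sketch of that idea (not the cited paper's implementation), the following computes the Gini impurity of a label set and the impurity decrease from one threshold split. The function names `gini_impurity` and `split_gini_decrease` and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting on values <= threshold.

    Assumes `values` and `labels` are non-empty and of equal length.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Child impurities are weighted by the share of samples in each child.
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


# Hypothetical toy data: one feature that separates the two classes perfectly.
feature = [1, 2, 3, 4]
target = [0, 0, 1, 1]
decrease = split_gini_decrease(feature, target, threshold=2)
```

In a forest, decreases of this kind are typically accumulated over every split that uses a given feature and averaged over trees, which yields the Gini importance mentioned in the statement.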


“…All data are represented by the variable 𝑁, and the selected data are represented by 𝑛. In addition, the variable 𝑝ᵢ represents the square of the ratio of the selected data to the number of elements smaller than and larger than itself [16].…”

Section: Random Forest Method (unclassified)

Web Sitelerinde Gerçekleştirilen Oltalama Saldırılarının Yapay Zekâ Yaklaşımı ile Tespiti [Detection of Phishing Attacks Carried Out on Websites Using an Artificial Intelligence Approach]

Toğaçar

2021

Bitlis Eren Üniversitesi Fen Bilimleri Dergisi


Abstract: Phishing refers to software-based attacks aimed at stealing personal information over the internet. Phishing attacks generally aim to capture private information such as identity details, user passwords, and credit or debit card details. Website applications or e-mail systems containing special software code are usually the preferred medium for this. The enticing visual or text-based messages delivered through such web applications bait individuals and enable the attacks to be carried out. In the internet environment, where billions of people interact, taking timely precautions against such attacks requires keeping pace with technological developments. Recently, artificial intelligence technologies have made a name for themselves in the field of internet security. In this study, more than 11 thousand websites were examined with machine learning methods, and the websites carrying out phishing attacks were detected. The dataset consists of 30 web parameters and is open access. Using machine learning methods, 30 features were examined for each website, and websites that carry out phishing attacks were classified against those that do not. As a result, the best test accuracy, 96.53%, was achieved with the Random Forest method.


“…However, the Gini importance has a drawback. It is known to be biased towards input variables with continuous and discrete variable with high cardinality (Zhou and Hooker 2021; Aldrich 2020; Gómez-Ramírez et al 2020), as these variables provide high possibilities for tree splitting. To address this issue, Lundberg and Lee (2017) propose a method that is based on Shapley values (Hur et al 2017; Aldrich 2020).…”

Section: Post-modeling Analysis (mentioning)

confidence: 99%

“…It is known to be biased towards input variables with continuous and discrete variable with high cardinality (Zhou and Hooker 2021; Aldrich 2020; Gómez-Ramírez et al 2020), as these variables provide high possibilities for tree splitting. To address this issue, Lundberg and Lee (2017) propose a method that is based on Shapley values (Hur et al 2017; Aldrich 2020). Stemming from game theory, Shapley values provide a theoretically justified way to fairly allocate a coalition’s output among members in the coalition (Shapley 1953).…”

Section: Post-modeling Analysis (mentioning)

confidence: 99%


Modeling household online shopping demand in the U.S.: a machine learning approach and comparative investigation between 2009 and 2017

et al. 2021


Despite the rapid growth of online shopping and research interest in the relationship between online and in-store shopping, national-level modeling and investigation of the demand for online shopping with a prediction focus remain limited in the literature. This paper differs from prior work and leverages two recent releases of the U.S. National Household Travel Survey (NHTS) data for 2009 and 2017 to develop machine learning (ML) models, specifically gradient boosting machine (GBM), for predicting household-level online shopping purchases. The NHTS data allow for not only conducting nationwide investigation but also at the level of households, which is more appropriate than at the individual level given the connected consumption and shopping needs of members in a household. We follow a systematic procedure for model development including employing Recursive Feature Elimination algorithm to select input variables (features) in order to reduce the risk of model overfitting and increase model explainability. Among several ML models, GBM is found to yield the best prediction accuracy. Extensive post-modeling investigation is conducted in a comparative manner between 2009 and 2017, including quantifying the importance of each input variable in predicting online shopping demand, and characterizing value-dependent relationships between demand and the input variables. In doing so, two latest advances in machine learning techniques, namely Shapley value-based feature importance and Accumulated Local Effects plots, are adopted to overcome inherent drawbacks of the popular techniques in current ML modeling. The modeling and investigation are performed at the national level, with a number of findings obtained. The models developed and insights gained can be used for online shopping-related freight demand generation and may also be considered for evaluating the potential impact of relevant policies on online shopping demand.
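The citing statements above describe Shapley values as a way to fairly allocate a coalition's output among its members. As a rough illustration of that allocation rule (not the feature-importance procedure used in the cited work), the sketch below computes exact Shapley values for a tiny cooperative game by averaging each player's marginal contribution over all join orders. The function name `shapley_values`, the player names, and the payoff table are hypothetical.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players.

    `value` maps a frozenset of players to that coalition's payoff.
    The cost grows factorially, so this only suits very small games.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical two-player game; the payoffs are invented for illustration.
payoffs = {
    frozenset(): 0.0,
    frozenset({"a"}): 10.0,
    frozenset({"b"}): 20.0,
    frozenset({"a", "b"}): 50.0,
}
phi = shapley_values(["a", "b"], lambda coalition: payoffs[coalition])
```

The resulting shares sum to the payoff of the full coalition. In the feature-importance setting the players would be input variables and the payoff a measure of model output or fit, and practical tools rely on approximations because exact enumeration becomes infeasible beyond a handful of variables.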
