Replica analysis of overfitting in generalized linear regression models

Coolen, A. C. C.; Sheikh, Mansoor; Mozeika, Alexander; López, Fabián Aguirre; Antenucci, Fabrizio

doi:10.1088/1751-8121/aba028

Cited by 13 publications

(34 citation statements)

References 93 publications

(134 reference statements)

“…This may be due to the redundant information generated by the accumulation of features from multiple growth stages. In addition, the excessive dimensionality of input features also poses the risk of overfitting the machine learning model (Feng et al, 2017; Coolen et al, 2020). Among the combinations of EMF, the prediction accuracy of C2 was comparable to a combination with the highest prediction accuracy of C4.…”

Section: Discussion (mentioning)

confidence: 99%

Entropy Weight Ensemble Framework for Yield Prediction of Winter Wheat Under Different Water Stress Treatments Using Unmanned Aerial Vehicle-Based Multispectral and Thermal Data

Fei, Hassan, et al. 2021. Front. Plant Sci.

Crop breeding programs generally perform early field assessments of candidate selection based on primary traits such as grain yield (GY). The traditional methods of yield assessment are costly, inefficient, and considered a bottleneck in modern precision agriculture. Recent advances in an unmanned aerial vehicle (UAV) and development of sensors have opened a new avenue for data acquisition cost-effectively and rapidly. We evaluated UAV-based multispectral and thermal images for in-season GY prediction using 30 winter wheat genotypes under 3 water treatments. For this, multispectral vegetation indices (VIs) and normalized relative canopy temperature (NRCT) were calculated and selected by the gray relational analysis (GRA) at each growth stage, i.e., jointing, booting, heading, flowering, grain filling, and maturity to reduce the data dimension. The elastic net regression (ENR) was developed by using selected features as input variables for yield prediction, whereas the entropy weight fusion (EWF) method was used to combine the predicted GY values from multiple growth stages. In our results, the fusion of dual-sensor data showed high yield prediction accuracy [coefficient of determination (R²) = 0.527–0.667] compared to using a single multispectral sensor (R² = 0.130–0.461). Results showed that the grain filling stage was the optimal stage to predict GY with R² = 0.667, root mean square error (RMSE) = 0.881 t ha⁻¹, relative root-mean-square error (RRMSE) = 15.2%, and mean absolute error (MAE) = 0.721 t ha⁻¹. The EWF model outperformed at all the individual growth stages with R² varying from 0.677 to 0.729. The best prediction result (R² = 0.729, RMSE = 0.831 t ha⁻¹, RRMSE = 14.3%, and MAE = 0.684 t ha⁻¹) was achieved through combining the predicted values of all growth stages. This study suggests that the fusion of UAV-based multispectral and thermal IR data within an ENR-EWF framework can provide a precise and robust prediction of wheat yield.

“…The platform supports the deployment of Interpretable Artificial Intelligence (IAI) and Bayesian inference methods for rapid and scalable risk stratification of prostate cancer. These algorithms will include novel findings around overfitting of data (Coolen et al, 2017; Coolen et al, 2020) and latent class models (Rowley et al, 2017) which will help us to stratify patients more correctly.…”

Section: Methods (mentioning)

confidence: 99%

“…To address these combined challenges of high data dimensionality, covariate disparity, and latent cohort heterogeneity, we build a data analytics pipeline (based on the libraries underlying the SaddlePoint-Signature and SaddlePoint-Mosaics software packages https://www.saddlepointscience.com/) which combine cross-validation protocols, optimisation tools for covariate selection, and modern mathematical techniques with which to “decontaminate” regression outcomes for the effects of overfitting [see e.g. (Coolen et al, 2017; Sheikh and Coolen, 2019; Coolen et al, 2020)], with the use of modality-specific “meta-covariates”. The latter are personalised and optimised modality-specific risk scores (decontaminated for overfitting), which are subsequently used as integrated digital biomarkers that capture the relevant predictive information in each of the data sources.…”

Section: Methods (mentioning)

confidence: 99%

The ReIMAGINE Multimodal Warehouse: Using Artificial Intelligence for Accurate Risk Stratification of Prostate Cancer

Santaolalla, Hulsen, Davis, et al. 2021. Front. Artif. Intell.

Introduction. Prostate cancer (PCa) is the most frequent cancer diagnosis in men worldwide. Our ability to identify those men whose cancer will decrease their lifespan and/or quality of life remains poor. The ReIMAGINE Consortium has been established to improve PCa diagnosis. Materials and methods. MRI will likely become the future cornerstone of the risk-stratification process for men at risk of early prostate cancer. We will, for the first time, be able to combine the underlying molecular changes in PCa with the state-of-the-art imaging. ReIMAGINE Screening invites men for MRI and PSA evaluation. ReIMAGINE Risk includes men at risk of prostate cancer based on MRI, and includes biomarker testing. Results. Baseline clinical information, genomics, blood, urine, fresh prostate tissue samples, digital pathology and radiomics data will be analysed. Data will be de-identified, stored with correlated mpMRI disease endotypes and linked with long term follow-up outcomes in an instance of the Philips Clinical Data Lake, consisting of cloud-based software. The ReIMAGINE platform includes application programming interfaces and a user interface that allows users to browse data, select cohorts, manage users and access rights, query data, and more. Connection to analytics tools such as Python allows statistical and stratification method pipelines to run profiling regression analyses. Discussion. The ReIMAGINE Multimodal Warehouse comprises a unique data source for PCa research, to improve risk stratification for PCa and inform clinical practice. The de-identified dataset characterized by clinical, imaging, genomics and digital pathology PCa patient phenotypes will be a valuable resource for the scientific and medical community.

“…The study by A. C. C. Coolen entitled "Replica analysis of overfitting in generalized linear regression models" presents a derivation that depends only on the generalized linear form of the GLM and on the choice of an L2 prior. The replica calculation therefore need not be repeated for each new GLM instance; as usual, the replica method serves as a relatively painless and elegant yet powerful vehicle for arriving at a closed set of order parameter equations, together with formulae expressing the relationship between the ML/MAP parameter estimators and the true values of these parameters [13].…”

Section: Introduction (unclassified)

Analisis Performansi Algoritma Linear Regression dengan Generalized Linear Model untuk Prediksi Penjualan pada Usaha Mikra, Kecil, dan Menengah

Hamdanah, Fitrianah. 2021. j. nas. pendidik. teknik. inform.

Sales are an absolute requirement for the survival of a business, because sales generate profit. Linear Regression and the Generalized Linear Model are approaches supported by the calculation of RMSE. RMSE (Root Mean Square Error) measures the magnitude of the error in predicted values: the smaller the RMSE (the closer to 0), the more accurate the prediction. In every micro, small, and medium enterprise (UMKM), daily transactions and customer service activity keep increasing over time, so that, without being noticed, this produces an ever-growing pile of data. UMKM usually offer several different items to the market at different prices, but not all goods are in high demand. Sales success determines the sustainability of the UMKM itself. This study compares the Linear Regression algorithm with the Generalized Linear Model, applied to previously entered sales data, to produce sales predictions for the following year. The results show that the Linear Regression algorithm gives RMSE, MSE, and MAPE values of 1.983, 3.933, and 1.518, whereas the Generalized Linear Model gives RMSE, MSE, and MAPE values of 4.827, 23.295, and 3.882. Based on the predictions from both algorithms, it can be concluded that the Linear Regression algorithm performs best, because its RMSE is the smallest.
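The three error measures named in this abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of MSE, RMSE, and MAPE, which may differ in detail from what the cited study computed; the function name `regression_errors` and the sample numbers are hypothetical and are not taken from that study.

```python
import math


def regression_errors(actual, predicted):
    """Return MSE, RMSE and MAPE (in percent) for paired observations.

    Assumption: every value in `actual` is non-zero, because MAPE divides
    each residual by the corresponding actual value.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [p - a for a, p in zip(actual, predicted)]
    n = len(residuals)

    mse = sum(r * r for r in residuals) / n   # mean squared error
    rmse = math.sqrt(mse)                     # same units as the target
    mape = 100.0 * sum(abs(r / a) for r, a in zip(residuals, actual)) / n

    return {"mse": mse, "rmse": rmse, "mape": mape}


# Hypothetical sales figures, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
errors = regression_errors(actual, predicted)
print(errors)
```

Because RMSE is by definition the square root of MSE, a reported pair can be checked for consistency: the abstract's figures 1.983 and 3.933, and 4.827 and 23.295, agree with that relation up to rounding.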
