Li Yan et al. reply

Li, Yan; Zhang, Haitao; Xiao, Yang; Wang, Maolin; Guo, Yuqi; Sun, Changhua; Tang, Xun; Cao, Zhiguo; Li, Shusheng; Xu, Hui; Cheng, Cheng; Jin, Junyang; Yuan, Ye

doi:10.1038/s42256-020-00251-5

Cited by 10 publications

(11 citation statements)

References 25 publications


“…“Internal” model performance on structurally similar, previously unseen data, gathered from the same source used for model training, can be contrasted with “external” model performance on new, previously unseen data from other sources. ML models perform worse in external cohorts due to several reasons such as different protocols, confounding variables, or heterogeneous populations (Cabitza et al., 2017; Zech et al., 2018; Martensson et al., 2020; Goncalves et al., 2021). Moreover, medical data can be biased by a variety of factors such as admission policies, hospital treatment protocols, country-specific guidelines, clinician discretion, healthcare economy, etc.…”

Section: Discussion (mentioning)

confidence: 99%

Application of convex hull analysis for the evaluation of data heterogeneity between patient populations of different origin and implications of hospital bias in downstream machine-learning-based data processing: A comparison of 4 critical-care patient datasets

et al. 2022


Machine learning (ML) models are developed on a learning dataset covering only a small part of the data of interest. If model predictions are accurate for the learning dataset but fail on unseen data, the generalization error is considered high. This problem manifests itself within all major sub-fields of ML but is especially relevant in medical applications. Clinical data structures, patient cohorts, and clinical protocols may be highly biased among hospitals, so sampling representative learning datasets for training ML models remains a challenge. Because ML models exhibit poor predictive performance over data ranges that are sparsely covered or not covered by the learning dataset, in this study we propose a novel method to assess their generalization capability among different hospitals based on the convex hull (CH) overlap between multivariate datasets. To reduce dimensionality effects, we used a two-step approach. First, CH analysis was applied to find the mean CH coverage between each pair of datasets, resulting in an upper bound of the prediction range. Second, 4 types of ML models were trained to classify the origin of a dataset (i.e., from which hospital) and to estimate differences in datasets with respect to underlying distributions. To demonstrate the applicability of our method, we used 4 critical-care patient datasets from different hospitals in Germany and the USA. We estimated the similarity of these populations and investigated whether ML models developed on one dataset can be reliably applied to another one. We show that the strongest drop in performance was associated with the poor intersection of convex hulls in the corresponding hospitals' datasets and with a high performance of ML methods for dataset discrimination. Hence, we suggest the application of our pipeline as a first tool to assess the transferability of trained models.
We emphasize that datasets from different hospitals represent heterogeneous data sources, and the transfer from one database to another should be performed with utmost care to avoid adverse consequences when the developed models are applied in the real world. Further research is needed to develop methods for the adaptation of ML models to new hospitals. In addition, more work should be aimed at the creation of gold-standard datasets that are large and diverse, with data from varied application sites.
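The first step of the abstract, the convex hull coverage between two datasets, can be illustrated with a small sketch. This is not the authors' pipeline: it assumes the data have already been reduced to two features, and the function names (convex_hull, hull_coverage, mean_hull_coverage), the toy coordinates, and the choice to average the two directional coverages are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def inside_hull(hull: Sequence[Point], p: Point) -> bool:
    """True if p lies inside or on the boundary of a counter-clockwise hull."""
    if len(hull) < 3:
        return False  # degenerate hull (point or segment) covers no area
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def hull_coverage(reference: Sequence[Point], other: Sequence[Point]) -> float:
    """Fraction of points in `other` that fall inside the hull of `reference`."""
    if not other:
        return 0.0
    hull = convex_hull(reference)
    return sum(inside_hull(hull, p) for p in other) / len(other)


def mean_hull_coverage(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Average of the two directional coverages (an assumed aggregation)."""
    return 0.5 * (hull_coverage(a, b) + hull_coverage(b, a))


# Toy two-feature "datasets" standing in for two hospitals (made-up values).
hospital_a = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
hospital_b = [(1.0, 1.0), (3.0, 3.0), (5.0, 5.0), (6.0, 1.0)]
print(hull_coverage(hospital_a, hospital_b), hull_coverage(hospital_b, hospital_a))
```

A low coverage in either direction would suggest that a model trained on one dataset is being asked to predict outside the range it has seen, which is the situation the abstract associates with the strongest performance drop. In higher dimensions a 2D polygon test no longer applies, so a real analysis would need a different hull or membership test.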


“…Therefore, there is now a strong need for new techniques and automated tools that can significantly help us prepare quality data. Data preparation can be more time-consuming than data mining, and can present challenges that are equal to, if not greater than, those of data mining [8]. In this section, we argue for the importance of data preparation in three respects: 1.…”

Section: The Importance of Data Preparation in Data Mining (unclassified)

Prinsip Klasifikasi Dan Data Mining Dengan Algoritma C4.5

Senubekti, Dewi

2022

NUANSA


The rapid growth and integration of databases give scientists, engineers, and business people vast new resources that can be analyzed to make scientific discoveries, optimize industrial systems, and uncover financially valuable patterns. To carry out these large data analysis projects, researchers and practitioners have adopted established algorithms from statistics, machine learning, neural networks, and databases, and have also developed new methods targeted at large data mining problems. Principles of Data Mining by David Hand, Heikki Mannila, and Padhraic Smyth gives practitioners and students an introduction to the wide range of algorithms and methodologies in this exciting area. This study uses the C4.5 algorithm. The interdisciplinary nature of the field suits these three authors, whose expertise spans statistics, databases, and computer science. The result is a book that not only provides the technical details and mathematical principles underlying data mining methods, but also offers a valuable perspective on the enterprise as a whole.


“…Few papers address this small-data issue, or the resulting imbalance of class sizes, making it unlikely that their results will generalize to the wider community. For example, because of the prevalence of data from China, many researchers train on small datasets from China when the model is intended for European populations, and recent research suggests such models are ineffective in practice (6). Differences between the training data and the target population, including patient phenotypes and data acquisition procedures, can all affect a model's generalisability (6).…”

Section: Systematic Errors In the Literature (mentioning)

confidence: 99%

“…For example, because of the prevalence of data from China, many researchers train on small datasets from China when the model is intended for European populations, and recent research suggests such models are ineffective in practice (6). Differences between the training data and the target population, including patient phenotypes and data acquisition procedures, can all affect a model's generalisability (6). Training generalisable models from small amounts of labeled data is a common problem in medical imaging, and techniques such as transfer learning, self- or semi-supervised learning, and parameter pruning can ameliorate this issue (7,8).…”

Section: Systematic Errors In the Literature (mentioning)

confidence: 99%