2021
DOI: 10.5664/jcsm.9174

Interrater sleep stage scoring reliability between manual scoring from two European sleep centers and automatic scoring performed by the artificial intelligence–based Stanford-STAGES algorithm

Cited by 30 publications (31 citation statements) | References 37 publications
“…Therefore, it is not possible to evaluate possible differences between local and external database generalization using kappa as a reference. Very recently, however, the generalization of the same algorithm was evaluated on two additional external datasets, in this case reporting a combined average performance of κ = 0.61, almost in line with the reference human levels in the corresponding cohort (κ = 0.66) [23], but underperforming with respect to the original values reported in [21] (κ = 0.72–0.77).…”
Section: Analysis of Experimental Data (mentioning)
confidence: 91%
“…When considering the external dataset validation, Table 6 shows a general global decrease in the performance of the automatic methods with respect to the corresponding indices in the local database validation scenario. Specifically, in all the works that allow a comparison between local and external database generalization using the same algorithm [20, 23, 24, 62, 64], a decrease in performance is noticeable when testing on external independent datasets. This trend is consistent with the results of our experimentation, as well as with the data regarding human inter-rater agreement analyzed in Table 5.…”
Section: Analysis of Experimental Data (mentioning)
confidence: 99%
“…Sleep technicians must verify each epoch manually to perform sleep scoring, a process that is labour-intensive, time-consuming, and subject to inter-rater variability (25). Cohen's kappa (κ) is used to estimate the interrater reliability of manual sleep scoring, representing epoch-by-epoch agreement between scorers.…”
Section: Role of AI in Sleep Stage Classification (mentioning)
confidence: 99%
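
The statement above hinges on epoch-by-epoch kappa agreement between two scorers. As a minimal sketch (in Python) of how such a κ value is computed, the function and hypnograms below are illustrative assumptions, not data or code from the cited studies:

from collections import Counter

def cohen_kappa(scorer_a, scorer_b):
    # Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    # epoch-by-epoch agreement and p_e the agreement expected by chance
    # given each scorer's stage frequencies.
    n = len(scorer_a)
    p_o = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n
    freq_a, freq_b = Counter(scorer_a), Counter(scorer_b)
    p_e = sum(freq_a[s] * freq_b[s] for s in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical hypnograms: one AASM stage label per 30-second epoch.
manual_scorer = ["W", "N1", "N2", "N2", "N3", "N3", "N2", "R", "R", "W"]
auto_scorer   = ["W", "N2", "N2", "N2", "N3", "N2", "N2", "R", "R", "W"]
print(f"kappa = {cohen_kappa(manual_scorer, auto_scorer):.2f}")  # kappa = 0.73

In practice, sklearn.metrics.cohen_kappa_score computes the same quantity; values in the 0.6–0.8 range, as reported in the statements above, indicate substantial but imperfect agreement.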
“…Traditionally, more than one technician is involved in this process to avoid bias in marking sleep stages. The accuracy of sleep scoring depends on the expertise of the technicians (25). Although the use of PSG in clinical sleep medicine has significant benefits, its high cost is a barrier to accessibility for many populations.…”
Section: Introduction (mentioning)
confidence: 99%