Machine learning (ML) is becoming a standard tool in neuroscience and neuroimaging research. Yet, because it is such a powerful tool, the appropriate application of ML requires a sound understanding of its subtleties and limitations. In particular, applying ML to datasets with imbalanced classes, which are very common in neuroscience, can have severe consequences if not adequately addressed. With the neuroscience machine-learning user in mind, this technical note provides a didactic overview of the class imbalance problem and illustrates its impact through systematic manipulation of class imbalance ratios in both simulated data and real electroencephalography (EEG) and magnetoencephalography (MEG) brain data. Our results illustrate how, in highly imbalanced data, the commonly used Accuracy (Acc) metric yields misleadingly high performance by preferentially predicting the majority class, while other evaluation metrics (e.g. Balanced Accuracy (BAcc) and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC)) may still provide reliable performance evaluations. In terms of classifiers and cross-validation schemes, our results highlight the greater robustness of Random Forest (RF) classifiers and Stratified K-Fold cross-validation, compared to the other approaches tested. Critically, for neuroscience ML applications that seek to minimize overall classification error (not preferentially that of a single class), we recommend the routine use of BAcc rather than the simpler and more commonly used Acc metric. Importantly, we provide a list of best-practice recommendations for dealing with imbalanced data, and open-source code to allow the neuroscience community to replicate our observations and further explore best practices in handling imbalanced data.
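The core point of the abstract above can be made concrete in a few lines: on imbalanced data, a degenerate classifier that always predicts the majority class scores highly on Accuracy but only at chance on Balanced Accuracy. A minimal pure-Python sketch, using a hypothetical 90/10 class split (the data and split ratio are illustrative, not from the paper):

```python
# Illustrative sketch: Accuracy vs Balanced Accuracy on imbalanced data.
# BAcc is the mean of per-class recalls (sensitivity and specificity in
# the binary case), so an "always predict the majority" classifier
# cannot inflate it.

def accuracy(y_true, y_pred):
    # Fraction of all samples predicted correctly.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recalls: each class contributes equally,
    # regardless of how many samples it has.
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical 90/10 imbalance: 90 majority-class (0), 10 minority-class (1).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # degenerate classifier: always predicts the majority class

print(accuracy(y_true, y_pred))           # 0.9 -- looks deceptively good
print(balanced_accuracy(y_true, y_pred))  # 0.5 -- chance level, as it should be
```

In practice one would use a library implementation (e.g. a balanced-accuracy scorer) rather than hand-rolled metrics; the sketch only shows why the two numbers diverge.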
Neuroimaging data analysis often requires purpose-built software, which can be challenging to install and may produce different results across computing environments. Beyond being a roadblock to neuroscientists, these issues of accessibility and portability can hamper the reproducibility of neuroimaging data analysis pipelines. Here, we introduce the Neurodesk platform, which harnesses software containers to support a comprehensive and growing suite of neuroimaging software (https://www.neurodesk.org/). Neurodesk includes a browser-accessible virtual desktop environment and a command line interface, mediating access to containerized neuroimaging software libraries on various computing platforms, including personal and high-performance computers, cloud computing and Jupyter Notebooks. This community-oriented, open-source platform enables a paradigm shift for neuroimaging data analysis, allowing for accessible, flexible, fully reproducible, and portable data analysis pipelines.
In quantitative electroencephalography, it is of vital importance to eliminate non-neural components, as these can lead to an erroneous analysis of the acquired signals, limiting their use in diagnosis and other clinical applications. In light of this drawback, preprocessing pipelines based on the joint use of the Wavelet Transform and Independent Component Analysis (wICA) were proposed in the 2000s. Recently, with the advent of data-driven methods, deep learning models have been developed for the automatic labeling of independent components, which constitutes an opportunity for the optimization of ICA-based techniques. In this paper, ICLabel, one of these deep learning models, was added to the wICA methodology in order to explore its improvement. To assess the usefulness of this approach, it was compared with pipelines that use wICA alone, ICLabel alone, or neither. The impact of each pipeline was measured by its capacity to highlight known statistical differences between asymptomatic carriers of the PSEN-1 E280A mutation and a healthy control group. Specifically, between-group effect sizes and P-values were calculated to compare the pipelines. The results show that using ICLabel for artifact removal can improve the effect size (ES) and that, by combining it with wICA, an artifact smoothing approach that is less prone to the loss of neural information can be built.
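The pipeline comparison above hinges on a between-group effect size. A common choice is Cohen's d with a pooled standard deviation; a minimal sketch follows, where the group values are made up for illustration (the paper's actual features and numbers are not reproduced here):

```python
# Illustrative sketch: Cohen's d with pooled standard deviation, the kind
# of between-group effect size used to compare preprocessing pipelines.
import math

def cohens_d(group_a, group_b):
    # d = (mean_a - mean_b) / pooled_sd, with the pooled SD weighted by
    # each group's degrees of freedom (n - 1).
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Hypothetical per-subject feature values (e.g. band power) for two groups.
carriers = [4.1, 4.5, 3.9, 4.8, 4.2]
controls = [3.2, 3.6, 3.1, 3.4, 3.5]
print(round(cohens_d(carriers, controls), 2))
```

A pipeline that removes artifacts without discarding neural signal should yield a larger |d| for a real group difference, which is the logic behind using effect size as a quality criterion.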