Decomposition Methods for Machine Learning with Small, Incomplete or Noisy Datasets

Caiafa, César F.; Solé-Casals, Jordi; Martí-Puig, Pere; Sun, Zhe; Tanaka, Tomomi

doi:10.3390/app10238481

Cited by 14 publications

(14 citation statements)

References 56 publications


“…These interpolated values were tracked, and never exceeded 3% of the full data. For future studies having data sets with a larger percentage of missing data, it is advisable to apply a data decomposition method to fill missing values [43]. The meteorological data were obtained from the Visual Crossing Weather application program interface [36], using weather station GK2 with coordinates (28°31′12.0″N 77°15′00.0″E).…”

Section: Methods and Experimental Design (mentioning)

confidence: 99%

A Comparison of Machine Learning Methods to Forecast Tropospheric Ozone Levels in Delhi

Juarez, Petersen

2021

Atmosphere


Ground-level ozone is a pollutant that is harmful to urban populations, particularly in developing countries where it is present in significant quantities. It greatly increases the risk of heart and lung diseases and harms agricultural crops. This study hypothesized that, as a secondary pollutant, ground-level ozone is amenable to 24 h forecasting based on measurements of weather conditions and primary pollutants such as nitrogen oxides and volatile organic compounds. We developed software to analyze hourly records of 12 air pollutants and 5 weather variables over the course of one year in Delhi, India. To determine the best predictive model, eight machine learning algorithms were tuned, trained, tested, and compared using cross-validation with hourly data for a full year. The algorithms, ranked by R² values, were XGBoost (0.61), Random Forest (0.61), K-Nearest Neighbor Regression (0.55), Support Vector Regression (0.48), Decision Trees (0.43), AdaBoost (0.39), and linear regression (0.39). When trained by separate seasons across five years, the predictive capabilities of all models increased, with a maximum R² of 0.75 during winter. Bidirectional Long Short-Term Memory was the least accurate model for annual training, but had some of the best predictions for seasonal training. Out of five air quality index categories, the XGBoost model was able to predict the correct category 24 h in advance 90% of the time when trained with full-year data. Separated by season, winter is considerably more predictable (97.3%), followed by post-monsoon (92.8%), monsoon (90.3%), and summer (88.9%). These results show the importance of training machine learning methods with season-specific data sets and comparing a large number of methods for specific applications.


“…Four contributions proposed general methods for machine learning with low-quality datasets. In [1], the authors provided a unified review of decomposition methods, which includes linear decomposition, low-rank matrix/tensor factorization, sparse matrix/tensor decomposition and empirical mode decomposition (EMD) models. This paper illustrates the ability of these decomposition models to impute missing features, denoising and to artificially generate additional data samples (data augmentation) with examples to the brain-computer interface (BCI) and epileptic EEG analysis, among others.…”

Section: Methodological Articles (mentioning)

confidence: 99%
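The citation statement above mentions low-rank factorization as a way to impute missing features. As a rough illustration only, the following minimal sketch fits a rank-1 model to the observed entries of a small matrix by alternating least squares and fills the gaps with the fitted values. The function name `rank1_impute`, the rank-1 restriction, and the use of `None` to mark missing entries are assumptions made for this example; they are not taken from the cited paper.

```python
def rank1_impute(X, n_iter=100):
    """Fill None entries of X with a rank-1 fit u[i] * v[j] (illustrative sketch)."""
    n_rows, n_cols = len(X), len(X[0])
    u = [1.0] * n_rows  # row factors
    v = [1.0] * n_cols  # column factors
    for _ in range(n_iter):
        # Least-squares update of each row factor over observed entries only.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if X[i][j] is not None]
            den = sum(v[j] ** 2 for j in obs)
            if den > 0:
                u[i] = sum(X[i][j] * v[j] for j in obs) / den
        # Least-squares update of each column factor over observed entries only.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if X[i][j] is not None]
            den = sum(u[i] ** 2 for i in obs)
            if den > 0:
                v[j] = sum(X[i][j] * u[i] for i in obs) / den
    # Keep observed values; replace missing ones with the model prediction.
    return [
        [X[i][j] if X[i][j] is not None else u[i] * v[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy data: an exactly rank-1 matrix with one entry removed.
X = [[1.0, 2.0, 4.0], [2.0, 4.0, None], [3.0, 6.0, 12.0]]
filled = rank1_impute(X)
```

Real datasets are rarely exactly low rank, so in practice a higher rank and some regularization would usually be needed; the sketch only shows the principle of fitting on observed entries and predicting the rest.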

“…Three papers addressed different problems or diseases in Neuroscience. For example, in [1], Caiafa et al (Argentina-Spain-Japan) reviewed recent approaches to deal with incomplete or noisy measurements by applying signal decomposition methods and showed their usefulness in epileptic intracranial electroencephalogram (iEEG) signals classification, among other applications. Finding epileptic focus with iEEG is usually difficult mainly because available datasets labeled by expert medical doctors are scarce.…”

Section: Medical Applications (mentioning)

confidence: 99%

“…The authors proposed a data augmentation technique by introducing changes in pixels in face images associated with variations by extracting the binary weighted interpolation map (B-WIM) from neutral and variational images in the auxiliary set. In [1], the EMD method was applied to remove noise in face images, thus improving the classification accuracy of a machine learning classifier. Finally, in [15], Mouratidis et al (Greece) provided an application to natural language processing.…”

Section: Other Applications (mentioning)

confidence: 99%


Machine Learning Methods with Noisy, Incomplete or Small Datasets

Caiafa, Sun, Tanaka et al.

2021

Applied Sciences

Self Cite


In this article, we present a collection of fifteen novel contributions on machine learning methods with low-quality or imperfect datasets, which were accepted for publication in the special issue “Machine Learning Methods with Noisy, Incomplete or Small Datasets”, Applied Sciences (ISSN 2076-3417). These papers provide a variety of novel approaches to real-world machine learning problems where available datasets suffer from imperfections such as missing values, noise or artefacts. Contributions in applied sciences include medical applications, epidemic management tools, methodological work, and industrial applications, among others. We believe that this special issue will bring new ideas for solving this challenging problem, and will provide clear examples of application in real-world scenarios.


“…The deep learning technology is notable for its impressive performance and generalization capability, but the number of effective samples in the medical imaging dataset is usually small, leading to performance degradation. The training model needs large amount of data to avoid overfitting (Caiafa et al, 2020). However, obtaining enough MRI data is not easy.…”

Section: Introduction (mentioning)

confidence: 99%

Graph Empirical Mode Decomposition-Based Data Augmentation Applied to Gifted Children MRI Analysis

Chen, Hao et al.

2022

Front. Neurosci.

Self Cite


Gifted children and normal controls can be distinguished by analyzing the structural connectivity (SC) extracted from MRI data. Previous studies have improved classification accuracy by extracting several features of the brain regions. However, the limited size of the database may lead to degradation when training deep neural networks as classification models. To this end, we propose to use a data augmentation method by adding artificial samples generated using graph empirical mode decomposition (GEMD). We decompose the training samples by GEMD to obtain the intrinsic mode functions (IMFs). Then, the IMFs are randomly recombined to generate the new artificial samples. After that, we use the original training samples and the new artificial samples to enlarge the training set. To evaluate the proposed method, we use a deep neural network architecture called BrainNetCNN to classify the SCs of MRI data with and without data augmentation. The results show that the data augmentation with GEMD can improve the average classification performance from 55.7 to 78%, while we get a state-of-the-art classification accuracy of 93.3% by using GEMD in some cases. Our results demonstrate that the proposed GEMD augmentation method can effectively increase the limited number of samples in the gifted children dataset, improving the classification accuracy. We also found that the classification accuracy is improved when specific features extracted from brain regions are used, achieving 93.1% for some feature selection methods.
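The recombination step described in the abstract (decompose training samples into IMFs, then randomly recombine them into artificial samples) can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the decomposition has already been computed elsewhere, that every sample has the same number of components, and that all donor samples passed in belong to the same class. The function name `recombine_imfs` and its parameters are hypothetical.

```python
import random


def recombine_imfs(decomposed, n_new, seed=0):
    """Build artificial samples by mixing components across samples.

    decomposed: list of samples; each sample is a list of K components,
                and each component is a list of floats of equal length.
    Returns n_new samples, each the sum of K components, where the k-th
    component is taken from a randomly chosen donor sample.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    n_components = len(decomposed[0])
    length = len(decomposed[0][0])
    new_samples = []
    for _ in range(n_new):
        sample = [0.0] * length
        for k in range(n_components):
            donor = rng.choice(decomposed)  # pick a donor for the k-th component
            for t in range(length):
                sample[t] += donor[k][t]
        new_samples.append(sample)
    return new_samples


# Toy input: two samples, each already split into two components.
decomposed = [
    [[1.0, 2.0], [10.0, 20.0]],
    [[3.0, 4.0], [30.0, 40.0]],
]
augmented = recombine_imfs(decomposed, n_new=5)
```

Keeping the component index fixed while varying the donor means each artificial sample still contains one component per scale, which is the property that makes the recombined signals resemble the originals.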
