2020
DOI: 10.1007/s11042-020-09430-3
Emotional quantification of soundscapes by learning between samples

Abstract: Predicting the emotional responses of humans to soundscapes is a relatively recent field of research coming with a wide range of promising applications. This work presents the design of two convolutional neural networks, namely ArNet and ValNet, each one responsible for quantifying arousal and valence evoked by soundscapes. We build on the knowledge acquired from the application of traditional machine learning techniques on the specific domain, and design a suitable deep learning framework. Moreover, we propos…


Cited by 6 publications (7 citation statements)
References 21 publications
“…Recent studies with EMO have shown that more sophisticated nonlinear models (such as RF) can reach good scores with 15 features for arousal (MSE ≈ 0.050) and 14 features for valence (MSE ≈ 0.140). Finally, other authors using other complex nonlinear models, such as CNNs and data augmentation techniques, obtain slightly better metrics (MSE ≈ 0.035 for arousal and MSE ≈ 0.078 for valence), but also include substantially more variables in their models: from 23 up to 54 features [11], [31]. All these considerations confirm the quality of our suggested models.…”
Section: B Selection Of the Number Of Variables And Suggested Model F... (supporting)
confidence: 76%
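For context, a model of the kind being compared there (a random forest regressor scored by cross-validated mean squared error) can be sketched as below; the placeholder feature matrix, the arousal targets, and the 15-feature dimensionality are illustrative assumptions, not the cited data or setup.

# Sketch: random forest regression scored by cross-validated MSE.
# X and y are random placeholders, not the EMO features or ratings of the cited work.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))           # e.g. 15 selected features for arousal
y = rng.uniform(-1.0, 1.0, size=200)     # placeholder arousal ratings

rf = RandomForestRegressor(n_estimators=100, random_state=0)
mse = -cross_val_score(rf, X, y, cv=5, scoring="neg_mean_squared_error").mean()
print(f"cross-validated MSE: {mse:.3f}")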
“…In [1], a fine-tuned RF model with 14 features outperforms the previous RF model as well as convolutional neural networks (CNNs). Deep learning techniques have also been applied to SER through a CNN and 23 simplified mel-frequency cepstral coefficients (MFCCs) in [31], and in combination with an SVM (transfer learning) in [11]. Promising results use up to 54 features obtained by heuristic methods, despite the limited number of samples in EMO.…”
Section: Introduction (mentioning)
confidence: 99%
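As a rough illustration of such a pipeline, the sketch below extracts 23 MFCCs per frame with librosa and passes them through a small convolutional network; the file name, layer sizes, and PyTorch architecture are assumptions made for illustration, not the configuration of [31] or [11].

# Sketch: 23 MFCCs per frame fed to a small CNN with two regression outputs
# (arousal, valence). Shapes and layer sizes are illustrative assumptions.
import librosa
import torch
import torch.nn as nn

wav, sr = librosa.load("soundscape.wav", sr=22050)        # hypothetical input clip
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=23)      # shape: (23, n_frames)
x = torch.tensor(mfcc, dtype=torch.float32)[None, None]   # shape: (1, 1, 23, n_frames)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),                                      # outputs: [arousal, valence]
)
print(cnn(x).shape)                                        # torch.Size([1, 2]) before any training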
“…The authors used two sets of techniques to extract features. The first method used a pretrained deep neural network created by S. Hershey et al. [50], whereas the second method involved 54 features extracted using MIRToolbox and YAAFE. The best performance for arousal was reported with the CNN, with an R² of 0.832 and an MSE of 0.035, whereas the best performance for valence was reported to have an R² of 0.759 and an MSE of 0.078 via VGGish (a deep CNN model).…”
Section: Sound Emotion Recognition (mentioning)
confidence: 99%
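The pretrained network referenced there is the VGGish model of Hershey et al.; as a hedged sketch, one way to obtain such 128-dimensional embeddings is the third-party torchvggish port on PyTorch Hub shown below (the repository name and call are assumptions about that port, not the cited authors' code), after which any regressor can map clip-level embeddings to arousal and valence.

# Sketch: VGGish-style embedding extraction followed by a plain regressor.
# The torch.hub repository is a third-party port of the Hershey et al. model;
# treat the exact call as an assumption and verify it before relying on it.
import torch

vggish = torch.hub.load("harritaylor/torchvggish", "vggish")
vggish.eval()

emb = vggish.forward("soundscape.wav")                    # (n_frames, 128) embeddings
clip_feature = emb.detach().numpy().mean(axis=0)          # 128-d clip-level feature

# A downstream regressor (hypothetical here) would be trained on such features:
# from sklearn.linear_model import Ridge
# reg = Ridge().fit(X_train, y_train)                     # X_train: clip features, y_train: valence
# valence_pred = reg.predict(clip_feature.reshape(1, -1))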
“…- n_estimators: (50, 100, 150, 200, 250, 300), number of trees in the forests;
- max_depth: (5, 10, 20, 30, 50), maximum number of levels in each decision tree;
- min_samples_split: (2, 3, 4, 5, 6, 7), minimum number of data points placed in a node before the node is split;
- min_samples_leaf: (1, 2, 3, 5), minimum number of data points allowed in a leaf node;
- k: range(1, 68), number of features selected using RFE with the RF estimator.…”
Section: Hyper-parameter Tuning (mentioning)
confidence: 99%
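Read as a scikit-learn search space, that grid corresponds roughly to the sketch below; the pipeline layout, the use of RandomizedSearchCV rather than an exhaustive grid search, and the placeholder data are assumptions made for illustration.

# Sketch: RFE feature selection (k in 1..67) plus a random forest, tuned over
# the grid listed above. RandomizedSearchCV keeps the search cheap here; the
# cited work may have searched the grid exhaustively instead.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 67))            # placeholder feature matrix (67 features)
y = rng.uniform(-1.0, 1.0, size=200)      # placeholder arousal/valence ratings

pipe = Pipeline([
    ("rfe", RFE(RandomForestRegressor(random_state=0))),
    ("rf", RandomForestRegressor(random_state=0)),
])

param_distributions = {
    "rfe__n_features_to_select": list(range(1, 68)),
    "rf__n_estimators": [50, 100, 150, 200, 250, 300],
    "rf__max_depth": [5, 10, 20, 30, 50],
    "rf__min_samples_split": [2, 3, 4, 5, 6, 7],
    "rf__min_samples_leaf": [1, 2, 3, 5],
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=20, cv=5,
                            scoring="neg_mean_squared_error", random_state=0)
search.fit(X, y)
print(search.best_params_)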
“…Ntalampiras [15] provided a comparison between emotion prediction from singleton soundscapes and from mixed soundscapes using a CNN model. The author used the Emo-Soundscape dataset and extracted features from the sound samples using the log-Mel spectrum [16], a spectrogram in which the frequencies are converted to the Mel scale.…”
Section: Related Work (mentioning)
confidence: 99%
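For reference, a log-Mel spectrogram of that kind can be computed with librosa as in the short sketch below; the file name, sample rate, hop length, and number of Mel bands are illustrative defaults rather than the settings of [15] or [16].

# Sketch: log-Mel spectrogram as a CNN input representation.
# Parameter values are illustrative defaults, not the cited configuration.
import librosa
import numpy as np

wav, sr = librosa.load("soundscape.wav", sr=22050)          # hypothetical input clip
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)              # (64, n_frames), in dB
print(log_mel.shape)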