2018 IEEE International Symposium on Multimedia (ISM)
DOI: 10.1109/ism.2018.00-11

Deep Learning of Human Perception in Audio Event Classification

Abstract: In this paper, we introduce our recent studies on human perception in audio event classification with different deep learning models. In particular, the pre-trained model VGGish is used as a feature extractor to process audio data, and DenseNet is trained on and used as a feature extractor for our electroencephalography (EEG) data. The correlation between audio stimuli and EEG is learned in a shared space. In the experiments, we record the brain activities (EEG signals) of several subjects while they are listening to mu…
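As a rough illustration of the pipeline the abstract describes (VGGish as a fixed audio feature extractor, DenseNet as an EEG feature extractor), here is a minimal sketch. The torch.hub VGGish port, the torchvision DenseNet, and all shapes and inputs are assumptions for illustration, not the authors' exact setup.

```python
# Sketch of the two feature-extraction branches described in the abstract.
# Assumptions (not from the paper): VGGish comes from the harritaylor/torchvggish
# torch.hub port, EEG trials are rendered as 3xHxW tensors for a torchvision
# DenseNet, and the DenseNet here is untrained (the paper trains it on EEG data).
import torch
import torchvision

vggish = torch.hub.load("harritaylor/torchvggish", "vggish")  # audio branch
vggish.eval()

densenet = torchvision.models.densenet121(weights=None)       # EEG branch
densenet.classifier = torch.nn.Identity()  # keep the 1024-d penultimate features
densenet.eval()

def audio_features(wav_path: str) -> torch.Tensor:
    """128-d VGGish embeddings, one row per ~1 s frame of the audio clip."""
    with torch.no_grad():
        return vggish.forward(wav_path)

def eeg_features(eeg_trial: torch.Tensor) -> torch.Tensor:
    """1024-d DenseNet features for one EEG trial given as a 3xHxW tensor."""
    with torch.no_grad():
        return densenet(eeg_trial.unsqueeze(0)).squeeze(0)
```

These two feature sets are what the abstract's shared-space correlation step (see the CCA sketch further below) would consume.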

Cited by 14 publications (7 citation statements)
References 9 publications
“…Table 3 shows that our model is trained on EEG responses with the longest stimuli by a large margin, on the order of minutes. Other studies have also attempted to omit feature-extraction steps, but the stimulus lengths in their experiments are significantly shorter than ours (on the order of seconds) [13,15]. In comparison to [16], the EEG responses in NMED-T were to unfamiliar stimuli, which prior work has shown to be the harder case, as classification performance drops when listeners are not familiar with the music stimuli [23].…”
Section: Comparisons
confidence: 94%
“…Another study used longer stimuli (∼10 sec) of 8 varying types of vocalizations and was able to achieve ∼61% performance without any feature extraction on the EEG passed to DenseNet. Yu et al. [15] improved the performance to ∼81% by incorporating canonical correlation analysis between DenseNet and a pre-trained VGG model that extracted audio features of the experimental stimuli. Most recently, Sonawane et al. [16] improved on these approaches and showed that longer and more complex stimuli (∼2 min of music) could be used to evoke EEG responses used as spectral 2D CNN inputs to classify song ID (∼85.0%).…”
Section: Introduction
confidence: 99%
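The canonical correlation analysis mentioned in the statement above can be sketched as follows; sklearn's CCA is used as a stand-in and the feature matrices are random placeholders, so this shows the shape of the computation rather than the cited authors' exact procedure.

```python
# Hedged sketch of CCA between DenseNet EEG features and VGGish audio features:
# both are projected into a shared space where per-component correlations can
# score how well an EEG response matches an audio stimulus.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
eeg_feats = rng.normal(size=(200, 1024))    # placeholder DenseNet EEG features
audio_feats = rng.normal(size=(200, 128))   # placeholder VGGish audio features

cca = CCA(n_components=16)
cca.fit(eeg_feats, audio_feats)

eeg_proj, audio_proj = cca.transform(eeg_feats, audio_feats)  # shared space
corrs = [np.corrcoef(eeg_proj[:, k], audio_proj[:, k])[0, 1] for k in range(16)]
print("canonical correlations:", np.round(corrs, 3))
```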
“…Yi Yu et al. used a convolutional neural network called DenseNet [16] for audio event classification [7]. The EEG responses were collected from 9 male participants.…”
Section: Related Work
confidence: 99%
“…They used engineered features for processing the EEG data, which depend on domain knowledge. There have been few attempts at automatic feature extraction from EEG data using neural networks for the song classification task [7].…”
Section: Introduction
confidence: 99%
“…In addition, we obtain visual features v_n ∈ R^{d_v} by applying a simple dimension-reduction approach, principal component analysis (PCA), to the obtained features in order to prevent over-fitting. Note that application of PCA to CNN features is generally used for dimension reduction [14,41].…”
Section: Heterogeneous Feature Extraction
confidence: 99%
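The PCA step mentioned in the statement above is a standard dimension-reduction recipe; a small sketch with sklearn follows, where the CNN feature matrix and the target dimension d_v = 64 are placeholders rather than values from the cited work.

```python
# Reduce high-dimensional CNN features to d_v dimensions before further use,
# as a guard against over-fitting; all sizes here are illustrative.
import numpy as np
from sklearn.decomposition import PCA

cnn_feats = np.random.default_rng(1).normal(size=(500, 2048))  # pooled CNN features
pca = PCA(n_components=64)                    # d_v, chosen only for illustration
visual_feats = pca.fit_transform(cnn_feats)   # v_n in R^{d_v}, one row per sample
print(visual_feats.shape, round(pca.explained_variance_ratio_.sum(), 3))
```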