2018
DOI: 10.3390/app8071152
|View full text |Cite
|
Sign up to set email alerts
|

An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition

Abstract: Convolutional neural networks (CNNs) with log-mel audio representation and CNN-based end-to-end learning have both been used for environmental event sound recognition (ESC). However, log-mel features can be complemented by features learned from the raw audio waveform with an effective fusion method. In this paper, we first propose a novel stacked CNN model with multiple convolutional layers of decreasing filter sizes to improve the performance of CNN models with either log-mel feature input or raw waveform inp… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
3
1
1

Citation Types

0
62
1

Year Published

2018
2018
2024
2024

Publication Types

Select...
5
3

Relationship

1
7

Authors

Journals

citations
Cited by 92 publications
(63 citation statements)
references
References 22 publications
0
62
1
Order By: Relevance
“…In this study, through a series of classifications for three study sites, we found that: (1) The proposed CNN-Polygon approach works well for land cover classification even when the number of training samples is very small, and (2) the proposed CNN-Matrix performs well when multi-temporal data are used for classification and the training sample size is relatively large. In particular, this study showed that the types (and structures) of input features are a critical consideration of CNN-based classification [73,74]. More visually intuitive input features tend to increase the classification accuracy even when training data are limited.…”
Section: Novelty and Limitationsmentioning
confidence: 84%
“…In this study, through a series of classifications for three study sites, we found that: (1) The proposed CNN-Polygon approach works well for land cover classification even when the number of training samples is very small, and (2) the proposed CNN-Matrix performs well when multi-temporal data are used for classification and the training sample size is relatively large. In particular, this study showed that the types (and structures) of input features are a critical consideration of CNN-based classification [73,74]. More visually intuitive input features tend to increase the classification accuracy even when training data are limited.…”
Section: Novelty and Limitationsmentioning
confidence: 84%
“…As shown on Table II, for the ESC-10 dataset, TimeScaleNet only allows to achieve environmental sound classification with a mean accuracy of 69.71% and a standard deviation of 1.91% across the five folds. This result is far from matching the best results on environmental sound classification using raw audio on the ESC-10 dataset [53]. In [53], the authors described RawNet, whose intent is also to achieve joint feature learning in the time domain, along with sound classification.…”
Section: B Environmental Sound Classification Performance Evaluationmentioning
confidence: 91%
“…This result is far from matching the best results on environmental sound classification using raw audio on the ESC-10 dataset [53]. In [53], the authors described RawNet, whose intent is also to achieve joint feature learning in the time domain, along with sound classification. Their approach allowed to achieve 85.2% of accuracy, which is much better than the obtained performance of TimeScaleNet using the ESC-10 dataset, which only slightly outperforms the baseline methods proposed by the maintainer of the dataset in [48] and [54].…”
Section: B Environmental Sound Classification Performance Evaluationmentioning
confidence: 91%
See 1 more Smart Citation
“…Recently, deep learning has been shown to achieve impressive results in areas such as computer vision, image and video processing, speech, and natural language processing [19]. Deep learning methods have also been applied to fault diagnosis [20][21][22] and the RUL estimation problem. For the RUL estimation problem, a Convolutional Neural Network (CNN) is applied on a sliding window with multi-times weight and failures height in [23].…”
Section: Introductionmentioning
confidence: 99%