Masked Conditional Neural Networks for Environmental Sound Classification

Medhat, Fady; Chesmore, David

doi:10.1007/978-3-319-71078-5_2

Cited by 12 publications

(11 citation statements)

References 23 publications


“…We used the model specified in Table I and the signal representation (60 mel-spec with delta) discussed earlier, which is the same transformation used by Piczak-CNN [7]. The dataset is pre-distributed into 10-folds, which we used to report the mean accuracy in Table II. The shallow MCLNN in combination with a long segment (k=50) achieved an accuracy of 74.22% compared to a deep MCLNN with a shorter segment (k=5) in [24]. The accuracy of the MCLNN surpasses other reported neural-network-based attempts using state-of-the-art CNN architectures proposed by Salamon et al. in [31] and Piczak in [7].…”

Section: A Urbansound8k (mentioning)

confidence: 73%
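The evaluation protocol quoted above (pre-distributed folds, mean accuracy over the folds) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `predefined_fold_splits` and `mean_accuracy` and the toy fold assignment are assumptions made for the example.

```python
import numpy as np

def predefined_fold_splits(fold_ids):
    """Yield (fold, train_idx, test_idx), holding out one predefined fold at a time."""
    fold_ids = np.asarray(fold_ids)
    for fold in np.unique(fold_ids):
        test_idx = np.flatnonzero(fold_ids == fold)   # clips in the held-out fold
        train_idx = np.flatnonzero(fold_ids != fold)  # clips in all other folds
        yield int(fold), train_idx, test_idx

def mean_accuracy(per_fold_accuracy):
    """Average the per-fold accuracies into a single reported figure."""
    return float(np.mean(per_fold_accuracy))

# Hypothetical fold assignment: 30 clips spread evenly over folds 1..10.
fold_ids = np.repeat(np.arange(1, 11), 3)
splits = list(predefined_fold_splits(fold_ids))
```

Using the dataset's own fold assignment, rather than a random split, keeps results comparable with other work that reports on the same folds.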

“…We have performed the MCLNN evaluation using the Urbansound8k [28], YorNoise [24], ESC-10 [29] and ESC-50 [29] environmental sound datasets. We will discuss the composition of each dataset with the common preprocessing applied, and we will defer the discussion to each dataset's relevant section.…”

Section: Methods (mentioning)

confidence: 99%

“…We will discuss the composition of each dataset with the common preprocessing applied, and we will defer the discussion to each dataset's relevant section. In this work, we explore the performance of a shallow architecture of the MCLNN in combination with a long segment compared to the deep MCLNN architectures considered for the mentioned datasets in [23][24][25].…”

Section: Methods (mentioning)

confidence: 99%

“…Table III lists the mean accuracies achieved over a 10-fold cross-validation for both the Urbansound8k and YorNoise combined. A shallow MCLNN achieved an accuracy of 75.92% compared to the deep architecture in [24] that reached 75.13%. Despite the comparable accuracy, the shallow MCLNN used 1 million parameters compared to the 3 million parameters of the deep variant and achieved higher accuracy using a longer segment.…”

Section: B YorNoise (mentioning)

confidence: 97%

“…Additionally, Piczak used augmentation, which involves introducing deformations to sound signal, e.g. time delay, pitch shifting. Piczak applied 10 augmentation variants to each sound file, which increases the dataset and consequently the accuracy as studied by Salamon in [31].…”

Section: ESC-10 (mentioning)

confidence: 99%
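As an illustration of the "time delay" deformation mentioned in the statement above, the sketch below shifts a signal later in time and pads the start with silence. It is an assumed, simplified form of the augmentation; the helper name `time_delay` is hypothetical, and the exact variants Piczak applied are not given in the quoted text.

```python
import numpy as np

def time_delay(signal, delay_samples):
    """Return a copy of `signal` delayed by `delay_samples`, zero-padded at the start."""
    signal = np.asarray(signal)
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    if delay_samples == 0:
        return signal.copy()
    delayed = np.zeros_like(signal)
    # Samples pushed past the end are dropped so the length stays unchanged.
    delayed[delay_samples:] = signal[:-delay_samples]
    return delayed

# Toy waveform standing in for a sound clip.
clip = np.array([1.0, 2.0, 3.0, 4.0])
augmented = time_delay(clip, 1)
```

Each delayed copy keeps the original label, which is how augmentation enlarges the training set.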


Recognition of Acoustic Events Using Masked Conditional Neural Networks

Medhat

Chesmore

2017

2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)

Self Cite


Abstract: Automatic feature extraction using neural networks has accomplished remarkable success for images, but for sound recognition, these models are usually modified to fit the nature of the multi-dimensional temporal representation of the audio signal in spectrograms. This may not efficiently harness the time-frequency representation of the signal. The ConditionaL Neural Network (CLNN) takes into consideration the interrelation between the temporal frames, and the Masked ConditionaL Neural Network (MCLNN) extends upon the CLNN by forcing a systematic sparseness over the network's weights using a binary mask. The masking allows the network to learn about frequency bands rather than bins, mimicking a filterbank used in signal transformations such as MFCC. Additionally, the Mask is designed to consider various combinations of features, which automates the feature hand-crafting process. We applied the MCLNN for the Environmental Sound Recognition problem using the Urbansound8k, YorNoise, ESC-10 and ESC-50 datasets. The MCLNN has achieved competitive performance compared to state-of-the-art Convolutional Neural Networks and hand-crafted attempts.
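The masking idea in the abstract, a binary mask that enforces systematic sparseness so each hidden unit sees a band of frequency bins, can be sketched as below. This is a simplified illustration under stated assumptions: the function `band_mask`, its `bandwidth` and `overlap` parameters, and the wrap-around band placement are choices made for the example and are not claimed to reproduce the exact mask pattern of the MCLNN.

```python
import numpy as np

def band_mask(n_features, n_hidden, bandwidth, overlap):
    """Binary mask of shape (n_features, n_hidden).

    Column j keeps `bandwidth` consecutive feature rows; successive columns
    start `bandwidth - overlap` rows later, wrapping around the feature axis.
    """
    step = bandwidth - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than bandwidth")
    mask = np.zeros((n_features, n_hidden), dtype=np.float32)
    for j in range(n_hidden):
        start = (j * step) % n_features
        rows = (start + np.arange(bandwidth)) % n_features
        mask[rows, j] = 1.0
    return mask

rng = np.random.default_rng(0)
weights = rng.standard_normal((6, 4)).astype(np.float32)  # toy dense weights
mask = band_mask(n_features=6, n_hidden=4, bandwidth=3, overlap=1)

# Element-wise masking zeroes every weight outside a unit's band.
masked_weights = weights * mask

# One spectrogram frame (6 frequency bins) projected through the masked layer.
frame = rng.standard_normal(6).astype(np.float32)
hidden = frame @ masked_weights
```

Because the mask is applied element-wise to the weights, each hidden unit responds only to its own band of bins, which is the filterbank-like behaviour the abstract describes.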
