2017 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2017.7966291
Convolutional gated recurrent neural network incorporating spatial features for audio tagging

Abstract: Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are…
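The abstract describes a CNN front end over mel-filter banks followed by a GRU-based recurrent layer producing chunk-level tag predictions. Below is a minimal sketch of that general pattern in PyTorch, not the authors' exact architecture; the layer sizes and the values n_mels=40 and n_tags=7 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUTagger(nn.Module):
    """Sketch of a convolutional gated recurrent tagger (assumed sizes)."""

    def __init__(self, n_mels=40, n_tags=7, hidden=128):
        super().__init__()
        # CNN front end: extract local time-frequency features from MFBs.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),   # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        # Bidirectional GRU: model temporal context across frames.
        self.gru = nn.GRU(64 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        # Chunk-level presence/absence score per audio event.
        self.fc = nn.Linear(2 * hidden, n_tags)

    def forward(self, x):                      # x: (batch, time, n_mels)
        x = x.unsqueeze(1)                     # (batch, 1, time, n_mels)
        x = self.conv(x)                       # (batch, 64, time, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 64 * n_mels // 4)
        x, _ = self.gru(x)                     # (batch, time, 2 * hidden)
        x = x.mean(dim=1)                      # average frames -> chunk level
        return torch.sigmoid(self.fc(x))       # multi-label tag probabilities
```

Sigmoid outputs (rather than a softmax) reflect that tagging is multi-label: several audio events may be present in the same chunk.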

Cited by 81 publications (77 citation statements). References 27 publications.
“…As we have to run inference on a large amount of unlabeled data, inference speed is also an important factor along with accuracy. The models we experimented with include ResNet [24], DenseNet and Conv-RNN [25, 26] with different layers, which are among the state-of-the-art models for acoustic event detection. According to our experiments, DenseNet-63 achieves the highest accuracy and also has relatively small inference latency.…”
Section: Experimental Setting
confidence: 99%
“…Recently, due to the release of relatively larger labeled datasets, a plethora of efforts have been made on the audio scene classification task [7], [8]. In brief, the main contributions can be divided into three parts: the representation of the audio signal (or handcrafted feature design) [9], [10], [11]; more sophisticated shallow-architecture classifiers [12], [13], [14]; and the applications of deep learning to the ASC task [15], [16].…”
Section: Related To Prior Work
confidence: 99%
“…Indeed, deep learning has witnessed dramatic progress during the last decade and achieved success in several different fields, such as image classification [16], speech recognition [17], and natural language processing [18]. Although there are some attempts that employ CNNs to solve the ASC task, most of them tried to solve the problem using monaural signals.…”
Section: Related To Prior Work
confidence: 99%
“…When using a large enough dataset that provides satisfactory training data and has a good representation for each different class, many methods have been successful in performing both of the intermediate tasks. A few methods for audio event detection can be found in [9] and [22], while methods for audio tagging can be found in [12, 24, 25, 1, 19, 6]. These tasks are less challenging to train for than…” (Figure 1: Factorisation of the full transcription task.)
Section: Task Factorisation
confidence: 99%
“…Furthermore, considering that only chunk-level rather than frame-level labels are available, a large set of contextual frames of the chunk was fed into the network to perform this task. In [25, 1], the authors use a stacked convolutional recurrent network to perform environmental audio tagging and to tag the presence of birdsong, respectively. In [19], the authors explore two different models for end-to-end music audio tagging when a large amount of training data is available.…”
Section: Introduction
confidence: 99%
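The excerpt above highlights the weak-label setting: only chunk-level tags exist, so all frames of a chunk are fed to the network and supervised by a single multi-hot tag vector. A hypothetical training step illustrating that setup follows; the function name and tensor shapes are assumptions, and the model can be any chunk-level tagger such as the sketch shown earlier.

```python
import torch.nn.functional as F

def train_step(model, optimizer, chunk, tags):
    """chunk: (batch, time, n_mels) features; tags: (batch, n_tags) multi-hot labels."""
    optimizer.zero_grad()
    probs = model(chunk)                        # one tag-probability vector per chunk
    loss = F.binary_cross_entropy(probs, tags)  # multi-label loss on weak (chunk) labels
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the loss is computed per chunk rather than per frame, the network must learn internally which frames carry evidence for each tag, which is why temporal aggregation (e.g. averaging GRU outputs) appears in such architectures.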