2017 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2017.7966291
Convolutional gated recurrent neural network incorporating spatial features for audio tagging

Abstract: Environmental audio tagging is a newly proposed task to predict the presence or absence of a specific audio event in a chunk. Deep neural network (DNN) based methods have been successfully adopted for predicting the audio tags in the domestic audio scene. In this paper, we propose to use a convolutional neural network (CNN) to extract robust features from mel-filter banks (MFBs), spectrograms or even raw waveforms for audio tagging. Gated recurrent unit (GRU) based recurrent neural networks (RNNs) are…
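The abstract describes a CNN front end over mel-filter banks followed by a GRU-based recurrent layer producing chunk-level tag predictions. Below is a minimal sketch of that general pattern in PyTorch, not the authors' exact architecture; the layer sizes and the values n_mels=40 and n_tags=7 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUTagger(nn.Module):
    """Sketch of a convolutional gated recurrent tagger (assumed sizes)."""

    def __init__(self, n_mels=40, n_tags=7, hidden=128):
        super().__init__()
        # CNN front end: extract local time-frequency features from MFBs.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),   # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        # Bidirectional GRU: model temporal context across frames.
        self.gru = nn.GRU(64 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        # Chunk-level presence/absence score per audio event.
        self.fc = nn.Linear(2 * hidden, n_tags)

    def forward(self, x):                      # x: (batch, time, n_mels)
        x = x.unsqueeze(1)                     # (batch, 1, time, n_mels)
        x = self.conv(x)                       # (batch, 64, time, n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 64 * n_mels // 4)
        x, _ = self.gru(x)                     # (batch, time, 2 * hidden)
        x = x.mean(dim=1)                      # average frames -> chunk level
        return torch.sigmoid(self.fc(x))       # multi-label tag probabilities
```

Sigmoid outputs (rather than a softmax) reflect that tagging is multi-label: several audio events may be present in the same chunk.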

Cited by 81 publications (77 citation statements). References 27 publications.
“…As we have to run inference on a large amount of unlabeled data, inference speed is also an important factor along with accuracy. The models we experimented with include ResNet [24], DenseNet and Conv-RNN [25, 26] with different layers, which are among the state-of-the-art models for acoustic event detection. According to our experiments, DenseNet-63 achieves the highest accuracy and also has relatively small inference latency.…”
Section: Experimental Setting
confidence: 99%
“…Recently, due to the release of relatively larger labeled datasets, a plethora of efforts have been made on the audio scene classification task [7], [8]. In brief, the main contributions can be divided into three parts: the representation of the audio signal (or handcrafted feature design) [9], [10], [11]; more sophisticated shallow-architecture classifiers [12], [13], [14]; and the applications of deep learning to the ASC task [15], [16].…”
Section: Related To Prior Work
confidence: 99%
“…Indeed, deep learning has witnessed dramatic progress during the last decade and achieved success in several different fields, such as image classification [16], speech recognition [17], and natural language processing [18]. Although there are some attempts that employ CNNs to solve the ASC task, most of them tried to solve the problem using monaural signals.…”
Section: Related To Prior Work
confidence: 99%
“…When using a large enough dataset that provides satisfactory training data and has a good representation for each different class, many methods have been successful in performing both of the intermediate tasks. A few methods for audio event detection can be found in [9] and [22], while methods for audio tagging can be found in [12, 24, 25, 1, 19, 6]. These tasks are less challenging to train for than…” (Figure 1: Factorisation of the full transcription task.)
Section: Task Factorisation
confidence: 99%
“…Furthermore, considering that only chunk-level rather than frame-level labels are available, a large set of contextual frames of the chunk was fed into the network to perform this task. In [25, 1], the authors use a stacked convolutional recurrent network to perform environmental audio tagging and to tag the presence of birdsong, respectively. In [19], the authors explore two different models for end-to-end music audio tagging when a large amount of training data is available.…”
Section: Introduction
confidence: 99%
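The excerpt above highlights the weak-label setting: only chunk-level tags exist, so all frames of a chunk are fed to the network and supervised by a single multi-hot tag vector. A hypothetical training step illustrating that setup follows; the function name and tensor shapes are assumptions, and the model can be any chunk-level tagger such as the sketch shown earlier.

```python
import torch.nn.functional as F

def train_step(model, optimizer, chunk, tags):
    """chunk: (batch, time, n_mels) features; tags: (batch, n_tags) multi-hot labels."""
    optimizer.zero_grad()
    probs = model(chunk)                        # one tag-probability vector per chunk
    loss = F.binary_cross_entropy(probs, tags)  # multi-label loss on weak (chunk) labels
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the loss is computed per chunk rather than per frame, the network must learn internally which frames carry evidence for each tag, which is why temporal aggregation (e.g. averaging GRU outputs) appears in such architectures.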