2020
DOI: 10.1109/access.2020.3015047
MTF-CRNN: Multiscale Time-Frequency Convolutional Recurrent Neural Network for Sound Event Detection

Abstract: To reduce neural network parameter counts and improve sound event detection performance, we propose a multiscale time-frequency convolutional recurrent neural network (MTF-CRNN) for sound event detection. Our goal is to recognize target sound events of variable duration against different audio backgrounds while keeping the parameter count low. We exploit four groups of parallel and serial convolutional kernels to learn high-level shift-invariant features from the time and freque…
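The abstract is truncated above, but the design it describes (parallel multi-scale time-frequency convolutions feeding a recurrent layer that emits frame-level event probabilities) can be sketched. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation: the class names, kernel shapes, channel counts, pooling sizes, and GRU width are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class MultiScaleTFBlock(nn.Module):
    """Parallel conv branches with different time/frequency kernel shapes.
    Kernel shapes and channel counts are illustrative, not the paper's exact configuration."""
    def __init__(self, in_ch, out_ch_per_branch=16):
        super().__init__()
        # Four parallel branches: square, frequency-oriented, time-oriented, pointwise.
        kernel_shapes = [(3, 3), (1, 5), (5, 1), (1, 1)]
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch_per_branch, k, padding=(k[0] // 2, k[1] // 2)),
                nn.BatchNorm2d(out_ch_per_branch),
                nn.ReLU(),
            )
            for k in kernel_shapes
        ])

    def forward(self, x):                 # x: (batch, channels, time, freq)
        # Concatenate the branch outputs along the channel axis.
        return torch.cat([b(x) for b in self.branches], dim=1)

class SketchMTFCRNN(nn.Module):
    """Hypothetical multiscale CRNN: stacked multi-scale conv blocks + bidirectional GRU."""
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.block1 = MultiScaleTFBlock(1)
        self.pool1 = nn.MaxPool2d((1, 4))      # pool along frequency, keep time resolution
        self.block2 = MultiScaleTFBlock(64)
        self.pool2 = nn.MaxPool2d((1, 4))
        rnn_in = 64 * (n_mels // 16)
        self.gru = nn.GRU(rnn_in, 64, bidirectional=True, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                      # x: (batch, 1, time, n_mels)
        x = self.pool1(self.block1(x))
        x = self.pool2(self.block2(x))
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # frame-wise feature vectors
        x, _ = self.gru(x)
        return torch.sigmoid(self.head(x))     # per-frame event probabilities

# Usage example: two clips, 500 frames, 64 mel bands -> (2, 500, 10) frame-level probabilities.
model = SketchMTFCRNN()
probs = model(torch.randn(2, 1, 500, 64))
```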

Cited by 13 publications (5 citation statements) | References 38 publications
“…Zhang et al 14 propose an AED module called Multi-Scale Time-Frequency Attention (MTFA); it tells the model where to focus along the time and frequency axes by collecting information at different resolutions, which earlier work had not addressed. Zhang et al 15 and Shen et al 16 proposed a multiscale time-frequency convolutional recurrent neural network (MTF-CRNN) for sound event detection 15 .…”
Section: Related Work (mentioning)
confidence: 99%
“…In [39], the Multi-level Convolutional Pyramid Semantic Fusion (MCPSF) framework was proposed to integrate multi-level semantic features extracted by a bag-of-visual-words (BoVW) model and a convolutional neural network (CNN) model. Zhang et al [40] proposed a Multi-scale Time-Frequency Convolutional Recurrent Neural Network (MTF-CRNN), operating on sound time-frequency maps, to improve sound event detection performance. Ding et al [41] proposed an Adaptive Multi-scale Detection (AdaMD) method, based on an hourglass neural network and a Gated Recurrent Unit (GRU) module, to extract characteristics of the time-frequency map at different scales.…”
Section: B. Multi-level Structure (mentioning)
confidence: 99%
“…Gaussian mixture models [6] and hidden Markov models [7] were initially used. Following advances in deep learning, SED methods based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) were introduced [8,9,10,11]. SED performs functions of the human auditory system in several domains, including audio surveillance [12] and social welfare [13].…”
Section: Introduction (mentioning)
confidence: 99%