High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism

Qiao, Tianhao; Zhang, Shunqing; Cao, Shan; Xu, Shugong

doi:10.3390/s21165500

Cited by 7 publications

(10 citation statements)

References 31 publications

(35 reference statements)

“…Chi et al [24] argued that a single spectrogram feature cannot provide enough information, and therefore proposed combining two different spectrogram features before using them for recognition. In addition, to enhance the classification ability of the models, various effective methods have been proposed, such as expanding the dataset using data augmentation [22,25], using multiple deep learning models for joint prediction [26,27], and designing more suitable deep learning models [28][29][30]. However, the sound categories used in these methods are mainly from urban public or indoor environments, and samples from urban forests are less involved, which cannot meet the needs of biodiversity and human activity studies.…”

Section: Introduction (mentioning)

confidence: 99%

Classification of Complicated Urban Forest Acoustic Scenes with Deep Learning Models

Zhang, Zhan, Hao, et al. 2023. Forests

The use of passive acoustic monitoring (PAM) can compensate for the shortcomings of traditional survey methods on spatial and temporal scales and achieve all-weather and wide-scale assessment and prediction of environmental dynamics. Assessing the impact of human activities on biodiversity by analyzing the characteristics of acoustic scenes in the environment is a frontier hotspot in urban forestry. However, with the accumulation of monitoring data, the selection and parameter setting of the deep learning model greatly affect the content and efficiency of sound scene classification. This study compared and evaluated the performance of different deep learning models for acoustic scene classification based on the recorded sound data from Guangzhou urban forest. There are seven categories of acoustic scenes for classification: human sound, insect sound, bird sound, bird–human sound, insect–human sound, bird–insect sound, and silence. A dataset containing seven acoustic scenes was constructed, with 1000 samples for each scene. The requirements of the deep learning models on the training data volume and training epochs in the acoustic scene classification were evaluated through several sets of comparison experiments, and it was found that the models were able to achieve satisfactory accuracy when the training sample data volume for a single category was 600 and the training epochs were 100. To evaluate the generalization performance of different models to new data, a small test dataset was constructed, and multiple trained models were used to make predictions on the test dataset. All experimental results showed that the DenseNet_BC_34 model performs best among the comparison models, with an overall accuracy of 93.81% for the seven acoustic scenes on the validation dataset. 
This study provides practical experience for the application of deep learning techniques in urban sound monitoring and provides new perspectives and technical support for further exploring the relationship between human activities and biodiversity.

“…Also, no specific domain knowledge is incorporated in their design which is necessary to achieve superior performance. In [22], in order to distinguish between different frequency bands, a model consisting of an ensemble of several CNNs was proposed, which processes each frequency band separately. Recently, several works have attempted to combine CNN with recurrent neural networks which has improved the CNN performance at the cost of higher model parameters and complexity.…”

Section: Introduction (mentioning)

confidence: 99%
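The band-wise ensemble mentioned in the statement above can be illustrated with a minimal sketch. It assumes a spectrogram stored as a list of frequency-bin rows and a simple score-averaging fusion step; the function names `split_frequency_bands` and `average_band_scores` are hypothetical, and neither the band boundaries nor the fusion rule is taken from the cited work.

```python
def split_frequency_bands(spectrogram, num_bands):
    """Split a spectrogram into contiguous frequency sub-bands.

    `spectrogram` is a list of rows, one per frequency bin, each row
    holding the values of that bin over time. Bands are made as equal
    in size as possible (an assumption, not the cited paper's rule).
    """
    n_bins = len(spectrogram)
    if not 1 <= num_bands <= n_bins:
        raise ValueError("num_bands must be between 1 and the number of bins")
    base, extra = divmod(n_bins, num_bands)
    bands, start = [], 0
    for i in range(num_bands):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first bands
        bands.append(spectrogram[start:start + size])
        start += size
    return bands


def average_band_scores(per_band_scores):
    """Fuse per-band class scores by plain averaging (hypothetical late fusion)."""
    n_models = len(per_band_scores)
    n_classes = len(per_band_scores[0])
    return [sum(scores[c] for scores in per_band_scores) / n_models
            for c in range(n_classes)]


# Toy input: 10 frequency bins, 3 time frames.
spec = [[float(b)] * 3 for b in range(10)]
bands = split_frequency_bands(spec, 4)
band_sizes = [len(band) for band in bands]
fused = average_band_scores([[0.2, 0.8], [0.6, 0.4]])
```

In a real system each sub-band would be fed to its own CNN, and the per-band class scores would then be combined; averaging is only one of several possible fusion choices.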

“…Moreover, local T-F patterns are highly shift-invariance across time axis so that temporal translation has little effect on the classification of sound events. Spectral characteristics: Compared to other audio signals, environmental sounds have a broader range of frequency information with diverse spectral profiles which are either scattered across frequency bands, concentrated at low, middle or higher frequency bands, or spread across all frequency bands [22], [23]. Also, unlike the time dimension, translation across the frequency dimension can significantly affect the performance of the sound classification [24].…”

Section: Introduction (mentioning)

confidence: 99%

Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling

Seresht, Mohammadi 2023. IEEE Access

Environmental Sound Classification (ESC) is an important field in a broad range of applications, such as smart cities, audio surveillance, and health care. Recently, Convolutional Neural Networks (CNNs) have taken the lead from traditional approaches and have produced promising results. However, the achieved improvements are often accompanied by increasing depth, complexity, and size of the network, which prevents their usage in many practical applications. In this work, our goal is to empower a small-size low-complexity CNN model to achieve superior performance. To this end, we concentrate on the importance of global pooling technique, which is less investigated in ESC. In most previous works, models utilize global average pooling layer which does not consider regional saliency, and thus weakens the salient time-frequency regions contributions to the classification, and also to the training of convolutional kernels. We propose a novel global pooling method, called Sparse Salient Region Pooling (SSRP), which computes the channel descriptors using a sparse subset of features, and guides the model to effectively learn from the more salient time-frequency regions. Experimental results demonstrate that the proposed model with only 700K parameters yields accuracies of 86.7% on ESC-50 and 94.8% on ESC-10, which are comparable to that of the state-of-the-art methods. Compared to the baseline model, our model achieves absolute improvement of 21.8% in accuracy on ESC-50, with 98% smaller model size. Our visual analyses show that SSRP intensifies the responses of low-energy regions such that they contribute even more than high-energy regions to the classification of specific sound classes.
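The contrast the abstract draws between global average pooling and pooling over a sparse subset of features can be made concrete with a small sketch. The abstract does not state how SSRP selects its subset, so the top-k rule below is only a stand-in chosen for illustration, and the function names are hypothetical.

```python
def global_average_pool(feature_map):
    """Channel descriptor as the mean over all time-frequency positions."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def top_k_pool(feature_map, k):
    """Channel descriptor as the mean of the k largest activations.

    This is an assumed stand-in for a sparse pooling rule, not the
    SSRP method itself.
    """
    values = sorted((v for row in feature_map for v in row), reverse=True)
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of positions")
    return sum(values[:k]) / k


# One channel of a feature map with a single small salient region.
channel = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 3.0],
    [0.0, 0.0, 0.0],
]
gap_descriptor = global_average_pool(channel)   # diluted by the seven zero positions
sparse_descriptor = top_k_pool(channel, 2)      # driven only by the two salient positions
```

With average pooling the salient region is diluted by the many near-zero positions around it, which is the weakness the abstract attributes to that layer; a sparse rule keeps the descriptor tied to the salient positions.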

“…However, these models are not able to perform calculations in parallel. More recently, attention mechanisms have been incorporated to focus on semantically important parts of the sound under study [13][14][15][16][17]. Lately, solutions based on attention models [11,18], particularly on Transformers [18][19][20][21][22], are being explored.…”

Section: Introduction (mentioning)

confidence: 99%

Transformers for Urban Sound Classification—A Comprehensive Performance Evaluation

Nogueira, Oliveira, Machado, et al. 2022. Sensors

Many relevant sound events occur in urban scenarios, and robust classification models are required to identify abnormal and relevant events correctly. These models need to identify such events within valuable time, being effective and prompt. It is also essential to determine for how much time these events prevail. This article presents an extensive analysis developed to identify the best-performing model to successfully classify a broad set of sound events occurring in urban scenarios. Analysis and modelling of Transformer models were performed using available public datasets with different sets of sound classes. The Transformer models’ performance was compared to the one achieved by the baseline model and end-to-end convolutional models. Furthermore, the benefits of using pre-training from image and sound domains and data augmentation techniques were identified. Additionally, complementary methods that have been used to improve the models’ performance and good practices to obtain robust sound classification models were investigated. After an extensive evaluation, it was found that the most promising results were obtained by employing a Transformer model using a novel Adam optimizer with weight decay and transfer learning from the audio domain by reusing the weights from AudioSet, which led to an accuracy score of 89.8% for the UrbanSound8K dataset, 95.8% for the ESC-50 dataset, and 99% for the ESC-10 dataset, respectively.
