2021
DOI: 10.3390/s21165500

High Accurate Environmental Sound Classification: Sub-Spectrogram Segmentation versus Temporal-Frequency Attention Mechanism

Abstract: In the important and challenging field of environmental sound classification (ESC), a crucial and even decisive factor is the feature representation ability, which can directly affect the accuracy of classification. Therefore, the classification performance often depends to a large extent on whether the effective representative features can be extracted from the environmental sound. In this paper, we firstly propose a sub-spectrogram segmentation with score level fusion based ESC classification framework, and …

Cited by 7 publications (10 citation statements)
References 31 publications (35 reference statements)
“…Chi et al [24] argued that a single spectrogram feature cannot provide enough information, and therefore proposed combining two different spectrogram features before using them for recognition. In addition, to enhance the classification ability of the models, various effective methods have been proposed, such as expanding the dataset using data augmentation [22,25], using multiple deep learning models for joint prediction [26,27], and designing more suitable deep learning models [28][29][30]. However, the sound categories used in these methods are mainly from urban public or indoor environments, and samples from urban forests are less involved, which cannot meet the needs of biodiversity and human activity studies.…”
Section: Introduction (mentioning)
confidence: 99%
“…Also, no specific domain knowledge is incorporated in their design, which is necessary to achieve superior performance. In [22], in order to distinguish between different frequency bands, a model consisting of an ensemble of several CNNs was proposed, which processes each frequency band separately. Recently, several works have attempted to combine CNNs with recurrent neural networks, which has improved CNN performance at the cost of more model parameters and higher complexity.…”
Section: Introduction (mentioning)
confidence: 99%
“…Moreover, local T-F patterns are highly shift-invariant across the time axis, so that temporal translation has little effect on the classification of sound events. Spectral characteristics: Compared to other audio signals, environmental sounds have a broader range of frequency information with diverse spectral profiles, which are either scattered across frequency bands, concentrated at low, middle or higher frequency bands, or spread across all frequency bands [22], [23]. Also, unlike the time dimension, translation across the frequency dimension can significantly affect the performance of the sound classification [24].…”
Section: Introduction (mentioning)
confidence: 99%
“…However, these models are not able to perform calculations in parallel. More recently, attention mechanisms have been incorporated to focus on semantically important parts of the sound under study [13][14][15][16][17]. Lately, solutions based on attention models [11,18], particularly on Transformers [18][19][20][21][22], are being explored.…”
Section: Introduction (mentioning)
confidence: 99%