2021
DOI: 10.1109/taslp.2021.3120633

PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation

Abstract: Audio event classification is an active research area and has a wide range of applications. Since the release of AudioSet, great progress has been made in advancing the classification accuracy, which mostly comes from the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio event classification models with AudioSet, but have not received the attention they deserve. To fill the gap, in this work, we present…

Cited by 92 publications (87 citation statements)
References 49 publications

“…AudioSet contains over two million 10-second audio clips labeled with 527 sound event classes. In this paper, we follow the same training pipeline as [11,12,14], training our model on the full training set (2M samples) and evaluating on the evaluation set (22K samples). All samples are converted to single-channel (mono) audio at a 32 kHz sampling rate.…”
Section: Dataset and Training Details
confidence: 99%
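
The preprocessing described in this statement is straightforward to reproduce. Below is a minimal sketch using torchaudio; the citing papers do not publish their exact loading code, so the function name and structure here are illustrative assumptions.

import torch
import torchaudio

TARGET_SR = 32000  # 32 kHz target rate, as stated in the quote above

def load_mono_32k(path: str) -> torch.Tensor:
    """Load an audio file, downmix to mono, and resample to 32 kHz."""
    waveform, sr = torchaudio.load(path)           # shape: (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to a single channel
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    return waveform                                # shape: (1, samples)
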
“…We follow [11,12] in using the balanced sampler, mixup [19] with α = 0.5, spectrogram masking [20] with a time mask of 128 frames and a frequency mask of 16 bins, and weight averaging. HTS-AT is implemented in PyTorch and trained with the AdamW optimizer (β1 = 0.9, β2 = 0.999, eps = 1e-8, decay = 0.05) with a batch size of 128 (32 × 4) on 4 NVIDIA Tesla V100 GPUs.…”
Section: Model
confidence: 99%
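
For readers unfamiliar with these ingredients, the sketch below shows how mixup with α = 0.5, SpecAugment-style masking (128 time frames, 16 frequency bins), and the quoted AdamW settings might be wired up in PyTorch. The placeholder model and the learning rate are assumptions; only the betas, eps, weight decay, mixup α, and mask sizes come from the quote.

import torch
import torchaudio.transforms as T

# SpecAugment-style masking with the quoted maximum mask sizes.
freq_mask = T.FrequencyMasking(freq_mask_param=16)  # up to 16 frequency bins
time_mask = T.TimeMasking(time_mask_param=128)      # up to 128 time frames

def mixup(x, y, alpha=0.5):
    """Mix pairs of examples and their multi-hot label vectors."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

model = torch.nn.Linear(128, 527)  # placeholder standing in for HTS-AT
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                 # assumption: the learning rate is not quoted
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.05,
)
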
“…The same group [17] conducted quality assessments of these labels, with ratings ranging from 0% to 100%. Label enhancement [7] involves altering the original labels. CNN14 [10] achieves 0.442 mean average precision (mAP) using 128 mel bins.…”
Section: AT Background and Related Work
confidence: 99%
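
The 128-mel-bin front end mentioned for CNN14 can be sketched as follows. The STFT window and hop sizes are common PANNs-style defaults and are assumptions here; only the 32 kHz rate (from the dataset statement above) and the 128 mel bins appear on this page.

import torch
import torchaudio.transforms as T

mel = T.MelSpectrogram(
    sample_rate=32000,  # from the dataset statement above
    n_fft=1024,         # assumption: typical PANNs-style window size
    hop_length=320,     # assumption: 10 ms hop at 32 kHz
    n_mels=128,         # 128 mel bins, as quoted
)
to_db = T.AmplitudeToDB()

waveform = torch.randn(1, 32000 * 10)  # dummy 10-second mono clip
log_mel = to_db(mel(waveform))         # shape: (1, 128, frames)
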
“…Vision Transformers (ViT) [6] in computer vision and other fields, and competitive performance has been reported. Recently, AST [2] and PSLA [7] improved the SOTA performance on the AudioSet AT benchmark by leveraging a suite of improvements, including the DeiT [1] (distilled ViT) architecture, ImageNet pretraining, data augmentation, and ensembling. However, there is still no clear "winner-takes-all" approach in audio classification that offers the best performance while remaining efficient.…”
Section: Introduction
confidence: 99%
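
A minimal sketch of the ImageNet-pretraining ingredient named above: initializing from a distilled ViT (DeiT) checkpoint and giving it a 527-way AudioSet head. This uses the timm library; the model name and head replacement are illustrative assumptions, not AST's published code, which additionally adapts the patch embedding to single-channel spectrogram input.

import timm
import torch

# Load an ImageNet-pretrained distilled ViT (DeiT) backbone; num_classes=527
# replaces the ImageNet head with a freshly initialized 527-way AudioSet head.
backbone = timm.create_model(
    "deit_base_distilled_patch16_384", pretrained=True, num_classes=527
)
backbone.eval()

# Dummy image-shaped input; AST reshapes log-mel spectrograms to fit the
# patch grid instead of feeding RGB images.
logits = backbone(torch.randn(1, 3, 384, 384))  # shape: (1, 527)
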