ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746312
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Cited by 72 publications (34 citation statements)
References 19 publications
“…Audio Spectrogram Transformers: Inspired by the Vision Transformer (ViT) [12], transformers capable of processing images have been adapted to the audio domain. Vision and Audio Spectrogram transformers [16,17,18,19] extract overlapping patches with a certain stride and size from the input image, add a positional encoding, and apply transformer layers to the flattened sequence of patches. Transformer layers use a global attention mechanism whose computation and memory complexity scales quadratically with respect to the input sequence length.…”
Section: Related Work
confidence: 99%
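The pipeline described in the excerpt above can be sketched in a few lines. This is a minimal illustration, not the actual HTS-AT or AST implementation: the patch size, stride, and spectrogram shape are assumed values, and the attention step is plain scaled dot-product self-attention, which makes the quadratic cost visible as an N x N score matrix for N patch tokens.

```python
import numpy as np

def extract_patches(spec, size=16, stride=10):
    """Slide a size x size window over the spectrogram with the given
    stride and flatten each window into one token vector (overlapping
    patches, as in spectrogram transformers)."""
    n_mels, n_frames = spec.shape
    patches = [
        spec[i:i + size, j:j + size].ravel()
        for i in range(0, n_mels - size + 1, stride)
        for j in range(0, n_frames - size + 1, stride)
    ]
    return np.stack(patches)  # shape: (num_patches, size * size)

def attention_weights(tokens):
    """Global scaled dot-product attention scores over all tokens:
    an N x N matrix, hence quadratic memory in the token count."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Assumed input: a 64-mel-bin, 256-frame log-mel spectrogram.
spec = np.random.default_rng(0).standard_normal((64, 256))
tokens = extract_patches(spec)
attn = attention_weights(tokens)
print(tokens.shape[0], attn.shape)
```

Doubling the number of frames roughly doubles the token count N but quadruples the size of `attn`, which is the scaling problem that hierarchical designs such as HTS-AT are built to avoid.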
“…1. Crosses denote models based on the Transformer architecture (Audio-MAE [19], HTS-AT [18], PaSST-S [17], PaSST-S-L [17], AST [16], KD-AST [10]) and circles denote models based on CNNs (PSLA [2], ERANN-1-6 [3], Wavegram-logmel-CNN [1], CNN14 [1], KD-CNN [10], MobileNets [7] - ours).…”
Section: Introduction
confidence: 99%
“…In terms of weakly supervised pre-training, the works in [3,5,6,7] proposed novel model architectures for weakly supervised AT and explored their performance on a limited number of downstream tasks. Further, the works [8,9,10,11,12,13] explored self-supervised learning with a linear classifier for sound classification, focusing on sound-domain downstream tasks.…”
Section: Related Work
confidence: 99%
“…Devlin et al. [40] introduced a new language representation model called BERT, and used the pre-trained BERT model with fine-tuning to create state-of-the-art models for a wide range of tasks. Indeed, beyond revolutionising NLP, Transformers have outperformed deep learning models constructed with CNNs on various audio classification tasks [41], thus extending their success to the audio domain [42,43,44].…”
Section: Related Work
confidence: 99%