2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv45572.2020.9093307

Audio-Visual Model Distillation Using Acoustic Images

Abstract: In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality. Previous models learn audio representations from raw signals or spectral data acquired by a single microphone, with remarkable results in classification and retrieval. However, such representations are not robust to variable environmental sound conditions. We tackle this drawback by exploiting a new multimodal labeled action recogni…
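The abstract describes a teacher-student setup in which models that see acoustic images transfer knowledge to an audio classifier that only receives single-microphone input. As a rough, hypothetical sketch of that idea (not the paper's actual architecture: the student network, class count, temperature, and loss weights below are placeholder assumptions), a standard temperature-scaled distillation loss looks like this:

import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 14  # placeholder; depends on the dataset

class SpectrogramStudent(nn.Module):
    """Small CNN over single-microphone spectrograms (illustrative only)."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):  # x: (batch, 1, freq, time)
        return self.classifier(self.features(x).flatten(1))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft targets from the acoustic-image teacher plus hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard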

Cited by 26 publications (27 citation statements); references 31 publications. All 27 citation statements are classed as mentioning (none supporting or contrasting), with citing publications spanning 2020 to 2023.

Citation statements, ordered by relevance:
“…To satisfy these requirements, knowledge distillation is widely studied and applied in many speech recognition tasks. There are many knowledge distillation systems for designing lightweight deep acoustic models for speech recognition (Chebotar and Waters, 2016; Wong and Gales, 2016; Chan et al., 2015; Price et al., 2016; Fukuda et al., 2017; Bai et al., 2019b; Ng et al., 2018; Albanie et al., 2018; Lu et al., 2017; Shi et al., 2019a; Roheda et al., 2018; Shi et al., 2019b; Gao et al., 2019; Ghorbani et al., 2018; Takashima et al., 2018; Watanabe et al., 2017; Shi et al., 2019c; Asami et al., 2017; Huang et al., 2018; Shen et al., 2018; Perez et al., 2020; Shen et al., 2019a; Oord et al., 2018). In particular, these KD-based speech recognition applications include spoken language identification (Shen et al., 2018, 2019a), text-independent speaker recognition (Ng et al., 2018), audio classification (Gao et al., 2019; Perez et al., 2020), speech enhancement (Watanabe et al., 2017), acoustic event detection (Price et al., 2016; Shi et al., 2019a,b), speech synthesis (Oord et al., 2018) and so on.…”
Section: KD in Speech Recognition
mentioning, confidence: 99%
“…Most existing knowledge distillation methods for speech recognition use teacher-student architectures to improve the efficiency and recognition accuracy of acoustic models (Chan et al., 2015; Chebotar and Waters, 2016; Lu et al., 2017; Price et al., 2016; Shen et al., 2018; Gao et al., 2019; Shen et al., 2019a; Shi et al., 2019c,a; Watanabe et al., 2017; Perez et al., 2020). Using a recurrent neural network (RNN) to hold the temporal information from speech sequences, the knowledge from the teacher RNN acoustic model was transferred into a small student DNN model (Chan et al., 2015).…”
Section: KD in Speech Recognition
mentioning, confidence: 99%
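To make the RNN-to-DNN transfer in that last passage concrete, below is a minimal, hypothetical PyTorch sketch (after Chan et al., 2015): a recurrent teacher produces per-frame posteriors, and a small feed-forward student is trained to match them. The feature dimension, layer sizes, and number of output states are illustrative assumptions.

import torch.nn as nn
import torch.nn.functional as F

class RNNTeacher(nn.Module):
    """Recurrent acoustic model producing per-frame state posteriors."""
    def __init__(self, feat_dim=40, hidden=512, num_states=2000):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_states)

    def forward(self, frames):  # frames: (batch, time, feat_dim)
        h, _ = self.rnn(frames)
        return self.out(h)  # (batch, time, num_states)

class DNNStudent(nn.Module):
    """Small feed-forward model scored frame by frame."""
    def __init__(self, feat_dim=40, hidden=256, num_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_states),
        )

    def forward(self, frames):
        return self.net(frames)

def frame_kd_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's frame posteriors."""
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    return -(q * log_p).sum(-1).mean()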
“…c) A model that aims to support the editorial decision process should only assume the availability of human review text during training, and be able to make recommendations in its absence. Inspired by missing modality hallucination methods (Hoffman et al., 2016; Tang et al.; Pérez et al., 2020), we propose a realistic system that uses all available data for training, but imputes review representations at test time based on the abstract text.…”
Section: Contributions
mentioning, confidence: 99%
“…Figure 1 depicts an overview of our architecture. Inspired by modality hallucination studies (Hoffman et al., 2016; Pérez et al., 2020), we use the abstract module to predict both the abstract representation h_i^abs…”
mentioning, confidence: 99%
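As the two passages above suggest, modality hallucination trains an extra branch to regress the missing modality's representation so it can be imputed at test time. Below is a hypothetical sketch of that pattern; the encoder names, feature sizes, and regression loss are placeholder assumptions, not the cited systems' implementations.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HallucinationModel(nn.Module):
    """Classifier with a branch that imputes a missing modality's features."""
    def __init__(self, in_dim=768, dim=256, num_classes=2):
        super().__init__()
        self.abstract_encoder = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU())
        self.review_encoder = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU())
        self.hallucinate = nn.Linear(dim, dim)  # predicts review features from the abstract
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, abstract_feats, review_feats=None):
        h_abs = self.abstract_encoder(abstract_feats)
        h_hal = self.hallucinate(h_abs)  # imputed review representation
        hal_loss = None
        if review_feats is not None:  # training: real reviews available
            h_rev = self.review_encoder(review_feats)
            hal_loss = F.mse_loss(h_hal, h_rev.detach())
            logits = self.classifier(torch.cat([h_abs, h_rev], dim=-1))
        else:  # test time: impute the review representation
            logits = self.classifier(torch.cat([h_abs, h_hal], dim=-1))
        return logits, hal_loss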
“…Alwassel et al. [399] proposed a self-supervised method, called Cross-Modal Deep Clustering (XDC), to utilize the semantic correlation and the differences between RGB and audio modalities. In the work of [400], audio deep learning models were trained by exploiting visual data and acoustic images in a teacher-student fashion.…”
Section: Co-learning With Visual and Sensor Modalities
mentioning, confidence: 99%
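For context on the XDC idea mentioned above, here is a rough, hypothetical sketch of cross-modal pseudo-labeling: cluster one modality's embeddings and use the cluster ids as classification targets for the other modality's encoder. The encoders, dimensions, and cluster count are placeholder assumptions, not the published method.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def xdc_step(rgb_encoder, audio_feats, rgb_clips, k=64):
    """One self-supervision round: audio clusters supervise the RGB encoder."""
    # 1. Cluster audio embeddings offline (no gradients flow through k-means).
    assignments = KMeans(n_clusters=k, n_init=10).fit_predict(
        audio_feats.detach().cpu().numpy()
    )
    pseudo_labels = torch.as_tensor(assignments, dtype=torch.long)
    # 2. Train the visual encoder to predict the audio-derived cluster ids.
    logits = rgb_encoder(rgb_clips)  # assumed to output (batch, k) logits
    return F.cross_entropy(logits, pseudo_labels)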