Exploiting Parallel Audio Recordings to Enforce Device Invariance in CNN-based Acoustic Scene Classification

Primus, Paul; Eghbal-zadeh, Hamid; Eitelsebner, David; Koutini, Khaled; Arzt, Andreas; Widmer, Gerhard

doi:10.48550/arxiv.1909.02869

Cited by 2 publications

(2 citation statements)

References 7 publications

(10 reference statements)


“…The huge amounts of video data that are becoming more and more available through online sources, enable and at the same time require continuously better performance in tasks like activity recognition, video saliency, scene analysis or video summarization, imposing the need to exploit not only spatial information, but also temporal [7,47,27]. Similar advances have also been achieved in audio processing areas, such as acoustic event detection [44], speech recognition [17,9], sound localization [56], by using deep learning techniques.…”

Section: Introduction (mentioning)

confidence: 96%

STAViS: Spatio-Temporal AudioVisual Saliency Network

Tsiami, Koutras, Maragos

2020

Preprint


We introduce STAViS, a spatio-temporal audiovisual saliency network that combines spatio-temporal visual and auditory information in order to efficiently address the problem of saliency estimation in videos. Our approach employs a single network that combines visual saliency and auditory features and learns to appropriately localize sound sources and to fuse the two saliencies in order to obtain a final saliency map. The network has been designed, trained end-to-end, and evaluated on six different databases that contain audiovisual eye-tracking data of a large variety of videos. We compare our method against 8 different state-of-the-art visual saliency models. Evaluation results across databases indicate that our STAViS model outperforms our visual-only variant as well as the other state-of-the-art models in the majority of cases. Also, the consistently good performance it achieves for all databases indicates that it is appropriate for estimating saliency "in-the-wild".


“…In particular, spectrum correction [23] and channel conversion [24] build a front-end module to convert speech features from the source domain to target domain before feeding them to the back-end classifier. Besides front-end features, mid-level feature based transfer systems, which uses bottleneck features [25] or hidden layer representations [26] are adopted to transfer knowledge from source to target domain. Adversarial training methods in [27,28] leverage an extra domain discriminator to solve the device mismatch problem although the key focus is on lack of labeled target data.…”

Section: Introduction (mentioning)

confidence: 99%

Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification

Siniscalchi, Wang, et al.

2020

Interspeech 2020


In this paper, we propose a domain adaptation framework to address the device mismatch issue in acoustic scene classification leveraging upon neural label embedding (NLE) and relational teacher student learning (RTSL). Taking into account the structural relationships between acoustic scene classes, our proposed framework captures such relationships which are intrinsically device-independent. In the training stage, transferable knowledge is condensed in NLE from the source domain. Next in the adaptation stage, a novel RTSL strategy is adopted to learn adapted target models without using paired source-target data often required in conventional teacher student learning. The proposed framework is evaluated on the DCASE 2018 Task1b data set. Experimental results based on AlexNet-L deep classification models confirm the effectiveness of our proposed approach for mismatch situations. NLE-alone adaptation compares favourably with the conventional device adaptation and teacher student based adaptation techniques. NLE with RTSL further improves the classification accuracy.
