ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747237
Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition

Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separati…

Cited by 6 publications (9 citation statements) | References 72 publications
“…Li et al [22] created a novel audio-visual deep learning technique that combines auditory and visual data to detect speech from many channels. The separation filters that extract the desired speech from a mixed input of microphones and video frames are constructed by a neural network.…”
Section: Related Work (mentioning)
confidence: 99%
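To make the quoted description above concrete, here is a minimal, illustrative PyTorch sketch of an audio-visual mask estimator: a network that fuses multi-channel audio features with per-frame visual (lip) features and predicts a time-frequency mask acting as a separation filter for the target speaker. The class name, layer sizes and the simple concatenation-based fusion are assumptions made for this sketch, not the architecture used by Li et al.

```python
# Illustrative sketch (assumed architecture, not the one from Li et al.):
# fuse multi-channel audio features with visual features and predict a
# time-frequency mask used as a separation filter for the target speaker.
import torch
import torch.nn as nn

class AudioVisualMaskNet(nn.Module):
    def __init__(self, n_channels=4, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        # Encode the concatenated log-magnitude spectra of all microphones.
        self.audio_rnn = nn.LSTM(n_channels * n_freq, hidden, batch_first=True)
        # Project per-frame visual embeddings (e.g. from a lip-reading CNN).
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Predict one real-valued mask value per time-frequency bin.
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, frames, n_channels * n_freq)
        # visual_feats: (batch, frames, visual_dim), time-aligned to the audio frames
        a, _ = self.audio_rnn(audio_feats)
        v = self.visual_proj(visual_feats)
        fused = torch.cat([a, v], dim=-1)   # simple concatenation fusion
        return self.mask_head(fused)        # (batch, frames, n_freq)

# Toy usage with random tensors, only to show the expected shapes.
net = AudioVisualMaskNet()
mask = net(torch.randn(2, 100, 4 * 257), torch.randn(2, 100, 512))
print(mask.shape)  # torch.Size([2, 100, 257])
```

The predicted mask can then drive a downstream beamformer (e.g. the mask-based MVDR referenced later on this page) rather than being applied directly to a single channel.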
“…Li et al [22] Estimates target speech separation filters from multiple microphones and video frames. A multi-task framework addresses dereverberation and voice recognition tasks.…”
Section: Makishima et al. [21] (mentioning)
confidence: 99%
“…Li et al [14] Proposed an AV deep learning approach for multi-channel speech separation by jointly modeling audio-visual cues. It includes a neural network that estimates separation filters for target speech from multiple microphones and video frames, and a multi-task framework for dereverberation and speech recognition.…”
Section: Recent AVSS Work (mentioning)
confidence: 99%
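The multi-task framework mentioned in the quoted passages (joint dereverberation and speech recognition) can be sketched as a weighted combination of a signal-level loss on the enhanced spectrum and an ASR loss from the recognition back-end. The choice of L1 and CTC losses, the function name and the interpolation weight `alpha` below are assumptions for illustration, not the objective used in the cited work.

```python
# Hedged sketch of a multi-task objective combining dereverberation/enhancement
# supervision with an ASR loss; the specific losses and weighting are assumed.
import torch.nn.functional as F

def multitask_loss(enhanced_spec, clean_spec, asr_log_probs, targets,
                   input_lens, target_lens, alpha=0.5):
    # Signal-level objective: L1 distance between enhanced and clean spectra.
    derev_loss = F.l1_loss(enhanced_spec, clean_spec)
    # Recognition objective: CTC loss on the back-end's log-probabilities,
    # with asr_log_probs shaped (frames, batch, vocab) as expected by ctc_loss.
    asr_loss = F.ctc_loss(asr_log_probs, targets, input_lens, target_lens)
    return alpha * derev_loss + (1.0 - alpha) * asr_loss
```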
“…One advanced method for audio-visual source separation involves the use of deep learning techniques [10-12, 14, 22, 34, 37, 38].…”
Section: Introduction (mentioning)
confidence: 99%
“…In recent years, end-to-end DNN-based microphone array beamforming techniques represented by a) neural timefrequency (TF) masking approaches [7]; b) neural Filter and Sum methods [8,9]; and c) mask-based MVDR [10] and generalized eigenvalues (GEV) [11] approaches have been widely adopted. In addition, incorporating visual information into either multi speech separation front-ends alone [12], or further into speech recognition back-ends [13], can further improve the overall system performance.…”
Section: Introduction (mentioning)
confidence: 99%
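As a rough illustration of the mask-based MVDR approach referenced in the quoted passage ([10]), the NumPy sketch below forms mask-weighted spatial covariance matrices of speech and noise and derives an MVDR filter per frequency bin (the Souden-style formulation). The function name, array shapes and the reference-channel choice are assumptions made for this example.

```python
# Mask-based MVDR beamforming sketch (illustrative assumptions on shapes).
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, ref_ch=0, eps=1e-6):
    """stft: (channels, frames, freq) complex STFT of the array signals.
    speech_mask, noise_mask: (frames, freq) real-valued masks in [0, 1]."""
    C, T, F = stft.shape
    out = np.zeros((T, F), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                                   # (C, T)
        # Mask-weighted spatial covariance matrices of speech and noise.
        phi_s = (speech_mask[:, f] * X) @ X.conj().T / (speech_mask[:, f].sum() + eps)
        phi_n = (noise_mask[:, f] * X) @ X.conj().T / (noise_mask[:, f].sum() + eps)
        phi_n += eps * np.eye(C)                            # regularize the inversion
        # MVDR filter: w = (Phi_n^-1 Phi_s) u_ref / trace(Phi_n^-1 Phi_s)
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref_ch] / (np.trace(num) + eps)
        out[:, f] = w.conj() @ X                            # enhanced frames at bin f
    return out

# Toy usage with random data, only to confirm shapes.
stft = np.random.randn(4, 50, 129) + 1j * np.random.randn(4, 50, 129)
mask = np.random.rand(50, 129)
enhanced = mask_based_mvdr(stft, mask, 1.0 - mask)
print(enhanced.shape)  # (50, 129)
```

In neural TF-masking pipelines of this kind, the masks would come from a network such as the audio-visual estimator sketched earlier rather than from random data.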