Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech

Yu, Jianwei; Zhang, Shixiong; Liu, Shansong; Hu, Shoukang; Liu, Xunying; Meng, Helen; Yu, Dong

doi:10.1109/taslp.2021.3078883

Cited by 20 publications

(24 citation statements)

References 62 publications


“…Following the previous researches on audio-visual multi-channel speech separation [35,36], temporal convolutional networks (TCNs) [39] are used in the speech separation module. As shown in the left corner of Figure 1, the log-power spectrum (LPS) features of the reference microphone channel were initially concatenated with the IPDs and AF features computed above before being fed into the TCN based audio block to compute the audio embedding.…”

Section: Audio and Visual Modality Inputs (mentioning)

confidence: 99%
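The feature concatenation described in this citation statement can be sketched for a single time-frequency bin as follows. This is a minimal illustration, not the cited authors' implementation: the function names, the `eps` floor, and the choice of microphone pair are assumptions made here, and the angle feature (AF) is omitted because the statement does not specify how it is computed.

```python
import cmath
import math

def lps(stft_bin: complex, eps: float = 1e-8) -> float:
    """Log-power spectrum of one complex STFT bin (eps avoids log(0))."""
    return math.log(abs(stft_bin) ** 2 + eps)

def ipd(bin_a: complex, bin_b: complex) -> float:
    """Inter-channel phase difference between two microphones, in [-pi, pi]."""
    return cmath.phase(bin_a * bin_b.conjugate())

def audio_features(channels, ref=0, pairs=((0, 1),)):
    """Concatenate the reference-channel LPS with the IPDs of the chosen pairs.

    `channels` holds one complex STFT value per microphone for the same
    time-frequency bin; `ref` and `pairs` are hypothetical defaults.
    """
    feats = [lps(channels[ref])]
    feats += [ipd(channels[i], channels[j]) for i, j in pairs]
    return feats

# Two microphones observing the same bin with a 0.3 rad phase offset.
example = audio_features([cmath.rect(2.0, 0.5), cmath.rect(1.0, 0.2)])
```

In a full system the same concatenation would be applied across all frequency bins and frames before the result is passed to the audio embedding network.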

“…Before fusing the visual features with the audio embedding to improve the estimation, the lip features are firstly fed into the visual block containing 5 TCNs (Figure 1, bottom left in grey) to compute the visual embedding. Audio-visual modality fusion: In this work, a factorised attention-based modality fusion method consistent with our previous work [35] was utilised in the separation module. This attention-based fusion block (Figure 1, left middle in dark brown) combines the audio and visual embeddings from the outputs of the audio and visual TCN embedding blocks respectively.…”

Section: Audio and Visual Modality Inputs (mentioning)

confidence: 99%

“…To this end, a tighter integration between system components, for example, the separation and dereverberation modules, can be achieved via joint fine-tuning on the dereverberation MSE cost alone, or an interpolated SI-SNR and MSE error loss function. Their further integration of the audio only or audio-visual CLDNN based back-end recognition component was performed by fine-tuning using the LF-MMI sequence training criterion [35] given the enhanced outputs. Table 1.…”

Section: Integration Of Enhancement Front-end and Recognition Back-end (mentioning)

confidence: 99%
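The "interpolated SI-SNR and MSE error loss" mentioned in this statement can be illustrated with a short sketch. It uses the standard scale-invariant SNR definition; the interpolation `weight`, the `eps` floor, and the simplification of applying both terms to the same pair of signals are assumptions made here, not details taken from the citing paper.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    n = len(target)
    # Remove the means so constant offsets do not affect the measure.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target; the residual counts as noise.
    scale = sum(e * t for e, t in zip(est, tgt)) / (sum(t * t for t in tgt) + eps)
    projection = [scale * t for t in tgt]
    noise = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))

def mse(estimate, target):
    """Mean squared error between two equal-length signals."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)

def interpolated_loss(estimate, target, weight=0.5):
    """Weighted sum of negative SI-SNR and MSE (weight is a hypothetical knob)."""
    return weight * (-si_snr(estimate, target)) + (1.0 - weight) * mse(estimate, target)
```

Setting `weight` to 0 recovers the MSE-only fine-tuning option named in the statement, while values between 0 and 1 trade off the two criteria.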

“…Simulated mixed speech: The multi-channel overlapped-noisy-reverberant speech is simulated using the LRS2 dataset. A 15-channel symmetric linear array described in [35] is used in the simulation process. 843 point-source noises and 40000 Room Impulse Responses (RIRs) generated by the image method [41] in 400 different simulated rooms were used in our experiment.…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…The main contributions of this paper are summarized below: First, to the best of our knowledge, this paper presents the first use of a complete audio-visual multi-channel speech separation, dereverberation and recognition system architecture featuring a full incorporation of visual information into all three stages. In contrast, prior researches incorporate video modality in either only the speech enhancement front-end [25,26,28], recognition back-end [30][31][32][33], or both multi-channel speech separation and recognition stages [35,36] but excluding the dereverberation component. Second, a more complete experimental validation of the advantage of audio-visual versus audio only dereverberation approaches of multiple forms (DNN-WPE, spectral mapping) is presented, as previous research [37] only considered the spectral mapping method.…”

Section: Introduction (mentioning)

confidence: 99%


Audio-visual multi-channel speech separation, dereverberation and recognition

Li, Yu, Deng et al. 2022

Preprint

Self Cite


Despite the rapid advance of automatic speech recognition (ASR) technologies, accurate recognition of cocktail party speech characterised by the interference from overlapping speakers, background noise and room reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, audio-visual speech enhancement techniques have been developed, although predominantly targeting overlapping speech separation and recognition tasks. In this paper, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all three stages of the system is proposed. The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches based on DNN-WPE and spectral mapping respectively. The learning cost function mismatch between the separation and dereverberation models and their integration with the back-end recognition system is minimised using fine-tuning on the MSE and LF-MMI criteria. Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline audio-visual multi-channel speech separation and recognition system containing no dereverberation module by a statistically significant word error rate (WER) reduction of 2.06% absolute (8.77% relative).
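The absolute and relative WER reductions quoted in the abstract are linked by simple arithmetic, which the sketch below makes explicit. The implied baseline and system WERs are derived here from the two rounded figures and are not stated in the abstract itself, so they should be read as approximations; the function name is hypothetical.

```python
def implied_wers(absolute_reduction, relative_reduction):
    """Recover baseline and improved WER (in %) from the two reported reductions.

    relative_reduction = absolute_reduction / baseline, so
    baseline = absolute_reduction / relative_reduction.
    """
    baseline = absolute_reduction / relative_reduction
    improved = baseline - absolute_reduction
    return baseline, improved

# 2.06 points absolute, 8.77% relative (expressed as a fraction).
baseline, improved = implied_wers(2.06, 0.0877)
print(f"implied baseline WER: {baseline:.2f}%, implied system WER: {improved:.2f}%")
```

Under these assumptions the baseline works out to roughly 23.5% WER and the proposed system to roughly 21.4%.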

