“…However, the experimental work in the recent audio-visual speech separation studies that we are aware of, including those presented in [15,16,17], has been performed using simulated synthetic mixtures of single-talker utterances. The audio-visual multi-talker ASR experiments described in [12] are based on simulated 2-talker mixtures created from randomly selected single-talker utterances in the LRS2 corpus. The A/V multi-talker experiments described in Section 4 are performed using the simulated A/V overlapping speech training and test sets described in Section 3.1.…”
Section: Experimental Study (mentioning)
confidence: 99%
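The excerpt above refers to overlapping-speech data built by mixing randomly selected single-talker utterances. As a generic illustration only, the sketch below shows one common way such a 2-talker mixture can be simulated; the actual overlap ratios, gain normalization, and sampling scheme used in the cited datasets are not specified here, and the function name is hypothetical.

```python
# Generic sketch of simulating a 2-talker overlapped mixture from two
# single-talker utterances; mixing details (overlap ratio, gains) are assumed.
import numpy as np


def mix_two_talkers(utt_a: np.ndarray, utt_b: np.ndarray,
                    offset_sec: float, sr: int = 16000) -> np.ndarray:
    """Overlap utt_b onto utt_a starting at offset_sec (partial overlap)."""
    offset = int(offset_sec * sr)
    length = max(len(utt_a), offset + len(utt_b))
    mix = np.zeros(length, dtype=np.float32)
    mix[:len(utt_a)] += utt_a
    mix[offset:offset + len(utt_b)] += utt_b
    return mix


# Example: a 3 s and a 2 s "utterance", with the second starting at 1.5 s
a = np.random.randn(3 * 16000).astype(np.float32) * 0.1
b = np.random.randn(2 * 16000).astype(np.float32) * 0.1
mixture = mix_two_talkers(a, b, offset_sec=1.5)
```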
“…Further work has addressed the latency issues associated with multi-talker decoding [11]. An E2E A/V M-T approach has recently been applied to addressing the multi-speaker cocktail party effect [12]. There is also a large body of work on speech separation where the goal is to recover a target speech signal in the presence of background speech [13,14,6].…”
This paper presents a new approach to end-to-end audio-visual multi-talker speech recognition. The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded text to one of multiple visible faces. This resolves the label ambiguity issue associated with most multi-talker modeling approaches, which can decode multiple label strings but cannot assign those strings to the correct speakers. VCAM is implemented as a transformer-transducer-based end-to-end model and evaluated on a two-speaker audio-visual overlapping speech dataset created from YouTube videos. The paper shows that the VCAM model improves performance relative to previously reported audio-only and audio-visual multi-talker ASR systems.
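The core VCAM idea of attending over per-face visual embeddings so that decoded text can be assigned to a visible face can be illustrated with a small cross-attention sketch. This is a minimal sketch under assumed shapes and module names (VisualContextAttention, dec_states, face_embs are all hypothetical), not the transformer-transducer architecture reported in the paper.

```python
# Minimal sketch (not the authors' implementation): cross-attend decoder states
# over per-face visual embeddings; the attention weights act as a soft
# token-to-face assignment, resolving the speaker label ambiguity.
import torch
import torch.nn as nn


class VisualContextAttention(nn.Module):
    def __init__(self, dec_dim: int, vis_dim: int, att_dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dec_dim, att_dim)   # project decoder (label) states
        self.key = nn.Linear(vis_dim, att_dim)     # project per-face visual embeddings
        self.value = nn.Linear(vis_dim, att_dim)

    def forward(self, dec_states, face_embs):
        # dec_states: (B, U, dec_dim)  states for U decoded tokens
        # face_embs:  (B, F, vis_dim)  one embedding per visible face
        q = self.query(dec_states)                       # (B, U, att_dim)
        k = self.key(face_embs)                          # (B, F, att_dim)
        v = self.value(face_embs)                        # (B, F, att_dim)
        scores = torch.matmul(q, k.transpose(1, 2))      # (B, U, F)
        scores = scores / q.size(-1) ** 0.5
        weights = scores.softmax(dim=-1)                 # soft token-to-face assignment
        context = torch.matmul(weights, v)               # (B, U, att_dim) visual context
        return context, weights


# Usage: argmax over the face axis assigns each decoded token to a speaker.
vca = VisualContextAttention(dec_dim=512, vis_dim=256)
dec_states = torch.randn(2, 10, 512)    # batch of 2 clips, 10 decoded tokens
face_embs = torch.randn(2, 2, 256)      # two visible faces per clip
context, weights = vca(dec_states, face_embs)
token_to_face = weights.argmax(dim=-1)  # (2, 10) face index per token
```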
“…Two types of BC loss are investigated as follows. For example, an SOT reference consisting of $N$ sentences, with the speaker change token removed, can be written as $\{y_1^1, \ldots, y_{L_1}^1, \ldots, y_1^N, \ldots, y_{L_N}^N\}$, where $y_k^n$ denotes the $k$-th token of the $n$-th utterance and $L_n$ is the number of tokens in the $n$-th utterance.…”
Section: Boundary Constraint Loss (mentioning)
confidence: 99%
“…Recently, there has been a lot of exploration in the field of multi-party meetings scenarios [1,2,3,4,5]. Progress has also been advanced with several challenges [6,7,8,9,10,11] and datasets [12,13,14,15,16] specifically focusing on this field.…”
The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
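To make the SOT reference notation from the Boundary Constraint Loss excerpt concrete, the sketch below shows how per-speaker utterances could be serialized with a speaker-change token and where the utterance boundaries fall once that token is removed. The "<sc>" string and the helper names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch, assuming an SOT-style reference built by concatenating
# N utterances in order; "<sc>" and the function names are illustrative only.
from typing import List


def serialize_sot(utterances: List[List[str]], sc_token: str = "<sc>") -> List[str]:
    """Serialize N utterances into one SOT reference separated by a change token."""
    ref: List[str] = []
    for n, utt in enumerate(utterances):
        if n > 0:
            ref.append(sc_token)  # special token marking a speaker change
        ref.extend(utt)
    return ref


def boundary_indices(utterances: List[List[str]]) -> List[int]:
    """Token positions where a new utterance starts once the change token is removed.

    These are the boundaries L_1, L_1 + L_2, ... in the concatenated reference
    y_1^1 ... y_{L_N}^N that a boundary constraint could target.
    """
    idx, pos = [], 0
    for utt in utterances[:-1]:
        pos += len(utt)
        idx.append(pos)
    return idx


# Example with N = 2 utterances
utts = [["hi", "there"], ["how", "are", "you"]]
print(serialize_sot(utts))     # ['hi', 'there', '<sc>', 'how', 'are', 'you']
print(boundary_indices(utts))  # [2] -> the third token begins utterance 2
```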
“…(AV) multi-modality has been widely applied in the speech community [6][7][8][9][10][11][12]. The visual information obtained by analyzing lip shapes or facial expressions is more robust than the audio information in complex acoustic scenarios.…”
This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first apply speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to process the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video stream, the provided region of interest (ROI) is used to obtain visual representations. A multi-layer CNN then learns the audio and visual representations, which are fed into our two-branch attention-based network for fusion, such as a transformer or a conformer. The focal loss is used to fine-tune the model and improves performance significantly. Finally, multiple trained models are integrated by voting to achieve our final score of 0.091.
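Since the MISP system above is fine-tuned with the focal loss, the following short sketch of the standard binary focal loss (Lin et al., 2017) may help; the alpha and gamma values are common defaults assumed here, not taken from the paper.

```python
# Minimal sketch of the standard focal loss for binary wake-word
# classification; alpha/gamma are illustrative defaults, not the paper's values.
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples via the (1 - p_t)^gamma factor."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


# Usage: logits from the wake-word head, targets in {0, 1}
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
loss = focal_loss(logits, targets)
```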