“…However, the experimental work in the recent audio-visual speech separation studies that we are aware of, including those presented in [15,16,17], has been performed using simulated synthetic mixtures of single-talker utterances. The audio-visual multi-talker ASR experiments described in [12] are based on simulated 2-talker mixtures created from randomly selected single-talker utterances in the LRS2 corpus. The A/V multi-talker experiments described in Section 4 are performed using the simulated A/V overlapping speech training and test sets described in Section 3.1.…”
Section: Experimental Study (mentioning)
confidence: 99%
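The excerpt above refers to overlapping-speech data built by mixing randomly selected single-talker utterances. As a generic illustration only, the sketch below shows one common way such a 2-talker mixture can be simulated; the actual overlap ratios, gain normalization, and sampling scheme used in the cited datasets are not specified here, and the function name is hypothetical.

```python
# Generic sketch of simulating a 2-talker overlapped mixture from two
# single-talker utterances; mixing details (overlap ratio, gains) are assumed.
import numpy as np


def mix_two_talkers(utt_a: np.ndarray, utt_b: np.ndarray,
                    offset_sec: float, sr: int = 16000) -> np.ndarray:
    """Overlap utt_b onto utt_a starting at offset_sec (partial overlap)."""
    offset = int(offset_sec * sr)
    length = max(len(utt_a), offset + len(utt_b))
    mix = np.zeros(length, dtype=np.float32)
    mix[:len(utt_a)] += utt_a
    mix[offset:offset + len(utt_b)] += utt_b
    return mix


# Example: a 3 s and a 2 s "utterance", with the second starting at 1.5 s
a = np.random.randn(3 * 16000).astype(np.float32) * 0.1
b = np.random.randn(2 * 16000).astype(np.float32) * 0.1
mixture = mix_two_talkers(a, b, offset_sec=1.5)
```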
“…Further work has addressed the latency issues associated with multi-talker decoding [11]. An E2E A/V M-T approach has recently been applied to addressing the multi-speaker cocktail party effect [12]. There is also a large body of work on speech separation where the goal is to recover a target speech signal in the presence of background speech [13,14,6].…”
This paper presents a new approach to end-to-end audio-visual multi-talker speech recognition. The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded text to one of multiple visible faces. This resolves the label ambiguity issue associated with most multi-talker modeling approaches, which can decode multiple label strings but cannot assign those strings to the correct speakers. VCAM is implemented as a transformer-transducer-based end-to-end model and evaluated on a two-speaker audio-visual overlapping speech dataset created from YouTube videos. The paper shows that the VCAM model improves performance relative to previously reported audio-only and audio-visual multi-talker ASR systems.
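The core VCAM idea of attending over per-face visual embeddings so that decoded text can be assigned to a visible face can be illustrated with a small cross-attention sketch. This is a minimal sketch under assumed shapes and module names (VisualContextAttention, dec_states, face_embs are all hypothetical), not the transformer-transducer architecture reported in the paper.

```python
# Minimal sketch (not the authors' implementation): cross-attend decoder states
# over per-face visual embeddings; the attention weights act as a soft
# token-to-face assignment, resolving the speaker label ambiguity.
import torch
import torch.nn as nn


class VisualContextAttention(nn.Module):
    def __init__(self, dec_dim: int, vis_dim: int, att_dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dec_dim, att_dim)   # project decoder (label) states
        self.key = nn.Linear(vis_dim, att_dim)     # project per-face visual embeddings
        self.value = nn.Linear(vis_dim, att_dim)

    def forward(self, dec_states, face_embs):
        # dec_states: (B, U, dec_dim)  states for U decoded tokens
        # face_embs:  (B, F, vis_dim)  one embedding per visible face
        q = self.query(dec_states)                       # (B, U, att_dim)
        k = self.key(face_embs)                          # (B, F, att_dim)
        v = self.value(face_embs)                        # (B, F, att_dim)
        scores = torch.matmul(q, k.transpose(1, 2))      # (B, U, F)
        scores = scores / q.size(-1) ** 0.5
        weights = scores.softmax(dim=-1)                 # soft token-to-face assignment
        context = torch.matmul(weights, v)               # (B, U, att_dim) visual context
        return context, weights


# Usage: argmax over the face axis assigns each decoded token to a speaker.
vca = VisualContextAttention(dec_dim=512, vis_dim=256)
dec_states = torch.randn(2, 10, 512)    # batch of 2 clips, 10 decoded tokens
face_embs = torch.randn(2, 2, 256)      # two visible faces per clip
context, weights = vca(dec_states, face_embs)
token_to_face = weights.argmax(dim=-1)  # (2, 10) face index per token
```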
“…Two types of BC loss are investigated as follows. For example, an SOT reference consisting of $N$ sentences, with the speaker change token removed, can be written as $\{y_1^1, \ldots, y_{L_1}^1, \ldots, y_1^N, \ldots, y_{L_N}^N\}$, where $y_k^n$ denotes the $k$-th token of the $n$-th utterance and $L_n$ is the number of tokens in the $n$-th utterance.…”
Section: Boundary Constraint Loss (mentioning)
confidence: 99%
“…Recently, there has been a lot of exploration in the field of multi-party meetings scenarios [1,2,3,4,5]. Progress has also been advanced with several challenges [6,7,8,9,10,11] and datasets [12,13,14,15,16] specifically focusing on this field.…”
The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
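To make the SOT reference notation from the Boundary Constraint Loss excerpt concrete, the sketch below shows how per-speaker utterances could be serialized with a speaker-change token and where the utterance boundaries fall once that token is removed. The "<sc>" string and the helper names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch, assuming an SOT-style reference built by concatenating
# N utterances in order; "<sc>" and the function names are illustrative only.
from typing import List


def serialize_sot(utterances: List[List[str]], sc_token: str = "<sc>") -> List[str]:
    """Serialize N utterances into one SOT reference separated by a change token."""
    ref: List[str] = []
    for n, utt in enumerate(utterances):
        if n > 0:
            ref.append(sc_token)  # special token marking a speaker change
        ref.extend(utt)
    return ref


def boundary_indices(utterances: List[List[str]]) -> List[int]:
    """Token positions where a new utterance starts once the change token is removed.

    These are the boundaries L_1, L_1 + L_2, ... in the concatenated reference
    y_1^1 ... y_{L_N}^N that a boundary constraint could target.
    """
    idx, pos = [], 0
    for utt in utterances[:-1]:
        pos += len(utt)
        idx.append(pos)
    return idx


# Example with N = 2 utterances
utts = [["hi", "there"], ["how", "are", "you"]]
print(serialize_sot(utts))     # ['hi', 'there', '<sc>', 'how', 'are', 'you']
print(boundary_indices(utts))  # [2] -> the third token begins utterance 2
```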
“…(AV) multi-modality has been widely applied in the speech community [6][7][8][9][10][11][12]. The visual information obtained by analyzing lip shapes or facial expressions is more robust than the audio information in complex acoustic scenarios.…”
This paper presents the details of our system designed for Task 1 of the Multimodal Information Based Speech Processing (MISP) Challenge 2021. The purpose of Task 1 is to leverage both audio and video information to improve the environmental robustness of far-field wake word spotting. In the proposed system, we first apply speech enhancement algorithms such as beamforming and weighted prediction error (WPE) to process the multi-microphone conversational audio. Second, several data augmentation techniques are applied to simulate a more realistic far-field scenario. For the video stream, the provided region of interest (ROI) is used to obtain visual representations. A multi-layer CNN then learns the audio and visual representations, which are fed into our two-branch attention-based network for fusion, such as a transformer or a conformer. The focal loss is used to fine-tune the model and improves performance significantly. Finally, multiple trained models are integrated by voting to achieve our final score of 0.091.
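Since the MISP system above is fine-tuned with the focal loss, the following short sketch of the standard binary focal loss (Lin et al., 2017) may help; the alpha and gamma values are common defaults assumed here, not taken from the paper.

```python
# Minimal sketch of the standard focal loss for binary wake-word
# classification; alpha/gamma are illustrative defaults, not the paper's values.
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: down-weights easy examples via the (1 - p_t)^gamma factor."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


# Usage: logits from the wake-word head, targets in {0, 1}
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
loss = focal_loss(logits, targets)
```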