Interspeech 2021
DOI: 10.21437/interspeech.2021-2128
Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party

Cited by 8 publications (5 citation statements)
References 22 publications

“…However, the experimental work in recent audio-visual speech separation studies that we are aware of, including the work presented in [15,16,17], has been performed using simulated synthetic mixtures of single-talker utterances. Audio-visual multi-talker ASR experiments described in [12] are based on simulated 2-talker mixtures created from randomly selected single-talker utterances in the LRS2 corpus. The A/V multi-talker experiments described in Section 4 are performed using the simulated A/V overlapping speech training and test sets described in Section 3.1.…”
Section: Experimental Study
confidence: 99%
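
The 2-talker mixture simulation this excerpt refers to can be illustrated with a short sketch. The following is a minimal, hypothetical Python example of overlapping two single-talker waveforms at a chosen energy ratio; the names mix_two_talkers and target_snr_db are illustrative assumptions, not taken from the cited papers, which draw their utterance pairs from corpora such as LRS2.

```python
# Minimal sketch (not from the cited papers) of simulating a 2-talker
# mixture from two single-talker utterances.
import numpy as np

def mix_two_talkers(utt_a: np.ndarray, utt_b: np.ndarray,
                    target_snr_db: float = 0.0) -> np.ndarray:
    """Overlap two single-talker waveforms at a chosen energy ratio."""
    # Zero-pad the shorter utterance so both waveforms align in length.
    n = max(len(utt_a), len(utt_b))
    a = np.pad(utt_a, (0, n - len(utt_a)))
    b = np.pad(utt_b, (0, n - len(utt_b)))
    # Scale the interfering talker so the target-to-interference
    # energy ratio matches target_snr_db (in dB).
    energy_a = float(np.sum(a ** 2)) + 1e-10
    energy_b = float(np.sum(b ** 2)) + 1e-10
    scale = np.sqrt(energy_a / (energy_b * 10.0 ** (target_snr_db / 10.0)))
    return a + scale * b

# Example: overlap two randomly generated "utterances" at 0 dB.
rng = np.random.default_rng(0)
mixture = mix_two_talkers(rng.standard_normal(16000),
                          rng.standard_normal(12000))
```
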
“…Further work has addressed the latency issues associated with multi-talker decoding [11]. An E2E A/V multi-talker approach has recently been applied to address the multi-speaker cocktail party problem [12]. There is also a large body of work on speech separation, where the goal is to recover a target speech signal in the presence of background speech [13,14,6].…”
Section: Introduction
confidence: 99%
“…Two types of BC loss are investigated as follows. For example, a SOT reference consisting of N sentences, with the speaker change tokens removed, can be written as $\{y^1_1, \dots, y^1_{L_1}, \dots, y^N_1, \dots, y^N_{L_N}\}$, where $y^n_k$ denotes the k-th token of the n-th utterance and $L_n$ denotes the number of tokens in the n-th utterance.…”
Section: Boundary Constraint Loss
confidence: 99%
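
The serialized-output-training (SOT) reference construction described in this excerpt amounts to concatenating per-utterance token sequences in order. Below is a minimal Python sketch under that reading; the token string "<sc>" for the speaker change token is an assumption for illustration, not a detail confirmed by the cited paper.

```python
# Minimal sketch of the SOT-style reference described above: the token
# sequences of N utterances are concatenated in order, and the speaker
# change token (assumed here to be "<sc>") is dropped, yielding
# {y^1_1, ..., y^1_{L_1}, ..., y^N_1, ..., y^N_{L_N}}.
from typing import List

def sot_reference(utterances: List[List[str]],
                  speaker_change_token: str = "<sc>") -> List[str]:
    """Concatenate per-utterance token lists, dropping change tokens."""
    reference: List[str] = []
    for tokens in utterances:
        reference.extend(t for t in tokens if t != speaker_change_token)
    return reference

# Example: N = 2 utterances with L_1 = 3 and L_2 = 2 tokens.
ref = sot_reference([["hello", "there", "friend"], ["good", "morning"]])
assert ref == ["hello", "there", "friend", "good", "morning"]
```
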
“…Recently, there has been a great deal of exploration of multi-party meeting scenarios [1,2,3,4,5]. Progress has also been driven by several challenges [6,7,8,9,10,11] and datasets [12,13,14,15,16] focusing specifically on this field.…”
Section: Introduction
confidence: 99%
“…Audio-visual (AV) multi-modal approaches have been applied widely in the speech community [6][7][8][9][10][11][12]. Visual information obtained by analyzing lip shapes or facial expressions is more robust than audio information in complex acoustic scenarios.…”
Section: Introduction
confidence: 99%