ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414160
A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

Abstract: Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it was studied in isolation, under the assumption that the video of a single speaking face matches the audio, while selecting the active speaker at inference time when multiple people are on screen was set aside as a separate problem. As an alternative, recent work has proposed addressing the two problems simultaneously with an attention mechanism, baking the speaker selection probl…

Cited by 8 publications (12 citation statements)
References 17 publications
“…We present a multi-task learning (MTL) [12] setup for a model that can simultaneously perform audio-visual ASR and active speaker detection, improving previous work on multiperson audio-visual ASR. We show that combining the two tasks is enough to significantly improve the performance of the model in the ASD task relative to the baseline in [10,11], while not degrading the ASR performance of the same model trained exclusively for ASR.…”
Section: Introduction
confidence: 93%
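The multi-task learning setup described in this excerpt trains one shared model on both audio-visual ASR and active speaker detection (ASD). As a minimal illustrative sketch, the two losses can be combined into a single training objective via a weighted sum; the weighting scheme and function names below are assumptions, since the excerpt does not specify how the losses are combined.

```python
def multitask_loss(asr_loss: float, asd_loss: float, asd_weight: float = 0.1) -> float:
    """Combine the ASR loss and the ASD loss into one training objective.

    asd_weight is a hypothetical trade-off hyperparameter: it scales the
    auxiliary ASD term so that it improves speaker selection without
    degrading ASR performance.
    """
    return asr_loss + asd_weight * asd_loss

# usage sketch: per-batch losses from the two task heads
total = multitask_loss(asr_loss=2.0, asd_loss=1.0)  # -> 2.1
```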
“…The exact parameters of the ConvNet can be found in Table 1. This is an important deviation from [11], where blocks of 3D convolutions were used instead. (2+1)D convolutions not only yield better performance, but are also less TPU memory intensive, allowing training with larger batch sizes, which has been shown to be particularly important for obtaining lower word error rates.…”
Section: 1
confidence: 99%