ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414160
A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

Abstract: Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it was studied in isolation, under the assumption that the video of a single speaking face matches the audio, while selecting the active speaker at inference time when multiple people are on screen was set aside as a separate problem. As an alternative, recent work has proposed addressing the two problems simultaneously with an attention mechanism, baking the speaker selection probl…

Cited by 8 publications (12 citation statements)
References 17 publications
“…We present a multi-task learning (MTL) [12] setup for a model that can simultaneously perform audio-visual ASR and active speaker detection, improving previous work on multiperson audio-visual ASR. We show that combining the two tasks is enough to significantly improve the performance of the model in the ASD task relative to the baseline in [10,11], while not degrading the ASR performance of the same model trained exclusively for ASR.…”
Section: Introduction
confidence: 93%
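The multi-task learning setup described in this excerpt trains one shared model on both audio-visual ASR and active speaker detection (ASD). As a minimal illustrative sketch, the two losses can be combined into a single training objective via a weighted sum; the weighting scheme and function names below are assumptions, since the excerpt does not specify how the losses are combined.

```python
def multitask_loss(asr_loss: float, asd_loss: float, asd_weight: float = 0.1) -> float:
    """Combine the ASR loss and the ASD loss into one training objective.

    asd_weight is a hypothetical trade-off hyperparameter: it scales the
    auxiliary ASD term so that it improves speaker selection without
    degrading ASR performance.
    """
    return asr_loss + asd_weight * asd_loss

# usage sketch: per-batch losses from the two task heads
total = multitask_loss(asr_loss=2.0, asd_loss=1.0)  # -> 2.1
```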
“…The exact parameters of the ConvNet can be found in Table 1. This is an important deviation from [11], where blocks of 3D convolutions were used instead. (2+1)D convolutions not only yield better performance, but are also less TPU memory intensive, allowing training with larger batch sizes, which has been shown to be particularly important for obtaining lower word error rates.…”
Section: 1
confidence: 99%