ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682750

Multimodal Grounding for Sequence-to-sequence Speech Recognition

Abstract: Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation takes place generally provides semantic and/or acoustic context that helps us to resolve ambiguities or to recall named entities. Motivated by this, there have been many works studying the integration of visual information into the speech recognition pipeline. Specifically, in our previous work, we propose a multistep visual adaptive training approach which improves the accuracy…
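
To make the grounding idea concrete, the following is a minimal PyTorch sketch of one way a pooled visual feature can condition a sequence-to-sequence recognizer: the visual vector initializes the decoder state. This is an illustration under assumed interfaces, not the paper's multistep visual adaptive training; ToySeq2SeqASR and all dimensions are hypothetical, and attention is replaced by a mean-pooled encoder context for brevity.

```python
# Hypothetical sketch: grounding a seq2seq ASR decoder on a pooled
# visual feature. Names and dimensions are illustrative only.
import torch
import torch.nn as nn

class ToySeq2SeqASR(nn.Module):
    def __init__(self, n_mels=80, hidden_dim=256, vocab_size=1000, visual_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden_dim, batch_first=True)
        # Project the visual feature into the decoder's initial hidden state,
        # so decoding starts from a visually grounded representation.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio, tokens, visual_feats):
        # audio: (B, T, n_mels); tokens: (B, U); visual_feats: (B, visual_dim)
        enc_out, _ = self.encoder(audio)
        ctx = enc_out.mean(dim=1, keepdim=True)  # crude stand-in for attention
        h0 = torch.tanh(self.visual_proj(visual_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(tokens), (h0, c0))
        return self.out(dec_out + ctx)           # (B, U, vocab_size) logits

# Smoke test with random tensors.
model = ToySeq2SeqASR()
logits = model(torch.randn(4, 120, 80),
               torch.randint(0, 1000, (4, 12)),
               torch.randn(4, 2048))
```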

Cited by 18 publications (22 citation statements). References 25 publications (47 reference statements).
“…Our AV-ASR system yields gains > 3% over audio-only models for both subword and multiresolution predictions. Finally, we observe that the Listen, Attend and Spell-based architecture of (Caglayan et al., 2019) is consistent across models. It is important to note that our models are trained end-to-end with both audio and video features.…”
Section: Results (mentioning, confidence: 54%)
“…For example, a basketball court is more likely to include the term "lay-up", whereas an office is more likely to include the term "layoff". This approach can boost ASR performance while the requirements for the video input are kept relaxed (Caglayan et al., 2019; Hsu et al., 2019). Additionally, we consider a multiresolution loss that takes into account transcriptions at the character and subword level.…”
Section: Introduction (mentioning, confidence: 99%)
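
The multiresolution loss mentioned in the quote above can be pictured as a weighted combination of losses computed at two label granularities. Below is a minimal sketch assuming per-step decoder logits at both levels; the function name multiresolution_loss and the mixing weight alpha are hypothetical, not the cited paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multiresolution_loss(char_logits, char_targets,
                         subword_logits, subword_targets, alpha=0.5):
    # char_logits: (B, T_c, V_c), char_targets: (B, T_c) integer ids;
    # subword_logits: (B, T_s, V_s), subword_targets: (B, T_s).
    # F.cross_entropy expects (B, V, T), hence the transpose.
    char_loss = F.cross_entropy(char_logits.transpose(1, 2), char_targets)
    subword_loss = F.cross_entropy(subword_logits.transpose(1, 2), subword_targets)
    # alpha is an assumed mixing weight between the two resolutions.
    return alpha * char_loss + (1 - alpha) * subword_loss
```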
“…However, previous work has suggested that the visual modality is only being used as a regularization signal [5]. To investigate this behavior, we also measure the models' ability to recover masked words, ensuring that the observed improvements come from the semantics of the visual context.…”
Section: Results (mentioning, confidence: 99%)
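
The masked-word probe described in this quote can be approximated with a small evaluation harness: mask words in the reference transcript, decode with the visual context, and count how many masked words the hypothesis recovers. This is a rough sketch; mask_words, recovery_rate, and the 15% mask rate are assumptions for illustration.

```python
import random

def mask_words(transcript, mask_rate=0.15, mask_token="<MASK>", seed=0):
    # Replace a random subset of words with a mask token; remember positions.
    rng = random.Random(seed)
    words = transcript.split()
    targets = []
    for i in range(len(words)):
        if rng.random() < mask_rate:
            targets.append((i, words[i]))
            words[i] = mask_token
    return " ".join(words), targets

def recovery_rate(hypothesis, targets):
    # Fraction of masked reference words that the model's hypothesis
    # reproduces at the same word position.
    hyp_words = hypothesis.split()
    hits = sum(1 for i, w in targets
               if i < len(hyp_words) and hyp_words[i] == w)
    return hits / max(1, len(targets))
```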
“…Previous works show improvements in the domains of visual question answering [1], multimodal machine translation [2], visual dialog [3], and automatic speech recognition (ASR) [4]. Although there are several visual adaptation approaches for ASR [4, 5, 6, 7, 8, 9], it is still unclear how the models leverage the visual modality.…”
Section: Introduction (mentioning, confidence: 99%)