ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682750

Multimodal Grounding for Sequence-to-sequence Speech Recognition

Abstract: Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation takes place generally provides semantic and/or acoustic context that helps us to resolve ambiguities or to recall named entities. Motivated by this, there have been many works studying the integration of visual information into the speech recognition pipeline. Specifically, in our previous work, we propose a multistep visual adaptive training approach which improves the accuracy…
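
To make the grounding idea concrete, the following is a minimal PyTorch sketch of one way a pooled visual feature can condition a sequence-to-sequence recognizer: the visual vector initializes the decoder state. This is an illustration under assumed interfaces, not the paper's multistep visual adaptive training; ToySeq2SeqASR and all dimensions are hypothetical, and attention is replaced by a mean-pooled encoder context for brevity.

```python
# Hypothetical sketch: grounding a seq2seq ASR decoder on a pooled
# visual feature. Names and dimensions are illustrative only.
import torch
import torch.nn as nn

class ToySeq2SeqASR(nn.Module):
    def __init__(self, n_mels=80, hidden_dim=256, vocab_size=1000, visual_dim=2048):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden_dim, batch_first=True)
        # Project the visual feature into the decoder's initial hidden state,
        # so decoding starts from a visually grounded representation.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio, tokens, visual_feats):
        # audio: (B, T, n_mels); tokens: (B, U); visual_feats: (B, visual_dim)
        enc_out, _ = self.encoder(audio)
        ctx = enc_out.mean(dim=1, keepdim=True)  # crude stand-in for attention
        h0 = torch.tanh(self.visual_proj(visual_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(tokens), (h0, c0))
        return self.out(dec_out + ctx)           # (B, U, vocab_size) logits

# Smoke test with random tensors.
model = ToySeq2SeqASR()
logits = model(torch.randn(4, 120, 80),
               torch.randint(0, 1000, (4, 12)),
               torch.randn(4, 2048))
```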

Cited by 18 publications (22 citation statements). References 25 publications (47 reference statements).
“…Our AV-ASR system yields gains > 3% over audio-only models for both subword and multiresolution predictions. Finally, we observe that the Listen, Attend and Spell-based architecture of (Caglayan et al., 2019) is consistent across models. It is important to note that our models are trained end-to-end with both audio and video features.…”
Section: Results (mentioning, confidence: 54%)
“…For example, a basketball court is more likely to include the term "lay-up", whereas an office is more likely to include the term "layoff". This approach can boost ASR performance while the requirements for the video input are kept relaxed (Caglayan et al., 2019; Hsu et al., 2019). Additionally, we consider a multiresolution loss that takes into account transcriptions at the character and subword level.…”
Section: Introduction (mentioning, confidence: 99%)
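
The multiresolution loss mentioned in the quote above can be pictured as a weighted combination of losses computed at two label granularities. Below is a minimal sketch assuming per-step decoder logits at both levels; the function name multiresolution_loss and the mixing weight alpha are hypothetical, not the cited paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multiresolution_loss(char_logits, char_targets,
                         subword_logits, subword_targets, alpha=0.5):
    # char_logits: (B, T_c, V_c), char_targets: (B, T_c) integer ids;
    # subword_logits: (B, T_s, V_s), subword_targets: (B, T_s).
    # F.cross_entropy expects (B, V, T), hence the transpose.
    char_loss = F.cross_entropy(char_logits.transpose(1, 2), char_targets)
    subword_loss = F.cross_entropy(subword_logits.transpose(1, 2), subword_targets)
    # alpha is an assumed mixing weight between the two resolutions.
    return alpha * char_loss + (1 - alpha) * subword_loss
```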
“…However, previous work has suggested that the visual modality is only being used as a regularization signal [5]. To investigate this behavior, we also measure the models' ability to recover masked words, ensuring that the observed improvements come from the semantics of the visual context.…”
Section: Results (mentioning, confidence: 99%)
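
The masked-word probe described in this quote can be approximated with a small evaluation harness: mask words in the reference transcript, decode with the visual context, and count how many masked words the hypothesis recovers. This is a rough sketch; mask_words, recovery_rate, and the 15% mask rate are assumptions for illustration.

```python
import random

def mask_words(transcript, mask_rate=0.15, mask_token="<MASK>", seed=0):
    # Replace a random subset of words with a mask token; remember positions.
    rng = random.Random(seed)
    words = transcript.split()
    targets = []
    for i in range(len(words)):
        if rng.random() < mask_rate:
            targets.append((i, words[i]))
            words[i] = mask_token
    return " ".join(words), targets

def recovery_rate(hypothesis, targets):
    # Fraction of masked reference words that the model's hypothesis
    # reproduces at the same word position.
    hyp_words = hypothesis.split()
    hits = sum(1 for i, w in targets
               if i < len(hyp_words) and hyp_words[i] == w)
    return hits / max(1, len(targets))
```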
“…Previous works show improvements in the domains of visual question answering [1], multimodal machine translation [2], visual dialog [3], and automatic speech recognition (ASR) [4]. Although there are several visual adaptation approaches for ASR [4, 5, 6, 7, 8, 9], it is still unclear how the models leverage the visual modality.…”
Section: Introduction (mentioning, confidence: 99%)