2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462439
End-to-end Multimodal Speech Recognition

Abstract: Transcription or sub-titling of open-domain videos remains challenging for Automatic Speech Recognition (ASR) due to difficult acoustics, variable signal processing, and the essentially unrestricted domain of the data. In previous work, we have shown that the visual channel, specifically object and scene features, can help to adapt the acoustic model (AM) and language model (LM) of a recognizer, and we are now expanding this work to end-to-end approaches. In the case of a Connectionist Te…

Cited by 31 publications (30 citation statements). References 35 publications (54 reference statements).
“…when the video consistently provides object, action and scene level cues correlated with the speech content, as may be the case with instructional videos. Here, visual cues from the recording environment (indoor vs. outdoor) or the interaction between salient objects (people, instruments, vehicles, tools and equipment) can be exploited by the recognizer in various ways to learn a better acoustic and/or language model [3,4,5]. Figure 1 shows such an example where an ASR system without access to the visual modality can produce a homophonic utterance like eucalylie instead of the rarely occurring correct word ukulele.…”
Section: Introduction (mentioning)
confidence: 99%
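The excerpt above notes that visual cues "can be exploited by the recognizer in various ways" to favor visually grounded words such as ukulele over homophones. A minimal sketch of one such strategy, rescoring an ASR n-best list with a visual-context similarity score; the embedding function, feature vectors, and interpolation weight are illustrative assumptions, not the cited papers' method.

```python
# Hypothetical sketch: bias n-best rescoring toward hypotheses that match the
# visual context, so a rare but visually grounded word can beat a homophone.
import numpy as np

def rescore_with_visual_context(nbest, visual_feat, text_embed, lam=0.3):
    """nbest: list of (hypothesis, asr_log_score);
    visual_feat: pooled image/video feature vector;
    text_embed: assumed callable mapping a hypothesis string to a vector."""
    rescored = []
    for hyp, asr_score in nbest:
        h = text_embed(hyp)  # embed the hypothesis text
        sim = float(np.dot(h, visual_feat) /
                    (np.linalg.norm(h) * np.linalg.norm(visual_feat) + 1e-8))
        # Interpolate the ASR score with the visual-similarity score.
        rescored.append((hyp, (1 - lam) * asr_score + lam * sim))
    return max(rescored, key=lambda x: x[1])[0]
```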
“…In this paper, we first apply an adaptive training scheme [3,4,5] for sequence-to-sequence (S2S) speech recognition and then propose two novel multimodal grounding methods for S2S ASR inspired by previous work in image captioning [6] and multimodal neural machine translation (MMT) [7,8]. We compare both approaches through the use of visual features extracted from pre-trained models trained for object, scene and action recognition tasks [9,10,11].…”
Section: Introduction (mentioning)
confidence: 99%
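As a rough illustration of grounding an S2S decoder on pre-extracted visual features, in the spirit of the image-captioning and multimodal-MT conditioning the excerpt mentions, the sketch below projects a pooled visual vector into the decoder's hidden space and uses it as the initial state. Layer sizes and module names are assumptions, not the paper's exact architecture.

```python
# Sketch of one plausible multimodal grounding strategy for S2S ASR:
# initialize the decoder state from a projected visual feature.
import torch
import torch.nn as nn

class VisuallyGroundedDecoderInit(nn.Module):
    def __init__(self, visual_dim=2048, hidden_dim=512):
        super().__init__()
        # Map the pretrained visual feature into the decoder's hidden space.
        self.proj = nn.Linear(visual_dim, hidden_dim)

    def forward(self, visual_feat):
        # visual_feat: (batch, visual_dim) pooled feature from an
        # object/scene/action recognition network.
        h0 = torch.tanh(self.proj(visual_feat))
        return h0.unsqueeze(0)  # (1, batch, hidden_dim) for an RNN decoder
```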
“…Second, it is possible that the ResNet posteriors are either extremely noisy, or simply fail to identify certain relevant objects because of a domain mismatch, as discussed in Section 3.1. Previous work in the context of ASR shows that using the penultimate instead of the last layer of the ResNet makes little difference [9].…”
Section: Discussion (mentioning)
confidence: 99%
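For concreteness, the sketch below shows the two ResNet feature choices being compared in the excerpt: the final-layer class posteriors versus the penultimate pooled layer. It uses a standard torchvision ResNet-50 as a stand-in; the cited work's exact model and preprocessing may differ.

```python
# Extract ResNet class posteriors (last layer) and penultimate-layer features.
import torch
import torchvision.models as models

# Requires a torchvision version that exposes the weights enum API.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
# Drop the final fc layer to expose the pooled penultimate features.
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

def resnet_features(images):
    """images: (batch, 3, 224, 224) tensor, normalized as the weights expect."""
    with torch.no_grad():
        logits = resnet(images)                     # last layer
        posteriors = torch.softmax(logits, dim=-1)  # (batch, 1000) class posteriors
        penultimate = backbone(images).flatten(1)   # (batch, 2048) pooled features
    return posteriors, penultimate
```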
“…In audio-visual speech recognition, [9,10] explore strategies to learn and fuse audio and visual representations in a neural net, including concatenating both modalities, bilinear products between representations, and weighted addition of modalities. Within a scene classification task, [11] use one neural network per modality.…”
Section: Related Work (mentioning)
confidence: 99%
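A minimal sketch of the three fusion strategies listed in the excerpt, combining an audio representation a and a visual representation v by concatenation, a bilinear product, or weighted addition. Dimensions and the learned parameters are illustrative assumptions.

```python
# Three simple audio-visual fusion strategies: concatenation, bilinear
# product, and weighted addition of the two modality representations.
import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, out_dim=512):
        super().__init__()
        self.bilinear = nn.Bilinear(audio_dim, visual_dim, out_dim)  # bilinear product
        self.alpha = nn.Parameter(torch.tensor(0.5))                 # learned mixing weight
        self.proj_v = nn.Linear(visual_dim, audio_dim)               # align dims for addition

    def forward(self, a, v, mode="concat"):
        if mode == "concat":
            return torch.cat([a, v], dim=-1)   # simple concatenation
        if mode == "bilinear":
            return self.bilinear(a, v)         # pairwise audio-visual interactions
        if mode == "weighted_add":
            return self.alpha * a + (1 - self.alpha) * self.proj_v(v)
        raise ValueError(mode)
```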
“…Consequently, visual modality integration has recently become a trend in the speech and natural language processing communities. Previous work shows improvements in the domains of visual question answering [1], multimodal machine translation [2], visual dialog [3], and automatic speech recognition (ASR) [4]. Although there are several visual adaptation approaches for ASR [4,5,6,7,8,9], it is still unclear how the models leverage the visual modality.…”
Section: Introduction (mentioning)
confidence: 99%