2019
DOI: 10.48550/arXiv.1907.00477
Preprint

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Abstract: Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition in adversarial settings, where we deprive the models of partial audio signal during inference…

Cited by 2 publications (3 citation statements)
References 25 publications (39 reference statements)

“…A graphical representation of the audio-visual speech recognition pipeline is given in Figure 4. The use of this modality is focused on improving the performance of ASR under both clean and noisy conditions (Wei, Zhang, Hou, & Dai, 2020; Srinivasan, Sanabria, & Metze, 2019). Commonly, such recognition systems adopt visual features extracted from the speaker's mouth region.…”
Section: Automatic Speech Recognition
confidence: 99%
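
The statement above describes pairing acoustic frames with visual features cropped from the speaker's mouth region. A minimal sketch of one such fusion encoder, written in PyTorch, is given below; the module structure, feature dimensions, and simple concatenation-based fusion are illustrative assumptions rather than the exact architecture of the cited systems.

```python
import torch
import torch.nn as nn

class AVFusionEncoder(nn.Module):
    """Illustrative audio-visual encoder: per-frame concatenation fusion."""
    def __init__(self, audio_dim=80, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Project both modalities to a shared size before fusing.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # BiLSTM over the fused per-frame representation.
        self.encoder = nn.LSTM(2 * hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)

    def forward(self, audio, visual):
        # audio:  (batch, T, audio_dim)   e.g., log-mel frames
        # visual: (batch, T, visual_dim)  e.g., mouth-region CNN features,
        #         assumed already resampled to the audio frame rate
        fused = torch.cat([self.audio_proj(audio),
                           self.visual_proj(visual)], dim=-1)
        states, _ = self.encoder(fused)
        return states  # (batch, T, 2 * hidden_dim)
```
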
“…& Siohan, 2019;Braga, Makino, Siohan, & Liao, 2020). Srinivasan et al (2019) analyzed to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. Experimental results show that all of the considered multimodal models i.e., hierarchical feature attention, encoder initialization, early decoder fusion, and encoder-decoder initialization considerably outperform the unimodal baseline model (sequence-to-sequence model with attention) on the full unmasked test set.…”
Section: Automatic Speech Recognitionmentioning
confidence: 99%
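
Among the multimodal variants named in this statement, "encoder initialization" conditions the speech encoder on a global visual embedding. A minimal sketch under that reading is shown below; the projection-to-initial-state design and all dimensions are assumptions for illustration, not the exact setup of Srinivasan et al. (2019).

```python
import torch
import torch.nn as nn

class VisuallyInitializedEncoder(nn.Module):
    """Illustrative 'encoder initialization': a pooled visual embedding
    sets the initial hidden and cell states of the audio encoder."""
    def __init__(self, audio_dim=80, visual_dim=2048, hidden_dim=256):
        super().__init__()
        self.init_h = nn.Linear(visual_dim, hidden_dim)
        self.init_c = nn.Linear(visual_dim, hidden_dim)
        self.rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)

    def forward(self, audio, visual):
        # audio:  (batch, T, audio_dim) acoustic frames
        # visual: (batch, visual_dim) one pooled image/video embedding
        h0 = torch.tanh(self.init_h(visual)).unsqueeze(0)  # (1, batch, H)
        c0 = torch.tanh(self.init_c(visual)).unsqueeze(0)  # (1, batch, H)
        states, _ = self.rnn(audio, (h0, c0))
        return states  # (batch, T, hidden_dim)
```
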
“…Previous work has shown that the audio signal needs to be degraded during training in order to utilize the visual context (Srinivasan et al., 2019). We simulate a degradation of the audio signal during training by randomly masking words with silence.…”
Section: Audio Masking
confidence: 99%
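
The masking procedure quoted above replaces randomly chosen words with silence during training. A minimal sketch, assuming word-level sample alignments are available, is given below; the function name, alignment format, and masking rate are hypothetical.

```python
import random
import numpy as np

def mask_words_with_silence(waveform, word_spans, p_mask=0.15, seed=None):
    """Silence a random subset of words in a waveform.

    waveform:   1-D NumPy array of audio samples.
    word_spans: list of (start_sample, end_sample) per word, assumed to
                come from a forced alignment (format is hypothetical).
    p_mask:     probability of masking each word (illustrative value).
    """
    rng = random.Random(seed)
    masked = waveform.copy()
    for start, end in word_spans:
        if rng.random() < p_mask:
            masked[start:end] = 0.0  # replace the word's span with silence
    return masked
```

Zeroing a word's span is the simplest stand-in for silence; a real pipeline might instead substitute low-energy noise or apply the mask on spectrogram frames.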