ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053397

Looking Enhances Listening: Recovering Missing Speech Using Images

Abstract: Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which …

Cited by 10 publications (21 citation statements)
References 19 publications (41 reference statements)
“…via conceptual captions (Sharma et al, 2018)) or further broadened to include audio (Tsai et al, 2019). Vision can also help ground speech signals (Srinivasan et al, 2020;Harwath et al, 2019) to facilitate discovery of linguistic concepts (Harwath et al, 2020).…”
Section: WS3: The World of Sights and Sounds
confidence: 99%
“…We simulate a degradation of the audio signal during training by randomly masking words with silence. This approach extends Srinivasan et al (2020), where they masked a fixed set of words corresponding to entities, i.e., objects and places. The justification for random word masking, as opposed to entity masking, is that noise in audio signals is unlikely to systematically occur when someone is speaking about an entity.…”
Section: Audio Masking
confidence: 99%
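The random word masking described in the quote above can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the function names, the parallel word/duration lists, and the `<sil>` token are assumptions; the real pipeline would operate on aligned audio frames rather than token lists.

```python
import random

def rand_word_mask(words, durations, mask_prob=0.3, seed=0):
    """Randomly replace words with silence of equal duration.

    A sketch of unstructured (random) word masking: each word is
    independently replaced by a silence token with probability
    `mask_prob`, keeping the original timing intact.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    masked = []
    for word, dur in zip(words, durations):
        if rng.random() < mask_prob:
            masked.append(("<sil>", dur))  # word replaced by silence
        else:
            masked.append((word, dur))
    return masked
```

Because masking is independent of word identity, noise is not concentrated on entities, which is exactly the motivation the quote gives for random masking over entity masking.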
“…Our model development (and the associated results) is conducted on the development set of the Flickr8K Audio Captions Corpus; the rest of our analysis is conducted on the test set. We report Word Error Rate (WER) for all our models, and for datasets with masked audio, we compute Recovery Rate (RR) (Srinivasan et al, 2020), which measures the percentage of masked words in the dataset that are correctly recovered in the transcription: RR = |correctly transcribed masked words| / |masked words in dataset|…”
Section: Evaluation Metrics
confidence: 99%
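The Recovery Rate (RR) defined in the quote above can be sketched in a few lines. This is a simplified illustration under assumed inputs (a list of masked reference words and the hypothesis transcription as a word list); the paper's metric evaluates recovery at the masked positions of the alignment, which this sketch does not attempt.

```python
def recovery_rate(masked_ref_words, hyp_words):
    """RR = |correctly transcribed masked words| / |masked words in dataset|.

    Counts how many masked reference words appear anywhere in the
    hypothesis transcription (a position-agnostic simplification).
    """
    hyp_set = set(hyp_words)
    recovered = sum(1 for w in masked_ref_words if w in hyp_set)
    return recovered / len(masked_ref_words)
```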
“…In this work, we study multimodal ASR in more realistic noisy scenarios. We follow the methodology from (Srinivasan et al, 2020) and mask words in an unstructured manner in the audio signal (we refer to this as RandWordMask).…”
[Figure 2 caption: Our unimodal ASR model, along with several of our fusion methods for integrating a visual context vector (in blue) into the ASR model. The two fusion methods not displayed above, Weighted-DF and Middle-DF, were constructed similar to Early-DF and HierAttn-DF respectively.]
Section: Introduction
confidence: 99%