ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053397

Looking Enhances Listening: Recovering Missing Speech Using Images

Abstract: Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which …

Cited by 10 publications (21 citation statements)
References 19 publications (41 reference statements)
“…via conceptual captions (Sharma et al, 2018)) or further broadened to include audio (Tsai et al, 2019). Vision can also help ground speech signals (Srinivasan et al, 2020;Harwath et al, 2019) to facilitate discovery of linguistic concepts (Harwath et al, 2020).…”
Section: WS3: The World of Sights and Sounds
confidence: 99%
“…We simulate a degradation of the audio signal during training by randomly masking words with silence. This approach extends Srinivasan et al (2020), where they masked a fixed set of words corresponding to entities, i.e., objects and places. The justification for random word masking, as opposed to entity masking, is that noise in audio signals is unlikely to systematically occur when someone is speaking about an entity.…”
Section: Audio Masking
confidence: 99%
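The random word masking described in the quote above can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the function names, the parallel word/duration lists, and the `<sil>` token are assumptions; the real pipeline would operate on aligned audio frames rather than token lists.

```python
import random

def rand_word_mask(words, durations, mask_prob=0.3, seed=0):
    """Randomly replace words with silence of equal duration.

    A sketch of unstructured (random) word masking: each word is
    independently replaced by a silence token with probability
    `mask_prob`, keeping the original timing intact.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    masked = []
    for word, dur in zip(words, durations):
        if rng.random() < mask_prob:
            masked.append(("<sil>", dur))  # word replaced by silence
        else:
            masked.append((word, dur))
    return masked
```

Because masking is independent of word identity, noise is not concentrated on entities, which is exactly the motivation the quote gives for random masking over entity masking.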
“…Our model development (and the associated results) is conducted on the development set of the Flickr8K Audio Captions Corpus; the rest of our analysis is conducted on the test set. We report Word Error Rate (WER) for all our models, and for datasets with masked audio, we compute Recovery Rate (RR) (Srinivasan et al, 2020), which measures the percentage of masked words in the dataset that are correctly recovered in the transcription: RR = |correctly transcribed masked words| / |masked words in dataset|…”
Section: Evaluation Metrics
confidence: 99%
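The Recovery Rate (RR) defined in the quote above can be sketched in a few lines. This is a simplified illustration under assumed inputs (a list of masked reference words and the hypothesis transcription as a word list); the paper's metric evaluates recovery at the masked positions of the alignment, which this sketch does not attempt.

```python
def recovery_rate(masked_ref_words, hyp_words):
    """RR = |correctly transcribed masked words| / |masked words in dataset|.

    Counts how many masked reference words appear anywhere in the
    hypothesis transcription (a position-agnostic simplification).
    """
    hyp_set = set(hyp_words)
    recovered = sum(1 for w in masked_ref_words if w in hyp_set)
    return recovered / len(masked_ref_words)
```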
“…In this work, we study multimodal ASR in more realistic noisy scenarios. We follow the methodology from (Srinivasan et al, 2020) and mask words in an unstructured manner in the audio signal (we refer to this as RandWordMask).…”
[Figure 2 caption: Our unimodal ASR model, along with several of our fusion methods for integrating a visual context vector (in blue) into the ASR model. The two fusion methods not displayed above, Weighted-DF and Middle-DF, were constructed similar to Early-DF and HierAttn-DF respectively.]
Section: Introduction
confidence: 99%