2019
DOI: 10.48550/arXiv.1907.00477
Preprint

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Abstract: Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition in adversarial settings, where we deprive the models of partial audio signal during inference…

Cited by 2 publications (3 citation statements)
References 25 publications (39 reference statements)

“…A graphical representation of the audio-visual speech recognition pipeline is given in Figure 4. The use of this modality is focused on improving the performance of ASR under both clean and noisy conditions (Wei, Zhang, Hou, & Dai, 2020; Srinivasan, Sanabria, & Metze, 2019). Commonly, such recognition systems adopt visual features extracted from the speaker's mouth region.…”
Section: Automatic Speech Recognition
confidence: 99%
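
The statement above describes pairing acoustic frames with visual features cropped from the speaker's mouth region. A minimal sketch of one such fusion encoder, written in PyTorch, is given below; the module structure, feature dimensions, and simple concatenation-based fusion are illustrative assumptions rather than the exact architecture of the cited systems.

```python
import torch
import torch.nn as nn

class AVFusionEncoder(nn.Module):
    """Illustrative audio-visual encoder: per-frame concatenation fusion."""
    def __init__(self, audio_dim=80, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Project both modalities to a shared size before fusing.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # BiLSTM over the fused per-frame representation.
        self.encoder = nn.LSTM(2 * hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)

    def forward(self, audio, visual):
        # audio:  (batch, T, audio_dim)   e.g., log-mel frames
        # visual: (batch, T, visual_dim)  e.g., mouth-region CNN features,
        #         assumed already resampled to the audio frame rate
        fused = torch.cat([self.audio_proj(audio),
                           self.visual_proj(visual)], dim=-1)
        states, _ = self.encoder(fused)
        return states  # (batch, T, 2 * hidden_dim)
```
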
“…& Siohan, 2019;Braga, Makino, Siohan, & Liao, 2020). Srinivasan et al (2019) analyzed to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. Experimental results show that all of the considered multimodal models i.e., hierarchical feature attention, encoder initialization, early decoder fusion, and encoder-decoder initialization considerably outperform the unimodal baseline model (sequence-to-sequence model with attention) on the full unmasked test set.…”
Section: Automatic Speech Recognitionmentioning
confidence: 99%
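
Among the multimodal variants named in this statement, "encoder initialization" conditions the speech encoder on a global visual embedding. A minimal sketch under that reading is shown below; the projection-to-initial-state design and all dimensions are assumptions for illustration, not the exact setup of Srinivasan et al. (2019).

```python
import torch
import torch.nn as nn

class VisuallyInitializedEncoder(nn.Module):
    """Illustrative 'encoder initialization': a pooled visual embedding
    sets the initial hidden and cell states of the audio encoder."""
    def __init__(self, audio_dim=80, visual_dim=2048, hidden_dim=256):
        super().__init__()
        self.init_h = nn.Linear(visual_dim, hidden_dim)
        self.init_c = nn.Linear(visual_dim, hidden_dim)
        self.rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)

    def forward(self, audio, visual):
        # audio:  (batch, T, audio_dim) acoustic frames
        # visual: (batch, visual_dim) one pooled image/video embedding
        h0 = torch.tanh(self.init_h(visual)).unsqueeze(0)  # (1, batch, H)
        c0 = torch.tanh(self.init_c(visual)).unsqueeze(0)  # (1, batch, H)
        states, _ = self.rnn(audio, (h0, c0))
        return states  # (batch, T, hidden_dim)
```
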
“…Previous work has shown that the audio signal needs to be degraded during training in order to utilize the visual context (Srinivasan et al., 2019). We simulate a degradation of the audio signal during training by randomly masking words with silence.…”
Section: Audio Masking
confidence: 99%
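
The masking procedure quoted above replaces randomly chosen words with silence during training. A minimal sketch, assuming word-level sample alignments are available, is given below; the function name, alignment format, and masking rate are hypothetical.

```python
import random
import numpy as np

def mask_words_with_silence(waveform, word_spans, p_mask=0.15, seed=None):
    """Silence a random subset of words in a waveform.

    waveform:   1-D NumPy array of audio samples.
    word_spans: list of (start_sample, end_sample) per word, assumed to
                come from a forced alignment (format is hypothetical).
    p_mask:     probability of masking each word (illustrative value).
    """
    rng = random.Random(seed)
    masked = waveform.copy()
    for start, end in word_spans:
        if rng.random() < p_mask:
            masked[start:end] = 0.0  # replace the word's span with silence
    return masked
```

Zeroing a word's span is the simplest stand-in for silence; a real pipeline might instead substitute low-energy noise or apply the mask on spectrogram frames.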