Probing the Need for Visual Context in Multimodal Machine Translation

Çağlayan, Ozan; Madhyastha, Pranava; Specia, Lucia; Barrault, Loïc

doi:10.48550/arxiv.1903.08678

Cited by 8 publications

(16 citation statements)

References 0 publications


“…Most of the existing works are not capable of generating multi-modal summaries. The systems that do generate multi-modal summaries either have an inbuilt system capable of generating multimodal output (mainly by generating text using seq2seq mechanisms and selecting relevant images) [61,134] or they adopt some post-processing steps to obtain the visual and vocal supplements of the generated textual summaries [44,133].…”

Section: Post-processing (mentioning)

confidence: 99%

“…Information in the form of multi-modal inputs has been leveraged in many tasks other than summarization including multi-modal machine translation [11,21,22,39,108], multi-modal movement prediction [18,53,120], product classification in e-commerce [128], multi-modal interactive artificial intelligence frameworks [51], multi-modal emoji prediction [5,17], multi-modal frame identification [10], multi-modal financial risk forecasting [59,101], multi-modal sentiment analysis [79,93,122], multi-modal named identity recognition [2,77,78,109,126,130], multi-modal video description generation [37,38,91], multi-modal product title compression [70] and multi-modal biometric authentication [28,42,106]. The sheer number of application possibilities for multi-modal information processing and retrieval tasks is quite impressive.…”

Section: Introduction (mentioning)

confidence: 99%


A Survey on Multi-modal Summarization

Jangra, Mukherjee, Jatowt et al., 2021

Preprint


The new era of technology has brought us to the point where it is convenient for people to share their opinions over an abundance of platforms. These platforms have a provision for the users to express themselves in multiple forms of representations, including text, images, videos, and audio. This, however, makes it difficult for users to obtain all the key information about a topic, making the task of automatic multi-modal summarization (MMS) essential. In this paper, we present a comprehensive survey of the existing research in the area of MMS.


“…While much of work in Multimodal Machine Translation (MMT) has suggested that the visual modality is at best marginally beneficial (Barrault et al, 2018; Elliott, 2018), recent work (Caglayan et al, 2019a) suggests that visual information is useful when there is missing information in the source-side signal. We hypothesize that the same could hold true for Multimodal ASR, under conditions when the acoustic speech is corrupted.…”

Section: Introduction (mentioning)

confidence: 99%

“…Inspired by Caglayan et al (2019a), in this work we port a similar set of experiments to MMASR, where we analyze the contribution of the visual modality to different input signal corruption in the primary modality (i.e. acoustic signal) on state-of-the-art MMASR architectures (Sanabria et al, 2018; Caglayan et al, 2019b).…”

Section: Introduction (mentioning)

confidence: 99%

“…acoustic signal) on state-of-the-art MMASR architectures (Sanabria et al, 2018; Caglayan et al, 2019b). Similar to Caglayan et al (2019a), we perform three types of masking, by replacing specific words in the acoustic signal with silence during inference time (Section 2.2). We also analyze the sensitivity of the model to the visual modality similar to Elliott (2018), by deliberately misaligning the audio and visual inputs in our test set (Section 2.3).…”

Section: Introduction (mentioning)

confidence: 99%


Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Srinivasan, Sanabria, Metze, 2019

Preprint


Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition in adversarial settings, where we deprive the models of part of the audio signal during inference time. Our experiments show that while MMASR models show significant gains over traditional speech-to-text architectures (up to 4.2% WER improvements), they do not incorporate visual information when the audio signal has been corrupted. This shows that current methods of integrating the visual modality do not improve model robustness to noise, and we need better visually grounded adaptation techniques.


On Leveraging the Visual Modality for Neural Machine Translation

Raunak, Choe, Lu et al., 2019

Proceedings of the 12th International Conference on Natural Language Generation


Leveraging the visual modality effectively for Neural Machine Translation (NMT) remains an open problem in computational linguistics. Recently, Caglayan et al. posit that the observed gains are limited mainly due to the very simple, short, repetitive sentences of the Multi30k dataset (the only multimodal MT dataset available at the time), which renders the source text sufficient for context. In this work, we further investigate this hypothesis on a new large scale multimodal Machine Translation (MMT) dataset, How2, which has 1.57 times longer mean sentence length than Multi30k and no repetition. We propose and evaluate three novel fusion techniques, each of which is designed to ensure the utilization of visual context at different stages of the Sequence-to-Sequence transduction pipeline, even under full linguistic context. However, we still obtain only marginal gains under full linguistic context and posit that visual embeddings extracted from deep vision models (ResNet for Multi30k, ResNext for How2) do not lend themselves to increasing the discriminativeness between the vocabulary elements at token level prediction in NMT. We demonstrate this qualitatively by analyzing attention distribution and quantitatively through Principal Component Analysis, arriving at the conclusion that it is the quality of the visual embeddings, rather than the length of sentences, that needs to be improved in existing MMT datasets.
