2020
DOI: 10.1007/s10590-020-09250-0
Multimodal machine translation through visuals and speech

Abstract: Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by…

Cited by 48 publications (37 citation statements). References 150 publications.
“…MMT aims to improve the quality of automatic translation using auxiliary sources of information (Sulubacak et al., 2020). The most typical framework explored in previous work makes use of the images when translating their descriptions between languages, with the hypothesis that visual grounding could provide contextual cues to resolve linguistic phenomena such as word-sense disambiguation or gender marking.…”
Section: Multimodal Machine Translation (MMT)
mentioning
confidence: 99%
“…For a good comparison of empirical results, which are not the focus of this paper, we refer to concurrent work (Sulubacak et al., 2019). Moreover, for conciseness we do not cover the sub-topic of simultaneous translation (Fügen, 2008).…”
mentioning
confidence: 99%
“…Inspired by studies of human perception, multimodal processing is spreading into many traditional areas of research, e.g., machine translation (Sulubacak et al., 2019) and ASR. It has become an important part of new areas of research such as image captioning (Bernardi et al., 2016), visual question-answering (VQA; Antol et al., 2015), and multimodal summarization.…”
Section: Related Work
mentioning
confidence: 99%