2019
DOI: 10.48550/arxiv.1911.12798
Preprint

Multimodal Machine Translation through Visuals and Speech

Abstract: Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by…
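As a rough illustration of the abstract's central idea, the sketch below fuses a source-sentence encoding with a pooled image feature before decoding, in the spirit of image-guided translation. This is a minimal sketch assuming PyTorch; the class, the parameter names, and the fusion-by-addition step are invented here for illustration and are not the architecture surveyed in the paper.

# Minimal sketch of image-guided translation: fuse a text encoding with an
# image feature before decoding into the target language. Illustrative only.
import torch
import torch.nn as nn

class ImageGuidedTranslator(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256, img_feat_dim=2048):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Map pooled image features (e.g. from a pretrained CNN) into the
        # same space as the text encoder's hidden states.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, img_feats, tgt_ids):
        _, enc_h = self.encoder(self.src_embed(src_ids))
        # Fusion: add the projected image vector to the final text state and
        # use the result to initialise the decoder (one simple fusion choice).
        fused_h = enc_h + self.img_proj(img_feats).unsqueeze(0)
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), fused_h)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

# Toy usage with random tensors standing in for real data.
model = ImageGuidedTranslator(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))   # source token ids
img = torch.randn(2, 2048)             # pooled image features
tgt = torch.randint(0, 1000, (2, 9))   # target token ids (teacher forcing)
print(model(src, img, tgt).shape)      # torch.Size([2, 9, 1000])

Substituting pooled audio or video features for img_feats would give the spoken- and video-guided variants in the same spirit.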

Cited by 4 publications (4 citation statements)
References 111 publications (145 reference statements)

“…Although it was natural, not least in light of early prototypes for telephone interpreting (Kurematsu and Morimoto 1996; Wahlster 2000b), for research and development to focus on the verbal and, to some extent, paralinguistic components of speech, recent work has also begun to consider multimodal input (e.g., Sulubacak et al. 2019). For instance, visual input from automatic lip reading could be used to support the ASR process, and image guidance more generally could be relevant to applications such as automatic subtitling (speech-to-text).…”
Section: Research To Date
confidence: 99%
“…Inspired by studies of human perception, multimodal processing is spreading into many traditional areas of research, e.g., machine translation (Sulubacak et al., 2019) and ASR. It has become an important part of new areas of research such as image captioning (Bernardi et al., 2016), visual question-answering (VQA; Antol et al., 2015), and multimodal summarization.…”
Section: Related Work
confidence: 99%
“…The multimodal model is able to represent and discover hidden relations between different modalities, and possibly to recover complementary information and implicit interactions that cannot be captured by a uni-modal approach. Such skills are also necessary for natural language processing [45] to achieve human-level comprehension in a variety of AI tasks. The multimedia data shows not only the relationship between users and items but also reflects the preferences of users in different modalities.…”
Section: Introduction
confidence: 99%