2019
DOI: 10.48550/arxiv.1911.12798
Preprint

Multimodal Machine Translation through Visuals and Speech

Abstract: Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit audio and visual modalities, respectively. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by…
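As a rough illustration of the abstract's central idea, the sketch below fuses a source-sentence encoding with a pooled image feature before decoding, in the spirit of image-guided translation. This is a minimal sketch assuming PyTorch; the class, the parameter names, and the fusion-by-addition step are invented here for illustration and are not the architecture surveyed in the paper.

# Minimal sketch of image-guided translation: fuse a text encoding with an
# image feature before decoding into the target language. Illustrative only.
import torch
import torch.nn as nn

class ImageGuidedTranslator(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256, img_feat_dim=2048):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Map pooled image features (e.g. from a pretrained CNN) into the
        # same space as the text encoder's hidden states.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, img_feats, tgt_ids):
        _, enc_h = self.encoder(self.src_embed(src_ids))
        # Fusion: add the projected image vector to the final text state and
        # use the result to initialise the decoder (one simple fusion choice).
        fused_h = enc_h + self.img_proj(img_feats).unsqueeze(0)
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), fused_h)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

# Toy usage with random tensors standing in for real data.
model = ImageGuidedTranslator(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))   # source token ids
img = torch.randn(2, 2048)             # pooled image features
tgt = torch.randint(0, 1000, (2, 9))   # target token ids (teacher forcing)
print(model(src, img, tgt).shape)      # torch.Size([2, 9, 1000])

Substituting pooled audio or video features for img_feats would give the spoken- and video-guided variants in the same spirit.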

Cited by 4 publications (4 citation statements)
References 111 publications (145 reference statements)

“…Although it was natural, not least in light of early prototypes for telephone interpreting (Kurematsu and Morimoto 1996; Wahlster 2000b), for research and development to focus on the verbal and, to some extent, paralinguistic components of speech, recent work has also begun to consider multimodal input (e.g., Sulubacak et al. 2019). For instance, visual input from automatic lip reading could be used to support the ASR process, and image guidance more generally could be relevant to applications such as automatic subtitling (speech-to-text).…”
Section: Research To Date
confidence: 99%
“…Inspired by studies of human perception, multimodal processing is spreading into many traditional areas of research, e.g., machine translation (Sulubacak et al., 2019) and ASR. It has become an important part of new areas of research such as image captioning (Bernardi et al., 2016), visual question-answering (VQA; Antol et al., 2015), and multimodal summarization.…”
Section: Related Work
confidence: 99%
“…The multimodal model is able to represent and discover hidden relations between different modalities, and possibly to recover complementary information and implicit interactions that cannot be captured by a uni-modal approach. Such skills are also necessary for natural language processing [45] to achieve human-level comprehension in a variety of AI tasks. The multimedia data shows not only the relationship between users and items but also reflects the preferences of users in different modalities.…”
Section: Introduction
confidence: 99%