2019
DOI: 10.29007/hxhn

Multimodal Neural Machine Translation Using CNN and Transformer Encoder

Abstract: Multimodal machine translation uses images related to source language sentences as additional inputs to improve translation quality. Existing multimodal neural machine translation (NMT) models incorporate the visual features of each image region either into the encoder for the source language sentence or into an attention mechanism between the encoder and the decoder; as a result, they cannot capture the relationships among the visual features of the different image regions. This paper proposes a new multimodal NMT model that encodes an input image using a c…
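The truncated sentence points to the paper's core idea, named in the title: extract region features with a CNN and let a Transformer encoder's self-attention relate the image regions to one another. Below is a minimal sketch of that idea in PyTorch; the ResNet-50 backbone, 224x224 input, and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: CNN region features followed by a Transformer encoder,
# so self-attention can model relationships between image regions.
# Shapes and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models


class ImageRegionEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        # CNN backbone; drop the avgpool/fc head to keep the 7x7
        # spatial grid of region features (2048 channels for ResNet-50).
        resnet = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images):
        # images: (batch, 3, 224, 224) -> feats: (batch, 2048, 7, 7)
        feats = self.cnn(images)
        # Flatten the spatial grid into a sequence of 49 region vectors.
        regions = feats.flatten(2).transpose(1, 2)   # (batch, 49, 2048)
        regions = self.proj(regions)                 # (batch, 49, d_model)
        # Self-attention over regions captures inter-region relationships.
        return self.transformer(regions)             # (batch, 49, d_model)


encoder = ImageRegionEncoder()
out = encoder(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 49, 512])
```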

Cited by 4 publications (2 citation statements)
References 10 publications (15 reference statements)
“…(Grönroos et al. 2018; Ive et al. 2019; Zhang et al. 2020) (Delbrouck and Dupont 2017; Zhang et al. 2020) (Calixto et al. 2017; Helcl et al. 2018; Libovický et al. 2018; Ive et al. 2019; Takushima et al. 2019)…”
Section: 2
Citation type: unclassified
“…Most MNMT models have incorporated an input image's features with a visual attention mechanism. Some studies have introduced a visual attention mechanism that captures relationships between source language words and image regions (Delbrouck and Dupont, 2017; Zhang et al., 2020), while others have used a visual attention mechanism that captures relationships between target language words and image regions (Calixto et al., 2017; Ive et al., 2019; Takushima et al., 2019). Note that these visual attention mechanisms were trained in an unsupervised manner, and, as far as we know, a supervised visual attention mechanism has not yet been proposed.…”
Section: Experiments With Manual Word Alignments
Citation type: mentioning
Confidence: 99%
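As a rough illustration of the visual attention mechanisms this statement describes, the sketch below uses standard cross-attention with word states as queries and image region features as keys and values, so each word is related to image regions. The module, shapes, and dimensions are assumptions for illustration, not any cited paper's exact formulation.

```python
# Sketch: visual attention relating (target) words to image regions.
# Queries come from word states; keys/values from region features.
import torch
import torch.nn as nn

d_model = 512
visual_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8,
                                    batch_first=True)

words = torch.randn(2, 15, d_model)    # (batch, target words, d_model)
regions = torch.randn(2, 49, d_model)  # (batch, image regions, d_model)

# context: per-word visual context; weights: word-to-region attention.
context, weights = visual_attn(query=words, key=regions, value=regions)
print(context.shape, weights.shape)    # (2, 15, 512) (2, 15, 49)
```

In the unsupervised setting the statement refers to, these attention weights are learned only from the translation loss; a supervised variant would additionally fit them to annotated word-region alignments.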