2016 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip.2016.7533033

Image description through fusion based recurrent multi-modal learning

Cited by 11 publications (7 citation statements)
References 9 publications
“…Moreover, many variants of the attention module in image captioning have also been explored. For example, one can readily find multi-head attention [26], gate-controlled attention [21], fully attentive paradigms [28,34], meshed-connection attention [31], and dual attention [35], among others.…”
Section: Related Work (mentioning)
confidence: 99%
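As a concrete anchor for the first variant this excerpt names, the following is a minimal PyTorch sketch of multi-head attention over a set of image-region features. The class name, dimensions, and scaled dot-product scoring are illustrative assumptions, not the designs of the cited papers.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch: caption states attend to image-region features (assumed setup)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, memory):
        # query: (B, T, d_model) caption states; memory: (B, S, d_model) image regions
        B, T, _ = query.shape
        S = memory.shape[1]
        split = lambda x, L: x.view(B, L, self.n_heads, self.d_head).transpose(1, 2)
        q = split(self.q_proj(query), T)                        # (B, H, T, d_head)
        k = split(self.k_proj(memory), S)                       # (B, H, S, d_head)
        v = split(self.v_proj(memory), S)                       # (B, H, S, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (B, H, T, S)
        attended = scores.softmax(dim=-1) @ v                   # (B, H, T, d_head)
        attended = attended.transpose(1, 2).reshape(B, T, -1)   # heads re-merged
        return self.out_proj(attended)

attn = MultiHeadAttention()
out = attn(torch.randn(2, 7, 512), torch.randn(2, 36, 512))     # -> (2, 7, 512)
```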
“…For example, early deep neural models brought large improvements [13,14,15]. Since then, further neural-network-based models have been proposed in succession, such as multimodal learning [16,17], the encoder-decoder framework [18], attention mechanisms [19,20], compositional architectures [21], methods for describing novel objects [22], and the deep bifurcation network [23].…”
Section: Introduction (mentioning)
confidence: 99%
“…In automotive or indoor robotic visual perception problems, simple concatenation techniques perform well, but they fall short in applications such as video captioning [10,33] or summarization [42], where long-term dependencies are required. In such cases, LSTMs offer a better alternative [59,45].…”
Section: Feature Aggregation (mentioning)
confidence: 99%
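The trade-off this excerpt describes can be made concrete with a short sketch: concatenation fuses each step independently, while an LSTM carries state across steps. Shapes and feature names below are hypothetical.

```python
import torch
import torch.nn as nn

B, T, d_img, d_txt, d_hid = 4, 10, 256, 128, 512
img_feats = torch.randn(B, T, d_img)   # e.g. per-frame visual features (assumed)
txt_feats = torch.randn(B, T, d_txt)   # e.g. per-step language features (assumed)

# 1) Concatenation: cheap and effective, but each step is fused in
#    isolation, so no long-term temporal dependencies are modeled.
concat_fused = torch.cat([img_feats, txt_feats], dim=-1)   # (B, T, d_img + d_txt)

# 2) LSTM aggregation: the recurrent state carries information across
#    steps, which captioning and summarization require.
lstm = nn.LSTM(d_img + d_txt, d_hid, batch_first=True)
lstm_fused, _ = lstm(concat_fused)                          # (B, T, d_hid)
print(concat_fused.shape, lstm_fused.shape)
```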
“…Yao et al. (2016) par-inject the image whilst pre-injecting image attributes (or vice versa); and Liu et al. (2016) par-inject attributes from the image whilst init-injecting the image vector. Other, less common instantiations include par-injecting the image, but only with the first word (this is not pre-inject, as the image is not injected on a separate time step) (Karpathy and Fei-Fei, 2015; Hessel et al., 2015); and passing the words through a separate RNN, such that the resulting hidden state vectors are what is combined with the image vector (Oruganti et al., 2016). This architecture is often used to pass a different representation of the same image with every word, so that the visual information changes across different parts of the sentence being generated.…”
Section: Types Of Architectures (mentioning)
confidence: 99%
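For readers new to the inject terminology used here, the sketch below contrasts two common strategies: init-inject, where the image vector initializes the RNN state, and par-inject, where it accompanies every word input. Dimensions and layer choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn

B, T, d_word, d_img, d_hid = 4, 12, 300, 512, 512
words = torch.randn(B, T, d_word)   # embedded caption prefix (assumed)
image = torch.randn(B, d_img)       # CNN image vector (assumed)

# init-inject: project the image into the LSTM's initial hidden state.
init_rnn = nn.LSTM(d_word, d_hid, batch_first=True)
img_to_h = nn.Linear(d_img, d_hid)
h0 = torch.tanh(img_to_h(image)).unsqueeze(0)   # (1, B, d_hid)
c0 = torch.zeros_like(h0)
init_out, _ = init_rnn(words, (h0, c0))

# par-inject: feed the image alongside every word at every time step.
par_rnn = nn.LSTM(d_word + d_img, d_hid, batch_first=True)
tiled = image.unsqueeze(1).expand(B, T, d_img)  # same vector repeated per step
par_out, _ = par_rnn(torch.cat([words, tiled], dim=-1))
print(init_out.shape, par_out.shape)            # both (B, T, d_hid)
```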
“…For example, Zhou et al. (2016) perform element-wise multiplication of the image vector with the last generated word’s embedding vector in order to attend to different parts of the image vector. Oruganti et al. (2016) pass the image through its own RNN once per word, so that a different image vector is used for every word. Chen and Zitnick (2014, 2015) use a simple RNN to try to predict what the image vector looks like given a prefix.…”
Section: Introduction (mentioning)
confidence: 99%
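The mechanism this excerpt attributes to Oruganti et al. (2016) can be sketched from its one-sentence description: the image vector is fed through its own RNN once per word, so each step produces a distinct visual state that is fused with the word-RNN state. The additive fusion, dimensions, and vocabulary size below are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

B, T, d_word, d_img, d_hid, vocab = 4, 12, 300, 512, 512, 10000
words = torch.randn(B, T, d_word)   # embedded caption prefix (assumed)
image = torch.randn(B, d_img)       # CNN image vector (assumed)

word_rnn = nn.LSTM(d_word, d_hid, batch_first=True)
img_rnn = nn.LSTM(d_img, d_hid, batch_first=True)

word_states, _ = word_rnn(words)                       # (B, T, d_hid)
# The same image vector goes in at every step, but the recurrence makes
# the *output* a different visual state at each word position.
img_states, _ = img_rnn(image.unsqueeze(1).repeat(1, T, 1))

fused = word_states + img_states                       # additive fusion (assumed)
to_vocab = nn.Linear(d_hid, vocab)                     # hypothetical vocabulary size
logits = to_vocab(fused)
print(logits.shape)                                    # (B, T, vocab)
```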