2018
DOI: 10.1007/978-3-030-01258-8_16

Conditional Image-Text Embedding Networks

Abstract: This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model. In order to differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows the underrepresented concepts to take advantage of …
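
To illustrate the idea described in the abstract, here is a minimal PyTorch-style sketch of a text-conditioned embedding with a concept weight branch: a phrase feature predicts soft weights over K embedding subspaces, and the final region-phrase score is the weighted combination of per-subspace similarities. This is not the authors' released code; all module names, dimensions, and the cosine-similarity choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalEmbedding(nn.Module):
    """Sketch of a conditional image-text embedding (hypothetical sizes)."""

    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=256, num_concepts=4):
        super().__init__()
        # K parallel projections of region and phrase features, one per subspace.
        self.img_proj = nn.ModuleList(
            [nn.Linear(img_dim, embed_dim) for _ in range(num_concepts)])
        self.txt_proj = nn.ModuleList(
            [nn.Linear(txt_dim, embed_dim) for _ in range(num_concepts)])
        # Concept weight branch: softly assigns each phrase to the subspaces.
        self.concept_weights = nn.Sequential(
            nn.Linear(txt_dim, num_concepts), nn.Softmax(dim=-1))

    def forward(self, region_feats, phrase_feats):
        # region_feats: (B, R, img_dim) candidate boxes; phrase_feats: (B, txt_dim)
        weights = self.concept_weights(phrase_feats)                      # (B, K)
        scores = []
        for k in range(len(self.img_proj)):
            img_k = F.normalize(self.img_proj[k](region_feats), dim=-1)  # (B, R, D)
            txt_k = F.normalize(self.txt_proj[k](phrase_feats), dim=-1)  # (B, D)
            scores.append(torch.einsum('brd,bd->br', img_k, txt_k))      # (B, R)
        scores = torch.stack(scores, dim=-1)                              # (B, R, K)
        # Weighted combination of per-subspace similarities gives one score per box.
        return (scores * weights.unsqueeze(1)).sum(dim=-1)                # (B, R)
```

In this sketch, grounding a phrase amounts to taking the argmax of the returned scores over the R candidate boxes; the training losses and proposal generation follow the paper and are not reproduced here.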

Cited by 93 publications (54 citation statements). References 36 publications.

“…For a fair comparison, all these methods use a fixed RPN to obtain the candidate boxes and represent them in features that are not tuned on the Flickr30K Entities dataset. We believe that using an additional conditional embedding unit as in [32], and the integration of a proposal generation network with a spatial regression that is tuned on Flickr30K Entities as in [3], should improve the overall result even more. Table 2 shows the phrase grounding performance with respect to the coarse categories in the Flickr30K Entities dataset.…”
Section: Results
Mentioning confidence: 99%

“…Wang et al. [48] propose a structured matching method which attempts to reflect the semantic relations of phrases onto the visual relations of their corresponding regions, without considering the global sentence-level context. Plummer et al. [32] propose using multiple text-conditioned embeddings in a single end-to-end model, with impressive results on the Flickr30K Entities dataset [34].…”
Section: Related Work
Mentioning confidence: 99%

“…Accu@0.5:
SCRC [19] 17.93
MCB + Reg + Spatial [5] 26.54
GroundeR + Spatial [43] 26.93
Similarity Network + Spatial [47] 31.26
CGRE [37] 31.85
MNN + Reg + Spatial [5] 32.21
EB + QRN (VGG cls-SPAT) [6] 32.21
CITE [38] 34.13
IGOP [52] 34.70
QRC Net [6] 44.07
G3RAPHGROUND++ [3] 44.91…”
Section: Methods
Mentioning confidence: 99%

“…We validate our model with different alignment regularization weights and find that a weight of 3 gives the best performance, so it is set to 3 in all following experiments.
[6] 53.48
IGOP [52] 53.97
SPC+PPC [39] 55.49
SS+QRN [6] 55.99
CITE [38] 59.27
SeqGROUND [10] 61.60
G3RAPHGROUND++ [3] 66.93
Visual-BERT [30] 71.33
Contextual Grounding [29] 71.36…”
Section: Implementation Details
Mentioning confidence: 99%

“…Rohrbach et al. (2016) proposed an attention mechanism to attend to relevant object proposals for a given phrase and designed a loss for phrase reconstruction. Plummer et al. (2018) presented an approach to jointly learn multiple text-conditioned embeddings in a single end-to-end network. In DDPN (Yu et al. 2018b), they learned a diversified and discriminative proposal network to generate higher-quality object candidates.…”
Section: Related Work
Mentioning confidence: 99%