Multitask Text-to-Visual Embedding with Titles and Clickthrough Data

Aggarwal, Pranav; Lin, Zhe; Faieta, Baldo; Motiian, Saeid

doi:10.48550/arxiv.1905.13339

Cited by 1 publication (2 citation statements)

References 8 publications (8 reference statements)


“…For each text caption (anchor text) and (positive) image pair, we mine a hard negative sample within a training mini-batch using the online negative sampling strategy from [2]. We treat the caption corresponding to the negative image as the hard negative text.…”

Section: Training Strategy (mentioning)

confidence: 99%
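The in-batch hard negative mining described in the citation statement above can be illustrated with a minimal sketch. The quote does not spell out the selection rule of the strategy from [2], so this sketch assumes a common one: for each anchor caption, pick the non-matching image in the mini-batch with the highest cosine similarity to the caption. The function names `cosine` and `mine_hard_negatives` and the toy embeddings are hypothetical and do not come from either paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(text_embs, image_embs):
    """Return, for each anchor text i, the index j != i of its hardest negative image.

    Assumption (not confirmed by the source): "hardest" means the non-matching
    image in the batch that is most similar to the anchor text.
    text_embs[i] and image_embs[i] are assumed to form a positive pair.
    """
    if len(text_embs) != len(image_embs):
        raise ValueError("expected one image embedding per text embedding")
    if len(text_embs) < 2:
        raise ValueError("need at least two pairs to mine a negative")

    negatives = []
    for i, anchor in enumerate(text_embs):
        best_j, best_sim = None, -math.inf
        for j, image in enumerate(image_embs):
            if j == i:  # skip the positive image of this anchor
                continue
            sim = cosine(anchor, image)
            if sim > best_sim:
                best_j, best_sim = j, sim
        negatives.append(best_j)
    return negatives


# Toy mini-batch of three (caption, image) pairs with 2-D embeddings.
captions = ["a red car", "a green tree", "a car under a tree"]
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[1.0, 0.1], [0.2, 1.0], [0.9, 1.0]]

neg_image_idx = mine_hard_negatives(text_embs, image_embs)
# As in the quoted statement, the caption paired with the negative image
# is treated as the hard negative text.
neg_texts = [captions[j] for j in neg_image_idx]
```

In a real training loop the similarities would be computed on model outputs for every mini-batch, which is what makes the sampling "online"; the quadratic loop here is only meant to keep the selection rule readable.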


Towards Zero-shot Cross-lingual Image Retrieval and Tagging

Aggarwal, Tambi, Kale. 2021. Preprint. Self-citation.


There has been a recent spike in interest in multi-modal Language and Vision problems. On the language side, most of these models focus primarily on English, since most multi-modal datasets are monolingual. We try to bridge this gap with a zero-shot approach for learning multi-modal representations using cross-lingual pretraining on the text side. We present a simple yet practical approach for building a cross-lingual image retrieval model which trains on a monolingual training dataset but can be used in a zero-shot cross-lingual fashion during inference. We also introduce a new objective function which tightens the text embedding clusters by pushing dissimilar texts away from each other. For evaluation, we introduce a new 1K multi-lingual MSCOCO2014 caption test dataset (XTD10) in 7 languages that we collected using a crowdsourcing platform. We use this as the test set for zero-shot model performance across languages. We also demonstrate how a cross-lingual model can be used for downstream tasks like multi-lingual image tagging in a zero-shot manner. The XTD10 dataset is publicly available at https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10

CCS Concepts: Computing methodologies → Neural networks.

“…For our experiments we see that when 𝜌 = 4, 𝛼₁ = 0.5 and 𝛼₂ = 1, we get the best results. To confirm its efficiency, we compare our results with another metric learning loss called "Positive Aware Triplet Ranking Loss (PATR)" [2] which performs a similar task without negative text.…”

Section: Training Strategy (mentioning)

confidence: 99%
