Bidirectional Retrieval Made Simple

Wehrmann, Jônatas; Barros, Rodrigo C.

doi:10.1109/cvpr.2018.00805

Cited by 34 publications

(23 citation statements)

References 21 publications


“…We have also proposed a novel inception-inspired text encoder named CHAIN-VSE for efficient multimodal retrieval [Wehrmann and Barros 2018]. That work was accepted in the CVPR 2018 main conference, which is the conference with highest H-index in computer science as of today.…”

Section: Summary Of Contributions (mentioning)

confidence: 99%

Language-Agnostic Visual-Semantic Embeddings

Wehrmann

Barros

2021

Anais Do XXXIV Concurso De Teses E Dissertações Da SBC (CTD-SBC 2021)

Self Cite


We propose a framework for training language-invariant cross-modal retrieval models. We introduce four novel text encoding approaches, as well as a character-based word-embedding approach, allowing the model to project similar words across languages into the same word-embedding space. In addition, by performing cross-modal retrieval at the character level, the storage requirements for a text encoder decrease substantially, allowing for lighter and more scalable retrieval architectures. The proposed language-invariant textual encoder based on characters is virtually unaffected in terms of storage requirements when novel languages are added to the system. Contributions include new methods for building character-level-based word-embeddings, an improved loss function, and a novel cross-language alignment module that not only makes the architecture language-invariant, but also presents better predictive performance. Moreover, we introduce a module called ADAPT, which is responsible for providing query-aware visual representations that generate large improvements in terms of recall for four widely-used large-scale image-text datasets. We show that our models outperform the current state-of-the-art in all scenarios. This thesis can serve as a new path on retrieval research, now allowing for the effective use of captions in multiple-language scenarios.


“…Wehrmann et al [45] improve sentence representations with a character level inception module and [20,26] improve image representations for image-text matching models. Huang et al [20] use multi-label classification to extract various concepts in images, requiring additional image annotations.…”

Section: Related Work (mentioning)

confidence: 99%

Joint Wasserstein Autoencoders for Aligning Multimodal Embeddings

Mahajan

Botschen

Gurevych

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)


One of the key challenges in learning joint embeddings of multiple modalities, e.g. of images and text, is to ensure coherent cross-modal semantics that generalize across datasets. We propose to address this through joint Gaussian regularization of the latent representations. Building on Wasserstein autoencoders (WAEs) to encode the input in each domain, we enforce the latent embeddings to be similar to a Gaussian prior that is shared across the two domains, ensuring compatible continuity of the encoded semantic representations of images and texts. Semantic alignment is achieved through supervision from matching image-text pairs. To show the benefits of our semi-supervised representation, we apply it to cross-modal retrieval and phrase localization. We not only achieve state-of-the-art accuracy, but significantly better generalization across datasets, owing to the semantic continuity of the latent space.


“…As the embedding space is learned through jointly modeling vision and language, it is often referred as Visual Semantic Embeddings (VSE). Recent work on VSE has shown a clear trend of growing dimensions in order to obtain better embedding quality (Wehrmann 2018). With deeper embeddings, visual semantic hubs increase dramatically.…”

Section: Introduction (mentioning)

confidence: 99%

HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs

Liu

Wang

et al. 2020

AAAI


The hubness problem widely exists in high-dimensional embedding space and is a fundamental source of error for cross-modal matching tasks. In this work, we study the emergence of hubs in Visual Semantic Embeddings (VSE) with application to text-image matching. We analyze the pros and cons of two widely adopted optimization objectives for training VSE and propose a novel hubness-aware loss function (Hal) that addresses previous methods' defects. Unlike (Faghri et al. 2018) which simply takes the hardest sample within a mini-batch, Hal takes all samples into account, using both local and global statistics to scale up the weights of “hubs”. We experiment our method with various configurations of model architectures and datasets. The method exhibits exceptionally good robustness and brings consistent improvement on the task of text-image matching across all settings. Specifically, under the same model architectures as (Faghri et al. 2018) and (Lee et al. 2018), by switching only the learning objective, we report a maximum R@1 improvement of 7.4% on MS-COCO and 8.3% on Flickr30k.
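The abstract above contrasts two widely adopted VSE objectives: one that uses only the hardest negative within a mini-batch and one that sums over all negatives. The sketch below illustrates those two baseline hinge losses on a small similarity matrix. It is not the Hal loss itself, whose weighting scheme is not given here. The function names, the margin value of 0.2, and the convention that `sim[i][j]` holds the similarity of image `i` and caption `j` (with matching pairs on the diagonal) are assumptions made for illustration.

```python
def hinge(x):
    """Standard hinge: max(0, x)."""
    return max(0.0, x)


def hardest_negative_loss(sim, margin=0.2):
    """Bidirectional hinge loss using only the hardest negative per query.

    sim[i][j] is assumed to be the similarity of image i and caption j;
    matching pairs sit on the diagonal. Requires at least two pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative caption for image i (row i, off-diagonal).
        cap = max(hinge(margin + sim[i][j] - pos) for j in range(n) if j != i)
        # Hardest negative image for caption i (column i, off-diagonal).
        img = max(hinge(margin + sim[j][i] - pos) for j in range(n) if j != i)
        total += cap + img
    return total


def sum_of_hinges_loss(sim, margin=0.2):
    """Bidirectional hinge loss summed over every negative in the batch."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j != i:
                total += hinge(margin + sim[i][j] - pos)  # negative captions
                total += hinge(margin + sim[j][i] - pos)  # negative images
    return total


# Hypothetical 3x3 batch of image-caption similarities.
example_sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
hardest = hardest_negative_loss(example_sim)
summed = sum_of_hinges_loss(example_sim)
```

Because each hardest-negative term is one of the terms in the corresponding sum, the hardest-negative loss never exceeds the sum-of-hinges loss on the same batch.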
