2018
DOI: 10.1007/978-3-030-01225-0_13
Stacked Cross Attention for Image-Text Matching

Abstract: In this paper, we study the problem of image-text matching. Inferring the latent semantic alignment between objects or other salient stuff (e.g. snow, sky, lawn) and the corresponding words in sentences makes it possible to capture the fine-grained interplay between vision and language, and makes image-text matching more interpretable. Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step…
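The matching scheme sketched in the abstract — attend over image regions for each word, then pool the per-word relevance scores into one image-sentence similarity — can be illustrated with a minimal numpy sketch. Shapes, the `smooth` temperature, and the LogSumExp pooling are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    # Normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def stacked_cross_attention(regions, words, smooth=9.0):
    """Image-sentence similarity via word-to-region cross attention (sketch).

    regions: (k, d) image region features
    words:   (n, d) word features
    Returns a scalar similarity score.
    """
    v = l2norm(regions)
    w = l2norm(words)
    sim = w @ v.T                              # (n, k) word-region cosine sims
    sim = np.clip(sim, 0.0, None)              # keep only positive evidence
    # For each word, attend over image regions (softmax with temperature).
    attn = np.exp(smooth * sim)
    attn /= attn.sum(axis=1, keepdims=True)    # (n, k)
    attended = attn @ v                        # (n, d) attended image vector per word
    # Relevance of each word to its attended image vector.
    r = np.sum(l2norm(attended) * w, axis=1)   # (n,)
    # Pool per-word relevances into one score (LogSumExp, one pooling choice).
    return float(np.log(np.exp(r).sum()))
```

With orthogonal region features and a single word equal to one region, nearly all attention mass lands on the matching region, so the score approaches the maximum per-word relevance of 1.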

Cited by 861 publications (927 citation statements)
References 37 publications
“…We find that previous methods based on global representations [43] have low generalization performance with top-10 recall as low as 12 on COCO. Fine-grained representations based on attention [26] generalize better compared to [43]. Following Tables 5 and 6, the jWAE-MH framework significantly improves the generalization across datasets further, owing to the semantic continuity from the Gaussian regularization.…”
Section: Image-to-text
confidence: 80%
“…Wehrmann et al [45] improve sentence representations with a character level inception module and [20,26] improve image representations for image-text matching models. Huang et al [20] use multi-label classification to extract various concepts in images, requiring additional image annotations.…”
Section: Related Work
confidence: 99%
“…Modeling interactions among splice sites is essential for circular RNA prediction because backsplices occur when the donors prefer the upstream acceptors over the downstream ones. Inspired by recent successes in natural language processing [22] and computer vision [32], we propose the cross-attention layer to learn deep interaction between acceptors and donors.…”
Section: Cross-Attention for Modeling Deep Interaction
confidence: 99%
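The cross-attention layer described in that citation — letting one sequence (e.g. acceptors) query another (e.g. donors) to model their interactions — amounts to standard scaled dot-product attention between two sequences. A minimal numpy sketch, with projection matrices and shapes assumed for illustration rather than taken from the cited model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, wq, wk, wv):
    """Scaled dot-product cross attention between two sequences (sketch).

    queries:     (m, d) features of one side, e.g. acceptor sites
    keys_values: (n, d) features of the other side, e.g. donor sites
    wq, wk:      (d, dk) query/key projections
    wv:          (d, dv) value projection
    Returns (m, dv): each query summarizes the other sequence.
    """
    q = queries @ wq
    k = keys_values @ wk
    v = keys_values @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (m, n) pairwise interaction scores
    return softmax(scores, axis=-1) @ v        # attention-weighted mix of values
```

Each row of the output is a convex combination of the value vectors, so every query position sees a learned summary of the whole opposing sequence.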