2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00677
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations

Cited by 114 publications (64 citation statements)
References 29 publications
“…In most configurations, CSLS is slightly better than IS on improving text→image inference, while IS is better at image→text. The best results (lines 3.8, 3.9) are even better than the recently reported state-of-the-art (Wu et al., 2019) (Table 4, line 3.14), which performs a naive nearest-neighbor search. This suggests that the hubness problem deserves much more attention and careful selection of inference methods is vital for text-image matching.…”
Section: Hubs During Inference
confidence: 71%
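The statement above contrasts CSLS with naive nearest-neighbor search as a remedy for hubness. Below is a minimal sketch of CSLS (Cross-domain Similarity Local Scaling) scoring for text-image matching, assuming pre-computed L2-normalized embeddings; the function name `csls_scores` and the toy data are illustrative, not taken from the cited work.

```python
import numpy as np

def csls_scores(txt, img, k=3):
    """CSLS scores between L2-normalized text and image embeddings.

    CSLS(t, i) = 2*cos(t, i) - r_img(t) - r_txt(i), where r_img(t) is the
    mean cosine similarity of text t to its k nearest image neighbors
    (and symmetrically for r_txt). Penalizing items that sit in dense
    cross-modal neighborhoods reduces the hubness effect.
    """
    sim = txt @ img.T  # cosine similarities, since inputs are normalized
    # mean similarity to the k nearest cross-modal neighbors, per row/column
    r_t = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_i = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sim - r_t - r_i

# Toy data: 5 texts and 5 images in an 8-d embedding space.
rng = np.random.default_rng(0)
txt = rng.normal(size=(5, 8))
img = rng.normal(size=(5, 8))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
img /= np.linalg.norm(img, axis=1, keepdims=True)

scores = csls_scores(txt, img, k=2)
pred = scores.argmax(axis=1)  # text→image retrieval under CSLS
```

Naive nearest-neighbor search would instead take `argmax` over the raw similarity matrix `txt @ img.T`; CSLS re-ranks by discounting candidates with many close neighbors.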
“…Learning visually grounded semantics to facilitate cross-modal retrieval (i.e., image-to-text and text-to-image) is a challenging task for cross-modal learning (Faghri et al., 2018; Wu et al., 2019). Different from image captioning tasks, radiology reports are often longer and consist of multiple sentences, each related to different abnormal findings; meanwhile, there are fewer distinct objects in radiology images and the differences among images are more subtle.…”
Section: Visual-Semantic Embeddings for Cross-Modal Retrieval
confidence: 99%
“…The core issue of most existing studies [9], [26], [27], [33], [40], [42] for image-text matching can be summarized as learning joint representations for both modalities.…”
Section: Related Work (Image-Text Matching)
confidence: 99%
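As a minimal sketch of what "learning joint representations for both modalities" means in practice, the snippet below projects hypothetical image and text features into a shared space with linear maps and scores all pairs by cosine similarity. All names, dimensions, and the random stand-in weights are illustrative assumptions, not the method of any cited paper; in practice the projections are trained with a ranking loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features (e.g., CNN image features and
# sentence-encoder text features); dimensions are illustrative.
img_feats = rng.normal(size=(4, 16))
txt_feats = rng.normal(size=(4, 32))

# Linear projections into a shared 8-d embedding space
# (random stand-ins here; normally learned from paired data).
W_img = rng.normal(size=(16, 8))
W_txt = rng.normal(size=(32, 8))

def joint_similarity(img_feats, txt_feats, W_img, W_txt):
    """Project both modalities into one space and score all pairs by cosine."""
    vi = img_feats @ W_img
    vt = txt_feats @ W_txt
    vi /= np.linalg.norm(vi, axis=1, keepdims=True)
    vt /= np.linalg.norm(vt, axis=1, keepdims=True)
    return vi @ vt.T  # (n_images, n_texts) similarity matrix

sim = joint_similarity(img_feats, txt_feats, W_img, W_txt)
ranked = np.argsort(-sim, axis=1)  # for each image, texts ranked best-first
```

Retrieval in either direction then reduces to ranking rows (image→text) or columns (text→image) of the same similarity matrix.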