ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Cui, Yuhao; Yu, Zhou; Wang, Chunqi; Zhao, Zhongzhou; Zhang, Ji; Wang, Meng; Yu, Jun

doi:10.1145/3474085.3475251

Cited by 34 publications

(21 citation statements)

References 51 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…[Ji et al, 2019] adopted a visual saliency detection module to guide the cross-modal correlation. [Cui et al, 2021] integrated intra-and cross-modal knowledge to learn the image and text features jointly.…”

Section: Feature Extractionmentioning

confidence: 99%

“…On the one hand, the intra-and cross-modal knowledge in the image and text data are fully exploited in the pre-training ITR approaches [Li et al, 2020c;Cui et al, 2021]. On the other hand, many studies concentrate on increasing the scale of pre-training data.…”

Section: Pre-training Image-text Retrievalmentioning

confidence: 99%

“…And the latter noticed that an object or a word might have different semantics under the different global contexts and proposed to adaptively select informative local components based on the global context for the local alignment. After that, some approaches with the same goal as the above have been successively proposed with either designing an alignment guided masking strategy[Zhuge et al, 2021] or developing an attention filtration technique[Diao et al, 2021]. On the other…”

mentioning

confidence: 99%

See 2 more Smart Citations

Image-text Retrieval: A Survey on Recent Research and Development

Cao¹,

Li²,

Li³

et al. 2022

Preprint

View full text Add to dashboard Cite

In the past few years, cross-modal image-text retrieval (ITR) has experienced increased interest in the research community due to its excellent research value and broad real-world application. It is designed for the scenarios where the queries are from one modality and the retrieval galleries from another modality. This paper presents a comprehensive and up-to-date survey on the ITR approaches from four perspectives. By dissecting an ITR system into two processes: feature extraction and feature alignment, we summarize the recent advance of the ITR approaches from these two perspectives. On top of this, the efficiency-focused study on the ITR system is introduced as the third perspective. To keep pace with the times, we also provide a pioneering overview of the cross-modal pre-training ITR approaches as the fourth perspective. Finally, we outline the common benchmark datasets and evaluation metric for ITR, and conduct the accuracy comparison among the representative ITR approaches. Some critical yet less studied issues are discussed at the end of the paper.

show abstract

Section: Feature Extractionmentioning

confidence: 99%

Section: Pre-training Image-text Retrievalmentioning

confidence: 99%

mentioning

confidence: 99%

See 1 more Smart Citation

Image-text Retrieval: A Survey on Recent Research and Development

Cao¹,

Li²,

Li³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…The past few years have witnessed the rapid development of Vision-Language Pre-training (VLP) models [2,4,17,39], and task-specific finetune on VLP models has become a new and state-of-the-art paradigm in many multimedia tasks [20,21,33]. Beyond accuracy, fairness which concerns about the discrimination towards socially protected or sensitive groups plays a critical role in trustworthy deployment of VLP models in downstream tasks.…”

Section: Introductionmentioning

confidence: 99%

Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models

Zhang¹,

Wang²,

Sang³

2022

Preprint

View full text Add to dashboard Cite

Vision-Language Pre-training (VLP) models have achieved state-ofthe-art performance in numerous cross-modal tasks. Since they are optimized to capture the statistical properties of intra-and intermodality, there remains risk to learn social biases presented in the data as well. In this work, we (1) introduce a counterfactual-based bias measurement CounterBias to quantify the social bias in VLP models by comparing the [MASK]ed prediction probabilities of factual and counterfactual samples; (2) construct a novel VL-Bias dataset including 24K image-text pairs for measuring gender bias in VLP models, from which we observed that significant gender bias is prevalent in VLP models; and (3) propose a VLP debiasing method FairVLP to minimize the difference in the [MASK]ed prediction probabilities between factual and counterfactual image-text pairs for VLP debiasing. Although CounterBias and FairVLP focus on social bias, they are generalizable to serve as tools and provide new insights to probe and regularize more knowledge in VLP models. CCS CONCEPTS• Social and professional topics → Computing / technology policy; • Applied computing → Law, social and behavioral sciences; • Computing methodologies → Machine learning.

show abstract

“…It has become more and more unrealistic to artificially watch and process such a tremendous amount of video data. With the further demand for computers to automatically analyze, understand, and process video content, many video understanding problems [31][32][33] in deep learning and computer vision arise and thrive, such as video visual question answering [5,10,11,18,22] and language-guided video action localization [2,34]. Referring video object segmentation aims to selectively segment one specific object spatially and temporally in a video according to a language query.…”

Section: Introductionmentioning

confidence: 99%

Multi-Attention Network for Compressed Video Referring Object Segmentation

Chen¹,

Hong²,

Qi³

et al. 2022

Preprint

View full text Add to dashboard Cite

Referring video object segmentation aims to segment the object referred by a given language expression. Existing works typically require compressed video bitstream to be decoded to RGB frames before being segmented, which increases computation and storage requirements and ultimately slows the inference down. This may hamper its application in real-world computing resource limited scenarios, such as autonomous cars and drones. To alleviate this problem, in this paper, we explore the referring object segmentation task on compressed videos, namely on the original video data flow. Besides the inherent difficulty of the video referring object segmentation task itself, obtaining discriminative representation from compressed video is also rather challenging. To address this problem, we propose a multi-attention network which consists of dual-path dual-attention module and a query-based cross-modal Transformer module. Specifically, the dual-path dual-attention module is designed to extract effective representation from compressed data in three modalities, i.e., I-frame, Motion Vector and Residual. The query-based cross-modal Transformer firstly models the correlation between linguistic and visual modalities, and then the fused multi-modality features are used to guide object queries to generate a content-aware dynamic kernel and to predict final segmentation masks. Different from previous works, we propose to learn just one kernel, which thus removes the complicated post mask-matching procedure of existing methods. Extensive promising experimental results on three challenging datasets show the effectiveness of our method compared against several state-of-the-art methods which are proposed for processing RGB data. Source code is available at: https://github.com/DexiangHong/MANet.

show abstract

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Cited by 34 publications

References 51 publications

Image-text Retrieval: A Survey on Recent Research and Development

Image-text Retrieval: A Survey on Recent Research and Development

Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models

Multi-Attention Network for Compressed Video Referring Object Segmentation

Contact Info

Product

Resources

About