Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475677
Exploring Logical Reasoning for Referring Expression Comprehension

Cited by 5 publications (2 citation statements) · References 33 publications
“…BBA (Li, Bu, and Cai 2021) proposes a multi-step bidirectional alignment of potential referred pairs across granularity levels using pyramid visual and textual features. LGREC (Cheng et al. 2021) extends CM-Att-Erase (Liu et al. 2019) with a logical matching module that performs logical matching over explicit logical sentences. CMRE (Yang, Li, and Yu 2021) proposes a cross-modal relation extractor that generates a semantic graph guided by sentences and images.…”

Section: Related Work (Visual Grounding)
Confidence: 99%
“…The Panoptic Narrative Grounding (PNG) task is rapidly gaining prominence as a critical area of research in the multimodal domain [11, 36, 37, 52, 58, 59]. This task aims to generate a pixel-level mask for each noun present in a given long sentence, providing a more fine-grained understanding compared to other cross-modal tasks, such as image captioning [6, 35, 42, 51, 62], visual question answering [23, 47, 57, 73], and referring expression comprehension/segmentation [5, 19, 28-30, 33]. This level of detail sets it apart and opens up a wide range of potential applications, including fine-grained image editing [22, 54] and fine-grained image-text retrieval [17, 45].…”

Section: Introduction
Confidence: 99%