Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

Jiao, Yang; Jie, Zequn; Luo, Weixin; Chen, Jingjing; Jiang, Yu‐Gang; Wei, Xiaolin; Ma, Lin

doi:10.1145/3474085.3475222

Cited by 16 publications

(2 citation statements)

References 40 publications

“…One-stage frameworks (Suo et al 2021; Li and Sigal 2021; Hu, Rohrbach, and Darrell 2016) have been proposed. To model semantic relationships between vision and language, recent methods (Ding et al 2021; Feng et al 2021; Jiao et al 2021; Li and Sigal 2021; Yang et al 2022; Luo et al 2020b) incorporate complex cross-attention mechanisms inspired by the powerful abilities of Transformers (Vaswani et al 2017) for capturing long-range dependencies.…”

Section: Referring Expression Segmentation (mentioning)

confidence: 99%

Improving Panoptic Narrative Grounding by Harnessing Semantic Relationships and Visual Confirmation

Guo, Wang, et al. 2024

AAAI

Recent advancements in single-stage Panoptic Narrative Grounding (PNG) have demonstrated significant potential. These methods predict pixel-level masks by directly matching pixels and phrases. However, they often neglect the modeling of semantic and visual relationships between phrase-level instances, limiting their ability for complex multi-modal reasoning in PNG. To tackle this issue, we propose XPNG, a “differentiation-refinement-localization” reasoning paradigm for accurately locating instances or regions. In XPNG, we introduce a Semantic Context Convolution (SCC) module to leverage semantic priors for generating distinctive features. This well-crafted module employs a combination of dynamic channel-wise convolution and pixel-wise convolution to embed semantic information and establish inter-object relationships guided by semantics. Subsequently, we propose a Visual Context Verification (VCV) module to provide visual cues, eliminating potential space biases introduced by semantics and further refining the visual features generated by the previous module. Extensive experiments on PNG benchmark datasets reveal that our approach achieves state-of-the-art performance, significantly outperforming existing methods by a considerable margin and yielding a 3.9-point improvement in overall metrics. Our codes and results are available at our project webpage: https://github.com/TianyuGoGO/XPNG.

“…Recent years have witnessed the great success of deep learning techniques on a series of tasks (He et al 2016; Liu et al 2018; Feng et al 2021), such as image recognition (He et al 2016; Liu et al 2020; Chen et al 2020b,a), image segmentation (Jiao et al 2021), object detection (Ren et al 2016), video recognition and retrieval (Wu et al 2020c; Song et al 2021). Therefore, DNNs have been widely applied in real-world applications, e.g., online recognition services, navigation robots, autonomous driving (Tian et al 2018), etc.…”

Section: Introduction (mentioning)

confidence: 99%

Boosting the Transferability of Video Adversarial Examples via Temporal Translation

Wei, Chen, et al. 2022

AAAI

Self Cite

Although deep-learning-based video recognition models have achieved remarkable success, they are vulnerable to adversarial examples that are generated by adding human-imperceptible perturbations to clean video samples. As indicated in recent studies, adversarial examples are transferable, which makes black-box attacks feasible in real-world applications. Nevertheless, most existing adversarial attack methods have poor transferability when attacking other video models, and transfer-based attacks on video models are still unexplored. To this end, we propose to boost the transferability of video adversarial examples for black-box attacks on video recognition models. Through extensive analysis, we discover that different video recognition models rely on different discriminative temporal patterns, leading to the poor transferability of video adversarial examples. This motivates us to introduce a temporal translation attack method, which optimizes the adversarial perturbations over a set of temporally translated video clips. By generating adversarial examples over translated videos, the resulting adversarial examples are less sensitive to the temporal patterns that exist in the white-box model being attacked and thus transfer better. Extensive experiments on the Kinetics-400 dataset and the UCF-101 dataset demonstrate that our method can significantly boost the transferability of video adversarial examples. For transfer-based attacks against video recognition models, it achieves a 61.56% average attack success rate on Kinetics-400 and 48.60% on UCF-101.
