2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01089
A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension

Cited by 168 publications (102 citation statements)
References 21 publications
“…Another line to improve one-stage visual grounding is to better comprehend longer expressions, especially for RefCOCOg, which contains more complex sentences. Although decomposing the expressions can achieve significant improvement [14,38], we adopt a global language representation for the sake of simplicity. On RefCOCOg, our model still improves performance over our baseline [39] by 12% and 6%, with LSTM and BERT, respectively, showing that modeling long-range spatial relations can help to comprehend longer sentences since these cases require more spatial relational cues to localize the target.…”
Section: Quantitative Results
mentioning (confidence: 99%)
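The "global language representation" mentioned in the statement above is, in general, a single vector summarizing the whole expression rather than a decomposition into sub-phrases. Below is a minimal sketch of one common way to build such a vector; the module sizes, names, and the mean-pooling choice are illustrative assumptions, not the cited papers' implementation.

```python
# Sketch: encode the whole referring expression once and pool it into one
# global vector, instead of decomposing it into sub-phrases.
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class GlobalLanguageEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)       # (batch, seq_len, 2*hidden_dim)
        # Mean-pool over tokens -> one global vector per expression
        return outputs.mean(dim=1)             # (batch, 2*hidden_dim)

# Usage: one vector per expression, later fused with visual features.
encoder = GlobalLanguageEncoder()
global_feat = encoder(torch.randint(0, 10000, (2, 12)))  # shape (2, 1024)
```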
“…Following [28], only the anchor with the largest IoU with the ground-truth bounding box is assigned as a positive sample; the rest are negative samples. Therefore, there…”
[The remainder of this statement was interleaved with rows of Table 2 from the citing paper, which compares two-stage methods (VC [47], Similarity Net [32], CITE [27], MAttNet [43], DDPN [46]) and one-stage methods (ZSGNet [29], RCCF [14], YOLO-VG [39], SQC-Base/SQC-Large [38], and the authors' baseline [39]) by visual backbone, language encoder, accuracy, and inference time; the flattened rows are omitted here.]
Section: Landmark Feature Convolution Module
mentioning (confidence: 99%)
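The assignment rule quoted above (only the single anchor with the largest IoU against the ground-truth box is positive, all others are negative) can be written in a few lines. This is a hedged illustration of that rule, not the cited paper's code; the function names and the (x1, y1, x2, y2) box format are assumptions.

```python
# Sketch of largest-IoU anchor assignment: exactly one positive anchor per
# ground-truth box, every other anchor labeled negative.
import numpy as np

def iou(boxes, gt):
    """IoU between N anchor boxes and one ground-truth box, all (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, 0], gt[0])
    y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2])
    y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def assign_anchors(anchors, gt_box):
    """Label 1 for the single best-IoU anchor, 0 for all the rest."""
    labels = np.zeros(len(anchors), dtype=np.int64)
    labels[np.argmax(iou(anchors, gt_box))] = 1
    return labels

# Usage: with one referred object there is exactly one positive anchor.
anchors = np.array([[0, 0, 50, 50], [20, 20, 80, 80], [60, 60, 120, 120]], dtype=float)
print(assign_anchors(anchors, np.array([25, 25, 85, 85], dtype=float)))  # [0 1 0]
```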
“…Table 1 shows the results of the proposed algorithm against state-of-the-art methods [27,25,44,21,48,45,42,22,3,37,41,19,40]. All the compared approaches except for the recent methods [3,41,19,40] adopt two-stage frameworks, where the prediction is chosen from a set of proposals. Therefore, their models are not end-to-end trainable, while our one-stage framework is able to learn better feature representations by end-to-end training.…”
Section: Evaluation On Seen Datasets
mentioning (confidence: 99%)
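For contrast with the one-stage pipeline discussed above, the two-stage protocol that the statement describes ("the prediction is chosen from a set of proposals") amounts to scoring pre-computed proposals against the expression feature and returning the best one. The cosine-similarity scorer and all names below are illustrative assumptions, not any cited model.

```python
# Sketch of the two-stage selection step: rank detector proposals against the
# expression embedding and pick the highest-scoring region.
import numpy as np

def ground_expression(proposal_feats, expr_feat):
    """proposal_feats: (N, D) region features; expr_feat: (D,) expression feature.
    Returns the index of the proposal most similar to the expression."""
    p = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    e = expr_feat / np.linalg.norm(expr_feat)
    return int(np.argmax(p @ e))  # cosine similarity, take the best proposal

# Usage: 3 proposals with 4-d features, pick the best match.
feats = np.random.rand(3, 4)
print(ground_expression(feats, np.random.rand(4)))
```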
“…Multi-modal grounding tasks (e.g., phrase localization [1,3,9,41,49], referring expression comprehension [17,19,24,26,29,30,37,51,52,55,56] and segmentation [6,18,20,21,29,38,53,56]) aim to generalize traditional object detection and segmentation to localization of regions (rectangular or at a pixel level) in images that correspond to free-form linguistic expressions. These tasks have emerged as core problems in vision and ML due to the breadth of applications that can make use of such techniques, spanning image captioning, visual question answering, visual reasoning and others.…”
Section: Introduction
mentioning (confidence: 99%)