2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01089
A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension

Cited by 168 publications (102 citation statements)
References 21 publications
“…Another line to improve one-stage visual grounding is to better comprehend longer expressions, especially for RefCOCOg, which contains more complex sentences. Although decomposing the expressions can achieve significant improvement [14,38], we adopt a global language representation for the sake of simplicity. On RefCOCOg, our model still improves performance over our baseline [39] by 12% and 6%, with LSTM and BERT, respectively, showing that modeling long-range spatial relations can help to comprehend longer sentences since these cases require more spatial relational cues to localize the target.…”
Section: Quantitative Results
mentioning (confidence: 99%)
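The "global language representation" mentioned in the statement above is, in general, a single vector summarizing the whole expression rather than a decomposition into sub-phrases. Below is a minimal sketch of one common way to build such a vector; the module sizes, names, and the mean-pooling choice are illustrative assumptions, not the cited papers' implementation.

```python
# Sketch: encode the whole referring expression once and pool it into one
# global vector, instead of decomposing it into sub-phrases.
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class GlobalLanguageEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)       # (batch, seq_len, 2*hidden_dim)
        # Mean-pool over tokens -> one global vector per expression
        return outputs.mean(dim=1)             # (batch, 2*hidden_dim)

# Usage: one vector per expression, later fused with visual features.
encoder = GlobalLanguageEncoder()
global_feat = encoder(torch.randint(0, 10000, (2, 12)))  # shape (2, 1024)
```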
“…Following [28], only the anchor with the largest IoU with the ground-truth bounding box is assigned as a positive sample; the rest are negative samples. Therefore, there…”
[The remainder of this statement was interleaved with rows of Table 2 from the citing paper, which compares two-stage methods (VC [47], Similarity Net [32], CITE [27], MAttNet [43], DDPN [46]) and one-stage methods (ZSGNet [29], RCCF [14], YOLO-VG [39], SQC-Base/SQC-Large [38], and the authors' baseline [39]) by visual backbone, language encoder, accuracy, and inference time; the flattened rows are omitted here.]
Section: Landmark Feature Convolution Module
mentioning (confidence: 99%)
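The assignment rule quoted above (only the single anchor with the largest IoU against the ground-truth box is positive, all others are negative) can be written in a few lines. This is a hedged illustration of that rule, not the cited paper's code; the function names and the (x1, y1, x2, y2) box format are assumptions.

```python
# Sketch of largest-IoU anchor assignment: exactly one positive anchor per
# ground-truth box, every other anchor labeled negative.
import numpy as np

def iou(boxes, gt):
    """IoU between N anchor boxes and one ground-truth box, all (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, 0], gt[0])
    y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2])
    y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def assign_anchors(anchors, gt_box):
    """Label 1 for the single best-IoU anchor, 0 for all the rest."""
    labels = np.zeros(len(anchors), dtype=np.int64)
    labels[np.argmax(iou(anchors, gt_box))] = 1
    return labels

# Usage: with one referred object there is exactly one positive anchor.
anchors = np.array([[0, 0, 50, 50], [20, 20, 80, 80], [60, 60, 120, 120]], dtype=float)
print(assign_anchors(anchors, np.array([25, 25, 85, 85], dtype=float)))  # [0 1 0]
```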
“…Table 1 shows the results of the proposed algorithm against state-of-the-art methods [27,25,44,21,48,45,42,22,3,37,41,19,40]. All the compared approaches except for the recent methods [3,41,19,40] adopt two-stage frameworks, where the prediction is chosen from a set of proposals. Therefore, their models are not end-to-end trainable, while our one-stage framework is able to learn better feature representations by end-to-end training.…”
Section: Evaluation On Seen Datasets
mentioning (confidence: 99%)
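For contrast with the one-stage pipeline discussed above, the two-stage protocol that the statement describes ("the prediction is chosen from a set of proposals") amounts to scoring pre-computed proposals against the expression feature and returning the best one. The cosine-similarity scorer and all names below are illustrative assumptions, not any cited model.

```python
# Sketch of the two-stage selection step: rank detector proposals against the
# expression embedding and pick the highest-scoring region.
import numpy as np

def ground_expression(proposal_feats, expr_feat):
    """proposal_feats: (N, D) region features; expr_feat: (D,) expression feature.
    Returns the index of the proposal most similar to the expression."""
    p = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    e = expr_feat / np.linalg.norm(expr_feat)
    return int(np.argmax(p @ e))  # cosine similarity, take the best proposal

# Usage: 3 proposals with 4-d features, pick the best match.
feats = np.random.rand(3, 4)
print(ground_expression(feats, np.random.rand(4)))
```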
“…Multi-modal grounding tasks (e.g., phrase localization [1,3,9,41,49], referring expression comprehension [17,19,24,26,29,30,37,51,52,55,56] and segmentation [6,18,20,21,29,38,53,56]) aim to generalize traditional object detection and segmentation to localization of regions (rectangular or at a pixel level) in images that correspond to free-form linguistic expressions. These tasks have emerged as core problems in vision and ML due to the breadth of applications that can make use of such techniques, spanning image captioning, visual question answering, visual reasoning and others.…”
Section: Introduction
mentioning (confidence: 99%)