2020
DOI: 10.1007/978-3-030-58568-6_23
Improving One-Stage Visual Grounding by Recursive Sub-query Construction

Cited by 139 publications (142 citation statements)
References 42 publications
“…Some recent approaches adopt one-stage frameworks to tackle referring expression object segmentation [3], zeroshot grounding [30], and visual grounding [41,40,39], where the language features are fused with the object detector. In these methods, the models are end-to-end trainable and more computationally efficient.…”
Section: Related Work
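The statement above describes one-stage grounding models that fuse language features directly with the object detector's feature maps. A minimal NumPy sketch of one common fusion scheme, broadcasting a sentence embedding over the spatial grid and concatenating channels; the function name and shapes are illustrative assumptions, not the cited papers' exact implementation:

```python
import numpy as np

def fuse_language_visual(visual_feat, lang_feat):
    """Tile a sentence embedding over an H x W visual feature map and
    concatenate along the channel axis (a common one-stage fusion).

    visual_feat: (C_v, H, W) array from the detector backbone.
    lang_feat:   (C_l,) sentence embedding, e.g. an LSTM/BERT pooled vector.
    Returns:     (C_v + C_l, H, W) fused feature map.
    """
    c_l = lang_feat.shape[0]
    _, h, w = visual_feat.shape
    # Broadcast the language vector to every spatial location.
    tiled = np.broadcast_to(lang_feat[:, None, None], (c_l, h, w))
    return np.concatenate([visual_feat, tiled], axis=0)

fused = fuse_language_visual(np.zeros((256, 32, 32)), np.ones((512,)))
```

Because the fused map keeps the detector's spatial layout, the grounding head can regress boxes per grid cell end-to-end, which is what makes these one-stage methods trainable and efficient.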
“…Table 1 shows the results of the proposed algorithm against state-of-the-art methods [27,25,44,21,48,45,42,22,3,37,41,19,40]. All the compared approaches except for the recent methods [3,41,19,40] adopt two-stage frameworks, where the prediction is chosen from a set of proposals.…”
Section: Evaluation On Seen Datasets
“…While achieving a large-margin improvement over methods that try to directly regress the object from the entire image, such progress may be attributed to the robust local feature representation on the grid. Another line of work improving one-stage visual grounding applies more complex language modeling, such as decomposing a long phrase into multiple parts [38]. In this work, we do not use complex techniques for language modeling.…”
Section: Related Work
“…We apply convolution in v to obtain feature maps with the same channel c_v (Eq. 6) and concatenate the coordinate features with an 8-dimension position embedding vector, the same as in prior work [39,38], such that we generate the fused feature maps…”
Section: Landmark Feature Convolution Module
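The quote above concatenates coordinate features with the visual feature maps. A minimal NumPy sketch of building an 8-channel spatial coordinate map; the exact channel layout below (cell edges, center, and cell size) is one common choice in one-stage grounders and is an assumption here, not necessarily the cited papers' definition:

```python
import numpy as np

def coord_features(h, w):
    """Build an 8-channel spatial coordinate map of shape (8, h, w).

    Channels: normalized left/top/right/bottom edges of each grid cell,
    its normalized center (x, y), and the constant cell sizes 1/w, 1/h.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([
        xs / w,                      # left edge
        ys / h,                      # top edge
        (xs + 1) / w,                # right edge
        (ys + 1) / h,                # bottom edge
        (xs + 0.5) / w,              # center x
        (ys + 0.5) / h,              # center y
        np.full((h, w), 1.0 / w),    # cell width
        np.full((h, w), 1.0 / h),    # cell height
    ])

coords = coord_features(32, 32)
```

Concatenating such a map along the channel axis gives every grid cell an explicit notion of where it sits in the image, which helps the model ground expressions like "the leftmost person".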