2020
DOI: 10.1007/978-3-030-66096-3_4

Cosine Meets Softmax: A Tough-to-beat Baseline for Visual Grounding


Cited by 11 publications (21 citation statements)
References 20 publications
“…A tough-to-beat baseline for visual grounding (CMSVG): Rufus et al [35] showed that the bi-directional retrieval approach can outperform more sophisticated approaches such as MSRR [9] and MAC [17] simply by using state-of-the-art object and sentence encoders. They also performed extensive ablation studies to analyse the influence of the number of region proposals, the image encoder, and the text encoder used.…”
Section: Cosine Meets Softmax (mentioning)
confidence: 99%
“…REC has also been explored on autonomous driving applications, following the introduction of the Talk2Car dataset [1]. Rufus et al [8] use softmax on cosine similarity between region-phrase pairs and employ a cross-entropy loss. Ou et al [9] employ multimodal attention using individual keywords and regions.…”
Section: A Referring Expression Comprehension (mentioning)
confidence: 99%
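
The cosine-softmax idea described in the statement above fits in a few lines. Below is a minimal sketch, assuming illustrative tensor shapes, variable names, and temperature value (none of these are taken from the paper): region proposals are scored against a command embedding by cosine similarity, a softmax is taken over the regions, and training uses a cross-entropy loss.

import torch
import torch.nn.functional as F

def grounding_loss(region_embs, phrase_emb, gt_index, temperature=0.1):
    # region_embs: (N, D) embeddings of the N region proposals
    # phrase_emb:  (D,)   embedding of the referring expression
    # gt_index:    index of the ground-truth region
    # Cosine similarity = dot product of L2-normalised vectors.
    region_embs = F.normalize(region_embs, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    logits = region_embs @ phrase_emb / temperature  # (N,) cosine scores
    # cross_entropy applies the softmax over regions internally.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([gt_index]))
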
“…Currently, the most successful approaches use pre-trained language models to encode the language command (Lu et al, 2020; Chen et al, 2020). For the current study, we use the model from Rufus et al (2020), which uses a pre-trained Sentence-BERT by Reimers and Gurevych (2019) to encode commands, and a pre-trained EfficientNet-b2 by Tan and Le (2019) to encode objects detected in the image. However, detection of the object referred to in the command is not always correct, hence the importance of accurate uncertainty detection and quantification.…”
Section: Detection Of The Referred Object Of The Command (mentioning)
confidence: 99%
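
As a rough illustration of the encoder pairing described in that statement, commands and object crops could be embedded as below. The package names (sentence-transformers, timm), checkpoint name, and input size are assumptions made for the sketch, not necessarily what the cited work used.

import torch
from sentence_transformers import SentenceTransformer  # Sentence-BERT (Reimers & Gurevych, 2019)
import timm

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT checkpoint
image_encoder = timm.create_model("efficientnet_b2", pretrained=True, num_classes=0)
image_encoder.eval()

# Encode the natural-language command as a single vector.
command_emb = torch.tensor(text_encoder.encode("park behind the red truck"))

with torch.no_grad():
    crop = torch.randn(1, 3, 260, 260)            # stand-in for a cropped, normalised object region
    object_emb = image_encoder(crop).squeeze(0)   # pooled EfficientNet-b2 feature vector
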
“…For readability, we notate the probability distribution over the set of objects as p(O_I | Φ, θ), with Φ the set of all inputs. Although our model is agnostic to the underlying VG model for computing this probability distribution, in this paper we make use of the CMSVG model (Rufus et al, 2020) as the VG model, since it is one of the top-performing models on the Talk2Car dataset at the time of writing. This model uses CenterNet (Duan et al, 2019) as an RPN to extract the set of objects O_I from image I.…”
Section: Visual Grounding (VG) Model (mentioning)
confidence: 99%
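
Read together with the cosine-softmax description quoted earlier, the per-object probability produced by such a VG model can be written as follows; this is a hedged reconstruction, not a formula quoted from either paper, assuming v_k is the embedding of the k-th of N RPN proposals and t the command embedding:

p(o_k \mid \Phi, \theta) = \frac{\exp\left(\cos(v_k, t)\right)}{\sum_{j=1}^{N} \exp\left(\cos(v_j, t)\right)}, \qquad o_k \in O_I
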