2020
DOI: 10.1007/978-3-030-66096-3_4

Cosine Meets Softmax: A Tough-to-beat Baseline for Visual Grounding


Cited by 11 publications (21 citation statements)
References 20 publications
“…A tough-to-beat baseline for visual grounding (CMSVG): Rufus et al [35] showed that the bi-directional retrieval approach can outperform more sophisticated approaches such as MSRR [9] and MAC [17] simply by using state-of-the-art object and sentence encoders. They also performed extensive ablation studies to analyse the influence of the number of region proposals, the image encoder, and the text encoder used.…”
Section: Cosine Meets Softmax (mentioning)
confidence: 99%
“…REC has also been explored on autonomous driving applications, following the introduction of the Talk2Car dataset [1]. Rufus et al [8] use softmax on cosine similarity between region-phrase pairs and employ a cross-entropy loss. Ou et al [9] employ multimodal attention using individual keywords and regions.…”
Section: A Referring Expression Comprehension (mentioning)
confidence: 99%
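
The cosine-softmax idea described in the statement above fits in a few lines. Below is a minimal sketch, assuming illustrative tensor shapes, variable names, and temperature value (none of these are taken from the paper): region proposals are scored against a command embedding by cosine similarity, a softmax is taken over the regions, and training uses a cross-entropy loss.

import torch
import torch.nn.functional as F

def grounding_loss(region_embs, phrase_emb, gt_index, temperature=0.1):
    # region_embs: (N, D) embeddings of the N region proposals
    # phrase_emb:  (D,)   embedding of the referring expression
    # gt_index:    index of the ground-truth region
    # Cosine similarity = dot product of L2-normalised vectors.
    region_embs = F.normalize(region_embs, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    logits = region_embs @ phrase_emb / temperature  # (N,) cosine scores
    # cross_entropy applies the softmax over regions internally.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([gt_index]))
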
“…Currently, the most successful approaches use pre-trained language models to encode the language command (Lu et al, 2020; Chen et al, 2020). For the current study, we use the model from Rufus et al (2020), which uses a pre-trained Sentence-BERT by Reimers and Gurevych (2019) to encode commands, and a pre-trained EfficientNet-b2 by Tan and Le (2019) to encode objects detected in the image. However, detection of the object referred to in the command is not always correct, hence the importance of accurate uncertainty detection and quantification.…”
Section: Detection Of The Referred Object Of The Command (mentioning)
confidence: 99%
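
As a rough illustration of the encoder pairing described in that statement, commands and object crops could be embedded as below. The package names (sentence-transformers, timm), checkpoint name, and input size are assumptions made for the sketch, not necessarily what the cited work used.

import torch
from sentence_transformers import SentenceTransformer  # Sentence-BERT (Reimers & Gurevych, 2019)
import timm

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT checkpoint
image_encoder = timm.create_model("efficientnet_b2", pretrained=True, num_classes=0)
image_encoder.eval()

# Encode the natural-language command as a single vector.
command_emb = torch.tensor(text_encoder.encode("park behind the red truck"))

with torch.no_grad():
    crop = torch.randn(1, 3, 260, 260)            # stand-in for a cropped, normalised object region
    object_emb = image_encoder(crop).squeeze(0)   # pooled EfficientNet-b2 feature vector
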
“…For readability, we notate the probability distribution over the set of objects as p(O_I | Φ, θ), with Φ the set of all inputs. Although our model is agnostic to the underlying VG model for computing this probability distribution, in this paper we make use of the CMSVG model (Rufus et al, 2020) as the VG model, since it is one of the top-performing models on the Talk2Car dataset at the time of writing. This model uses CenterNet (Duan et al, 2019) as an RPN to extract the set of objects O_I from image I.…”
Section: Visual Grounding (VG) Model (mentioning)
confidence: 99%
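
Read together with the cosine-softmax description quoted earlier, the per-object probability produced by such a VG model can be written as follows; this is a hedged reconstruction, not a formula quoted from either paper, assuming v_k is the embedding of the k-th of N RPN proposals and t the command embedding:

p(o_k \mid \Phi, \theta) = \frac{\exp\left(\cos(v_k, t)\right)}{\sum_{j=1}^{N} \exp\left(\cos(v_j, t)\right)}, \qquad o_k \in O_I
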