Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1215

Talk2Car: Taking Control of Your Self-Driving Car

Abstract: A long-term goal of artificial intelligence is to have an agent execute commands communicated through natural language. In many cases the commands are grounded in a visual environment shared by the human who gives the command and the agent. Execution of the command then requires mapping the command into the physical visual space, after which the appropriate action can be taken. In this paper we consider the former. Or more specifically, we consider the problem in an autonomous driving setting, where a passenger…
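The grounding step the abstract describes is often made concrete by scoring candidate image regions against an encoded command and selecting the best match. Below is a minimal sketch of that idea; the ProposalScorer name, the embedding dimensions, and the random stand-in features are illustrative assumptions, not the model proposed in the paper.

```python
# Minimal sketch: score region proposals against an encoded natural-language
# command and return the best-matching region. All names and dimensions here
# are assumptions for illustration, not the paper's actual method.
import torch
import torch.nn as nn

class ProposalScorer(nn.Module):
    def __init__(self, text_dim=256, region_dim=512, joint_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, joint_dim)     # command -> joint space
        self.region_proj = nn.Linear(region_dim, joint_dim)  # regions -> joint space

    def forward(self, command_emb, region_feats):
        # command_emb: (text_dim,) sentence embedding of the command
        # region_feats: (num_regions, region_dim) features, one per proposal
        t = self.text_proj(command_emb)        # (joint_dim,)
        r = self.region_proj(region_feats)     # (num_regions, joint_dim)
        scores = r @ t                         # dot-product similarity per region
        return scores.argmax()                 # index of the predicted referred region

scorer = ProposalScorer()
cmd = torch.randn(256)           # stand-in for an encoded command
regions = torch.randn(10, 512)   # stand-in for 10 region proposals
print(scorer(cmd, regions))
```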

Cited by 44 publications (44 citation statements)
References 38 publications
“…Another approach that does not rely on the use of region proposals is the work of Hudson and Manning [18]. Although this method was originally developed for visual question answering, [10] adapted it to tackle the visual grounding task. The model uses a recurrent MAC cell to match the natural language command with a global representation of the image.…”
Section: Methods (mentioning)
Confidence: 99%
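To make the quoted description more concrete, here is a minimal sketch of a simplified MAC-style recurrent cell that repeatedly matches an encoded command against a global image representation. The SimpleMACCell name, the single-vector image feature, the dimensions, and the step count are assumptions for illustration; the actual MAC cell of Hudson and Manning is more elaborate, with separate control, read, and write units.

```python
# A simplified MAC-style recurrent cell: each step attends over the command
# words (control) and updates a memory state informed by the image feature.
# This is a sketch under assumed dimensions, not the exact MAC architecture.
import torch
import torch.nn as nn

class SimpleMACCell(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.control_attn = nn.Linear(dim, 1)  # attention scores over command words
        self.read = nn.Linear(2 * dim, dim)    # combine memory with image feature
        self.write = nn.GRUCell(dim, dim)      # recurrent memory update

    def forward(self, word_embs, image_feat, memory):
        # word_embs: (seq_len, dim) command word embeddings
        # image_feat: (dim,) global image representation
        # memory: (dim,) current memory state
        attn = torch.softmax(self.control_attn(word_embs).squeeze(-1), dim=0)
        control = attn @ word_embs                          # attended command focus
        info = self.read(torch.cat([memory, image_feat]))   # info retrieved from image
        memory = self.write((control * info).unsqueeze(0),  # gate info by control
                            memory.unsqueeze(0)).squeeze(0)
        return memory

cell = SimpleMACCell()
words = torch.randn(8, 256)   # stand-in for an encoded 8-word command
img = torch.randn(256)        # stand-in for a global image feature
mem = torch.zeros(256)
for _ in range(4):            # a few recurrent reasoning steps
    mem = cell(words, img, mem)
```

Running several steps lets later steps condition on what earlier steps retrieved, which is the core idea behind MAC's recurrent reasoning.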
“…Research at the intersection of language and vision has been conducted extensively in the last few years. The main topics include image captioning (Karpathy and Fei-Fei 2015; Xu et al. 2015), visual question answering (VQA) (Agrawal et al. 2017; Andreas et al. 2016), object referring expressions (Deruyttere et al. 2019; Anne Hendricks et al. 2017; Balajee Vasudevan et al. 2018; Vasudevan et al. 2018), and grounded language learning (Hermann et al. 2017; Hill et al. 2017). Although the goals are different from ours, some of the fundamental techniques are shared.…”
Section: Related Work (mentioning)
Confidence: 99%
“…In fact, humans surely need to use this capability for many daily activities such as driving: certain alerting stimuli, such as car horns, the sirens of ambulances, police cars and fire trucks, and human speech, are meant to be heard, i.e. are primarily acoustic [4], [9], [10]. Auditory perception can be used to localize common objects like a running car, which is especially useful when visual perception fails due to adverse visual conditions or occlusions.…”
Section: Introduction (mentioning)
Confidence: 99%