Robotics: Science and Systems XIV 2018
DOI: 10.15607/rss.2018.xiv.028

Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction

Abstract: This paper presents INGRESS, a robot system that follows human natural-language instructions to pick and place everyday objects. The core issue is the grounding of referring expressions: inferring objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disambiguate referring expressions interactively. To achieve these, we take the approach of grounding by genera…

Cited by 113 publications (99 citation statements)
References 28 publications (50 reference statements)
“…Similarly to many studies, we are interested in understanding fetching instructions in everyday environments. Recent studies have addressed multimodal language understanding (MLU) by using visual semantic embedding for visual grounding [1], [10]–[13], visual question answering [14], or caption generation [15]. This approach embeds the visual and linguistic features into a common latent space.…”
Section: Related Work
confidence: 99%
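The common-latent-space idea in this excerpt can be made concrete with a small sketch. Below is a minimal PyTorch example, written for illustration only: the module names, feature dimensions, and the choice of an LSTM text encoder are assumptions, not details taken from the cited papers.

```python
# Minimal sketch of visual-semantic embedding for grounding: a visual
# encoder and a language encoder project into one latent space, where
# similarity can be measured. All sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, vocab_size=10000,
                 word_dim=300, hidden_dim=512, embed_dim=256):
        super().__init__()
        # Visual branch: assumes precomputed CNN features per region.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        # Language branch: word embeddings fed through an LSTM.
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.text_proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, region_feats, token_ids):
        # Embed the image region into the shared space.
        v = F.normalize(self.visual_proj(region_feats), dim=-1)
        # Embed the referring expression; use the final LSTM state.
        _, (h, _) = self.lstm(self.word_embed(token_ids))
        t = F.normalize(self.text_proj(h[-1]), dim=-1)
        # Cosine similarity: both vectors are L2-normalized.
        return (v * t).sum(dim=-1)
```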
“…This approach makes it possible to ground certain parts of an image with linguistic constituents. Shridhar and Hsu (2018) consider the task where a robot arm has to pick up a certain object based on a given command. This is accomplished by generating captions for regions extracted by an RPN and clustering them together with the original command.…”
Section: Grounding in Human-Robot Interaction
confidence: 99%
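For the grounding-by-generation approach described here, a rough sketch of the selection step might look as follows. The helpers `caption_region` and `sentence_embedding` are hypothetical stand-ins for a trained captioning model and sentence encoder, and the clustering step of Shridhar and Hsu (2018) is simplified to a best-match search.

```python
# Sketch of grounding by generation: caption each region proposal,
# then pick the region whose caption best matches the command.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_command(command, region_proposals, caption_region, sentence_embedding):
    """Return the index of the proposal whose generated caption is
    most similar to the given command."""
    cmd_vec = sentence_embedding(command)
    scores = []
    for region in region_proposals:
        caption = caption_region(region)   # e.g. "the red mug"
        scores.append(cosine(sentence_embedding(caption), cmd_vec))
    return int(np.argmax(scores))
```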
“…Like the authors of many studies in the field of robotics, we are interested in fetching tasks in daily-life environments. Recent studies have handled multimodal language understanding using multimodal similarity-based integration [4]–[7]. The approach proposed in [4] uses an LSTM to learn the probability of a referring expression, while a unified framework for referring expression generation and comprehension was proposed in [5] and introduced to robotics in [6].…”
Section: Related Work
confidence: 99%
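The excerpt's mention that [4] "uses an LSTM to learn the probability of a referring expression" suggests a conditional language model over expressions given a region. A sketch of such a scorer follows; all architecture details are assumed for illustration rather than drawn from [4].

```python
# Sketch of scoring P(expression | region) with an LSTM decoder:
# condition the LSTM's initial state on region features and sum
# token log-probabilities under teacher forcing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionScorer(nn.Module):
    def __init__(self, visual_dim=2048, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(visual_dim, hidden_dim)  # region -> initial state
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, region_feats, token_ids):
        # Initialize the LSTM state from the region's visual features.
        h0 = torch.tanh(self.init_h(region_feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        # Teacher-forced pass: predict each token from the previous ones.
        hidden, _ = self.lstm(self.embed(token_ids[:, :-1]), (h0, c0))
        logp = F.log_softmax(self.out(hidden), dim=-1)
        # Sum log-probabilities of the actual next tokens.
        target = token_ids[:, 1:].unsqueeze(-1)
        return logp.gather(-1, target).squeeze(-1).sum(dim=-1)
```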
“…Unfortunately, such systems are time-consuming and cumbersome, especially when considering home environments and non-expert users. Alternatively, recent studies have combined visual and linguistic knowledge by taking a multimodal similarity-based integration approach, which uses cosine similarity between linguistic and visual information [4]–[7]. In this approach, visual and linguistic inputs are handled by convolutional neural networks (CNNs) and long short-term memory (LSTM) networks.…”
Section: Introduction
confidence: 99%
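Models of this kind are commonly trained with a max-margin ranking objective over matched and mismatched image-expression pairs. The sketch below shows one standard formulation of that loss; it is a generic recipe, assumed here rather than taken from [4]–[7].

```python
# Sketch of a max-margin triplet loss for a multimodal similarity
# model: pull matching image-expression pairs together, push every
# mismatched pair in the batch apart by at least `margin`.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(v, t, margin=0.2):
    """v, t: (batch, dim) L2-normalized visual/text embeddings,
    where v[i] and t[i] form a matching pair."""
    sim = v @ t.T                      # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)      # matching-pair similarity
    # Hinge on every mismatched pair, in both retrieval directions.
    cost_t = F.relu(margin + sim - pos)    # wrong expression for an image
    cost_v = F.relu(margin + sim - pos.T)  # wrong image for an expression
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_t.masked_fill(mask, 0).mean()
            + cost_v.masked_fill(mask, 0).mean())
```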