Robotics: Science and Systems XIV 2018
DOI: 10.15607/rss.2018.xiv.028

Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction

Abstract: This paper presents INGRESS, a robot system that follows human natural-language instructions to pick and place everyday objects. The core issue is the grounding of referring expressions: inferring objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disambiguate referring expressions interactively. To achieve these, we take the approach of grounding by genera…

Cited by 113 publications (99 citation statements)
References 28 publications (50 reference statements)
“…Similarly to many studies, we are interested in understanding fetching instructions in everyday environments. Recent studies have addressed multimodal language understanding (MLU) by using visual semantic embedding for visual grounding [1], [10]–[13], visual question answering [14], or caption generation [15]. This approach embeds the visual and linguistic features into a common latent space.…”
Section: Related Work
confidence: 99%
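The common-latent-space idea in this excerpt can be made concrete with a small sketch. Below is a minimal PyTorch example, written for illustration only: the module names, feature dimensions, and the choice of an LSTM text encoder are assumptions, not details taken from the cited papers.

```python
# Minimal sketch of visual-semantic embedding for grounding: a visual
# encoder and a language encoder project into one latent space, where
# similarity can be measured. All sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, vocab_size=10000,
                 word_dim=300, hidden_dim=512, embed_dim=256):
        super().__init__()
        # Visual branch: assumes precomputed CNN features per region.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        # Language branch: word embeddings fed through an LSTM.
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.text_proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, region_feats, token_ids):
        # Embed the image region into the shared space.
        v = F.normalize(self.visual_proj(region_feats), dim=-1)
        # Embed the referring expression; use the final LSTM state.
        _, (h, _) = self.lstm(self.word_embed(token_ids))
        t = F.normalize(self.text_proj(h[-1]), dim=-1)
        # Cosine similarity: both vectors are L2-normalized.
        return (v * t).sum(dim=-1)
```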
“…This approach makes it possible to ground certain parts of an image with linguistic constituents. Shridhar and Hsu (2018) consider the task where a robot arm has to pick up a certain object based on a given command. This is accomplished by generating captions for regions extracted by an RPN and clustering them together with the original command.…”
Section: Grounding in Human-Robot Interaction
confidence: 99%
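For the grounding-by-generation approach described here, a rough sketch of the selection step might look as follows. The helpers `caption_region` and `sentence_embedding` are hypothetical stand-ins for a trained captioning model and sentence encoder, and the clustering step of Shridhar and Hsu (2018) is simplified to a best-match search.

```python
# Sketch of grounding by generation: caption each region proposal,
# then pick the region whose caption best matches the command.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_command(command, region_proposals, caption_region, sentence_embedding):
    """Return the index of the proposal whose generated caption is
    most similar to the given command."""
    cmd_vec = sentence_embedding(command)
    scores = []
    for region in region_proposals:
        caption = caption_region(region)   # e.g. "the red mug"
        scores.append(cosine(sentence_embedding(caption), cmd_vec))
    return int(np.argmax(scores))
```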
“…Like the authors of many studies in the field of robotics, we are interested in fetching tasks in daily-life environments. Recent studies have handled multimodal language understanding using multimodal similarity-based integration [4]–[7]. The approach proposed in [4] uses an LSTM to learn the probability of a referring expression, while a unified framework for referring expression generation and comprehension was proposed in [5] and introduced to robotics in [6].…”
Section: Related Work
confidence: 99%
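The excerpt's mention that [4] "uses an LSTM to learn the probability of a referring expression" suggests a conditional language model over expressions given a region. A sketch of such a scorer follows; all architecture details are assumed for illustration rather than drawn from [4].

```python
# Sketch of scoring P(expression | region) with an LSTM decoder:
# condition the LSTM's initial state on region features and sum
# token log-probabilities under teacher forcing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionScorer(nn.Module):
    def __init__(self, visual_dim=2048, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(visual_dim, hidden_dim)  # region -> initial state
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, region_feats, token_ids):
        # Initialize the LSTM state from the region's visual features.
        h0 = torch.tanh(self.init_h(region_feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        # Teacher-forced pass: predict each token from the previous ones.
        hidden, _ = self.lstm(self.embed(token_ids[:, :-1]), (h0, c0))
        logp = F.log_softmax(self.out(hidden), dim=-1)
        # Sum log-probabilities of the actual next tokens.
        target = token_ids[:, 1:].unsqueeze(-1)
        return logp.gather(-1, target).squeeze(-1).sum(dim=-1)
```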
“…Unfortunately, such systems are time-consuming and cumbersome, especially when considering home environments and non-expert users. Alternatively, recent studies have combined visual and linguistic knowledge by taking a multimodal similarity-based integration approach, which uses cosine similarity between linguistic and visual information [4]–[7]. In this approach, visual and linguistic inputs are handled by convolutional neural networks (CNNs) and long short-term memory (LSTM) networks.…”
Section: Introduction
confidence: 99%
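Models of this kind are commonly trained with a max-margin ranking objective over matched and mismatched image-expression pairs. The sketch below shows one standard formulation of that loss; it is a generic recipe, assumed here rather than taken from [4]–[7].

```python
# Sketch of a max-margin triplet loss for a multimodal similarity
# model: pull matching image-expression pairs together, push every
# mismatched pair in the batch apart by at least `margin`.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(v, t, margin=0.2):
    """v, t: (batch, dim) L2-normalized visual/text embeddings,
    where v[i] and t[i] form a matching pair."""
    sim = v @ t.T                      # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)      # matching-pair similarity
    # Hinge on every mismatched pair, in both retrieval directions.
    cost_t = F.relu(margin + sim - pos)    # wrong expression for an image
    cost_v = F.relu(margin + sim - pos.T)  # wrong image for an expression
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_t.masked_fill(mask, 0).mean()
            + cost_v.masked_fill(mask, 0).mean())
```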