Audio-Visual Grounding Referring Expression for Robotic Manipulation
Preprint, 2021
DOI: 10.48550/arxiv.2109.10571

Abstract: Referring expressions are commonly used to refer to a specific target in everyday dialogue. In this paper, we introduce a novel task of audio-visual grounding of referring expressions for robotic manipulation. The robot leverages both audio and visual information to understand the referring expression in a given manipulation instruction, and then carries out the corresponding manipulation. To solve the proposed task, we propose an audio-visual framework for visual localization and sound recognition…

Cited by 1 publication
(1 citation statement)
References 12 publications
“…[39] modeled a shared space for visual attributes and linguistic concepts, and grounded objects by similarities. Besides visual-linguistic inputs, [245] introduced additional audio information of objects to finish grounding, since sometimes only visual information is not sufficient (e.g. an opaque bottle with different substances in it).…”
Section: Interactive Grasp Synthesis
Confidence: 99%