Untitled

Abstract-This paper presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The core issue here is the grounding of referring expressions: infer objects and their relationships from input images and language expressions. INGRESS allows for unconstrained object categories and unconstrained language expressions. Further, it asks questions to disambiguate referring expressions interactively. To achieve these, we take the approach of grounding by generation and propose a two-stage neuralnetwork model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expression, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred object. The same neural networks are used for both grounding and question generation for disambiguation. Experiments show that INGRESS outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans.

show abstract

INGRESS: Interactive visual grounding of referring expressions

Shridhar

Mittal

Hsu

2020

The International Journal of Robotics Research

View full text Add to dashboard Cite

This article presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The key question here is to ground referring expressions: understand expressions about objects and their relationships from image and natural language inputs. INGRESS allows unconstrained object categories and rich language expressions. Further, it asks questions to clarify ambiguous referring expressions interactively. To achieve these, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expressions, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred objects. The same neural networks are used for both grounding and question generation for disambiguation. Experiments show that INGRESS outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans. The INGRESS source code is available at https://github.com/MohitShridhar/ingress .

show abstract

XPose: Reinventing User Interaction with Flying Cameras

Lan¹,

Shridhar²,

Hsu³

et al. 2017

View full text Add to dashboard Cite

Abstract-XPose is a new touch-based interactive system for photo taking, designed to take advantage of the autonomous flying capability of a drone-mounted camera. It enables the user to interact with photos directly and focus on taking photos instead of piloting the drone. XPose introduces a two-stage eXplore-and-comPose approach to photo taking in static scenes. In the first stage, the user explores the "photo space" through predefined interaction modes: Orbit, Pano, and Zigzag. Under each mode, the camera visits many points of view (POVs) and takes exploratory photos through autonomous drone flying. In the second stage, the user restores a selected POV with the help of a gallery preview and uses direct manipulation gestures to refine the POV and compose a final photo. Our prototype implementation, based on a Parrot Bebop quadcopter, relies mainly on a single monocular camera and works reliably in a GPS-denied environment. A systematic user study indicates that XPose results in more successful user performances in phototaking tasks than the touchscreen joystick interface widely used in commercial drones today.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Mohit Shridhar

ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction

INGRESS: Interactive visual grounding of referring expressions

XPose: Reinventing User Interaction with Flying Cameras

Contact Info

Product

Resources

About