2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
DOI: 10.1109/iros40897.2019.8968142
Improving Robot Success Detection using Static Object Data

Abstract: We use static object data to improve success detection for stacking objects on and nesting objects in one another. Such actions are necessary for certain robotics tasks, e.g., clearing a dining table or packing a warehouse bin. However, using an RGB-D camera to detect success can be insufficient: same-colored objects can be difficult to differentiate, and reflective silverware causes noisy depth perception. We show that adding static data about the objects themselves improves the performance of an end-to…
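The abstract describes an end-to-end success detector that augments RGB-D perception with static data about the manipulated objects. Purely as a rough illustration, the sketch below fuses CNN features of an RGB-D frame with a per-object attribute vector for binary success classification; the SuccessDetector class, layer sizes, input resolution, and the 16-dimensional attribute vector are assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch (not the paper's released code): a binary success
# classifier that fuses RGB-D image features with a fixed "static object
# data" vector (e.g., per-object shape/material attributes). Layer sizes,
# input resolution, and the attribute dimension are all assumptions.
import torch
import torch.nn as nn

class SuccessDetector(nn.Module):
    def __init__(self, static_dim: int = 16):
        super().__init__()
        # Small CNN over a 4-channel RGB-D image (assumed 64x64 input).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Classifier head over concatenated image features and static object data.
        self.head = nn.Sequential(
            nn.Linear(32 + static_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),  # logit for "stack/nest action succeeded"
        )

    def forward(self, rgbd: torch.Tensor, static_obj: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(rgbd)                                  # (B, 32)
        return self.head(torch.cat([feats, static_obj], dim=-1))   # (B, 1)

# Usage: one RGB-D frame plus a static attribute vector for the manipulated object.
model = SuccessDetector(static_dim=16)
rgbd = torch.rand(1, 4, 64, 64)
static_obj = torch.rand(1, 16)
print(torch.sigmoid(model(rgbd, static_obj)))  # probability of success
```

Concatenating the static attribute vector after the visual encoder is only one plausible fusion choice; the paper's actual model may combine the modalities differently.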

Cited by 12 publications (11 citation statements). References 46 publications.
“…As computers transition from desktops to pervasive mobile and edge devices, we must make and meet the expectation that NLP can be deployed in any of these contexts. Current representations have very limited utility in even the most basic robotic settings (Scalise et al, 2019), making collaborative robotics (Rosenthal et al, 2010) largely a domain of custom engineering rather than science.…”
Section: WS4: Embodiment and Action (mentioning)
confidence: 99%
“…Other works attempt to infer actions, rewards, or state-values of human videos and use them for learning predictive models [40] or RL [14,39]. Learning keypoint [51,8] or object/task centric representations from videos [42,38,34] is another promising strategy to learning rewards and representations between domains. Simulation has also been leveraged as supervision to learn such representations [32] or to produce human data with domain randomization [3].…”
Section: B. Robotic Learning From Human Videos (mentioning)
confidence: 99%
“…The physical forces and sounds objects make during manipulation actions can also be associated with words such as rattling and heavy for multimodal understanding beyond vision [18,36]. Prior work has gathered language annotations for the YCB Benchmark object set [37] to explore how language descriptions provide priors on object affordances [38]. In our tabletop robot experiments, we use camera views of novel objects to evaluate zero shot transfer of LAGOR to the real world with minimal object rotations to achieve language-aligned camera views to select the correct referent object.…”
Section: Data (mentioning)
confidence: 99%

Language Grounding with 3D Objects. Thomason, Shridhar, Bisk, et al. 2021. Preprint (self-citation).