Abstract-Learning object models in the wild from natural human interactions is an essential ability for robots to perform general tasks. In this paper we present a robocentric multimodal dataset addressing this key challenge. Our dataset focuses on interactions where the user teaches new objects to the robot in various ways. It contains synchronized recordings of visual (3 cameras) and audio data, which provide a challenging evaluation framework for different tasks. Additionally, we present an end-to-end system that learns object models using object patches extracted from the recorded natural interactions. Our proposed pipeline follows these steps: (a) recognizing the interaction type, (b) detecting the object that the interaction is focusing on, and (c) learning the models from the extracted data. Our main contribution lies in the steps towards identifying the target object patches in the images. We demonstrate the advantages of combining language and visual features for interaction recognition, and we use multiple views to improve the object modelling. Our experimental results show that our dataset is challenging due to occlusions and domain change with respect to typical object learning frameworks. The performance of common out-of-the-box classifiers trained on our data is low, and we demonstrate that our algorithm outperforms such baselines.
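A minimal Python sketch of the three-stage pipeline described above follows; all class and function names (TeachingPipeline, ObjectModel, and the injected recognizer and detector) are hypothetical placeholders for illustration, not the authors' implementation.

# Minimal sketch of the three-stage pipeline:
# (a) interaction recognition, (b) target object detection, (c) model learning.
# Names are hypothetical placeholders, not the authors' code.

from dataclasses import dataclass, field


@dataclass
class ObjectModel:
    label: str
    patches: list = field(default_factory=list)  # cropped image patches of the object

    def update(self, patch):
        self.patches.append(patch)


class TeachingPipeline:
    def __init__(self, recognize_interaction, detect_target):
        self.recognize_interaction = recognize_interaction  # step (a)
        self.detect_target = detect_target                  # step (b)
        self.models = {}                                     # step (c): label -> ObjectModel

    def process(self, frames, transcript):
        # (a) combine visual and language features to identify the interaction
        interaction_type, label = self.recognize_interaction(frames, transcript)
        # (b) segment the object patch the interaction is focusing on
        patch = self.detect_target(frames, interaction_type)
        # (c) grow the object model associated with the spoken label
        self.models.setdefault(label, ObjectModel(label)).update(patch)
        return self.models[label]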
This work describes and explores novel steps towards activity recognition from an egocentric point of view. Activity recognition is a broadly studied topic in computer vision, but the unique characteristics of wearable vision systems present new challenges and opportunities. We evaluate a challenging new publicly available dataset that includes trajectories of different users across two indoor environments performing a set of more than 20 different activities. The visual features studied include compact and global image descriptors, such as GIST and a novel skin-segmentation-based histogram signature, as well as state-of-the-art image representations for recognition, including Bag of SIFT words and Convolutional Neural Network (CNN) based features. Our experiments show that simple and compact features provide reasonable accuracy for basic activity information (in our case, manipulation vs. non-manipulation). However, for finer-grained categories, CNN-based features provide the most promising results. Future steps include integrating depth information with these features and adding temporal consistency to the pipeline.
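As an illustration of the compact descriptors mentioned above, the Python sketch below computes a skin-segmentation-based spatial histogram; the HSV thresholds, grid size and SVM classifier are assumptions for illustration, not the values or models used in the paper.

# Illustrative sketch of a skin-segmentation-based histogram signature for
# separating manipulation from non-manipulation frames.
# Thresholds and grid size are assumptions, not the paper's settings.

import cv2
import numpy as np
from sklearn.svm import SVC


def skin_histogram(frame_bgr, grid=(4, 4)):
    """Return a coarse spatial histogram of skin-coloured pixels."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Rough HSV skin range; a real system would tune or learn this.
    mask = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255)).astype(np.float32) / 255.0
    h, w = mask.shape
    gh, gw = grid
    cells = [
        mask[i * h // gh:(i + 1) * h // gh, j * w // gw:(j + 1) * w // gw].mean()
        for i in range(gh) for j in range(gw)
    ]
    return np.asarray(cells, dtype=np.float32)  # fraction of skin pixels per cell


# Usage (hypothetical): stack descriptors of labelled egocentric frames and
# train a simple classifier for manipulation (1) vs. non-manipulation (0):
#   X = np.stack([skin_histogram(f) for f in frames])
#   clf = SVC(kernel="rbf").fit(X, y)
#   clf.predict(X_new)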
Human action recognition systems are typically focused on identifying different actions rather than fine-grained variations of the same action. This work explores strategies to identify different pointing directions in order to build a natural interaction system to guide autonomous systems such as drones. Commanding a drone with hand-held panels or tablets is common practice, but intuitive user-drone interfaces could have significant benefits. The system proposed in this work only requires the user to provide occasional high-level navigation commands by pointing the drone towards the desired motion direction. Due to the lack of data in these settings, we present a new benchmarking video dataset to validate our framework and facilitate future research in the area. Our results show good accuracy for pointing direction recognition, while running at interactive rates and exhibiting robustness to variability in user appearance, viewpoint, camera distance and scenery.
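The Python sketch below illustrates, under stated assumptions, how a recognized pointing direction could be quantized into a coarse navigation command; the keypoint source (e.g., a 2D pose estimator) and the command names are hypothetical and not part of the proposed system.

# Hedged sketch: turn a pointing direction into a high-level motion command.
# Keypoint source and command interface are assumptions for illustration.

import math


def pointing_direction(shoulder_xy, wrist_xy):
    """Angle of the shoulder-to-wrist vector in the image plane, in degrees."""
    dx = wrist_xy[0] - shoulder_xy[0]
    dy = wrist_xy[1] - shoulder_xy[1]
    return math.degrees(math.atan2(dy, dx))


def direction_to_command(angle_deg):
    """Quantize the pointing angle into a coarse motion command."""
    if -45 <= angle_deg < 45:
        return "move_right"
    if 45 <= angle_deg < 135:
        return "move_down"   # image y axis grows downwards
    if -135 <= angle_deg < -45:
        return "move_up"
    return "move_left"


# Example: the user points roughly towards the right of the frame.
cmd = direction_to_command(pointing_direction((120, 200), (260, 210)))
print(cmd)  # -> "move_right"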
Learning new concepts, such as object models, from human-robot interactions entails different recognition capabilities on a robotic platform. This work proposes a hierarchical approach that addresses the extra challenges of natural interaction scenarios by exploiting multimodal data. First, a speech-guided recognition of the type of interaction taking place is presented. This first step facilitates the subsequent segmentation of relevant visual information to learn the target object model. Our approach includes three complementary strategies to find Regions of Interest (RoI) depending on the interaction type: Point, Show or Speak. We run an exhaustive validation of the proposed strategies using the recently published Multimodal Human-Robot Interaction dataset [1]. The pipeline presented here builds on the pipeline proposed with the dataset and provides a more complete baseline for target object segmentation on all of its recordings.
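The hierarchical idea can be summarized with the hedged Python sketch below, where the speech-guided interaction type selects one of three RoI strategies (Point, Show, Speak); the strategy functions are illustrative stubs, not the paper's implementation.

# Sketch: the recognized interaction type dispatches to one of three
# Region-of-Interest strategies. Strategy bodies are placeholder stubs.

def roi_from_pointing(frames):
    # placeholder: follow the pointing arm and return the indicated region
    return {"strategy": "point", "box": None}

def roi_from_showing(frames):
    # placeholder: segment the hand-held object near the user's hand
    return {"strategy": "show", "box": None}

def roi_from_speech_only(frames):
    # placeholder: fall back to generic object proposals over the whole scene
    return {"strategy": "speak", "box": None}

ROI_STRATEGIES = {
    "point": roi_from_pointing,
    "show": roi_from_showing,
    "speak": roi_from_speech_only,
}

def select_roi(interaction_type, frames):
    """Pick the RoI extraction strategy that matches the recognized interaction."""
    return ROI_STRATEGIES[interaction_type.lower()](frames)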
In order to perform complex tasks in realistic human environments, robots need to be able to learn new concepts in the wild, incrementally, and through their interactions with humans. This paper presents an end-to-end pipeline to learn object models incrementally during human-robot interaction. The pipeline we propose consists of three parts: (a) recognizing the interaction type, (b) detecting the object that the interaction is targeting, and (c) incrementally learning the models from data recorded by the robot sensors. Our main contributions lie in the target object detection, guided by the recognized interaction, and in the incremental object learning. The novelty of our approach is the focus on natural, heterogeneous and multimodal human-robot interactions to incrementally learn new object models. Throughout the paper we highlight the main challenges associated with this problem, such as the high degree of occlusion and clutter, domain change, low-resolution data and interaction ambiguity. Our work shows the benefits of using multi-view approaches and combining visual and language features, and our experimental results outperform standard baselines.

Note to Practitioners-This work was motivated by challenges in recognition tasks for dynamic and varying scenarios. Our approach learns to recognize new user interactions and objects. To do so, we use multimodal data from the user-robot interaction: visual data is used to learn the objects, and speech is used to learn the label and to help with the interaction type recognition. We use state-of-the-art deep learning models to segment the user and the objects in the scene. Our algorithm for incremental learning is based on a classic incremental clustering approach. The pipeline we propose works with all sensors mounted on the robot, so the system remains mobile. Our work uses data recorded from a Baxter robot, which enables the use of its manipulation arms in future steps, but it would work with any robot on which the same sensors can be mounted. The sensors used are two RGB-D cameras and a microphone. The pipeline currently has high computational requirements to run the two deep-learning-based steps; we have tested it on a desktop computer with a GTX 1060 and 32 GB of RAM.
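As a rough illustration of the classic incremental clustering approach mentioned in the Note to Practitioners, the Python sketch below assigns each new object feature to its nearest cluster centroid or starts a new cluster; the distance threshold and feature representation are assumptions, not the paper's settings.

# Sketch of incremental clustering for object model learning: each new feature
# either updates the running mean of the nearest cluster or opens a new one.

import numpy as np


class IncrementalClusters:
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.centroids = []   # running mean of each cluster
        self.counts = []      # number of samples per cluster

    def add(self, feature):
        feature = np.asarray(feature, dtype=np.float32)
        if self.centroids:
            dists = [np.linalg.norm(feature - c) for c in self.centroids]
            k = int(np.argmin(dists))
            if dists[k] < self.threshold:
                # update the running mean of the matched cluster
                self.counts[k] += 1
                self.centroids[k] += (feature - self.centroids[k]) / self.counts[k]
                return k
        # otherwise start a new cluster for this object view
        self.centroids.append(feature.copy())
        self.counts.append(1)
        return len(self.centroids) - 1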