Recovering 6D object pose from RGB indoor image based on two-stage detection network with multi-task loss

Liu, Fuchang; Fang, Pengfei; Yao, Zhengwei; Fan, Ran; Pan, Zhigeng; Sheng, Weiguo; Yang, Huansong

doi:10.1016/j.neucom.2018.12.061

Cited by 23 publications

(12 citation statements)

References 6 publications

“…In this case, a trained CCN amortizes the process of finding the exact template through encoding the observation. The most closely related approaches use convolutional neural networks to directly estimate the object pose, and are pretrained on a set of labeled data which can be considered the templates (Do et al., 2018; Xiang et al., 2018; Liu et al., 2019). While these approaches acquire high accuracy results, they are trained supervised with a labeled dataset.…”

Section: Discussion (mentioning)

confidence: 99%

Embodied Object Representation Learning and Recognition

et al. 2022

Scene understanding and decomposition is a crucial challenge for intelligent systems, whether it is for object manipulation, navigation, or any other task. Although current machine and deep learning approaches for object detection and classification obtain high accuracy, they typically do not leverage interaction with the world and are limited to a set of objects seen during training. Humans, on the other hand, learn to recognize and classify different objects by actively engaging with them on first encounter. Moreover, recent theories in neuroscience suggest that cortical columns in the neocortex play an important role in this process, by building predictive models about objects in their reference frame. In this article, we present an enactive embodied agent that implements such a generative model for object interaction. For each object category, our system instantiates a deep neural network, called Cortical Column Network (CCN), that represents the object in its own reference frame by learning a generative model that predicts the expected transform in pixel space, given an action. The model parameters are optimized through the active inference paradigm, i.e., the minimization of variational free energy. When provided with a visual observation, an ensemble of CCNs each vote on their belief of observing that specific object category, yielding a potential object classification. In case the likelihood of the selected category is too low, the object is detected as an unknown category, and the agent has the ability to instantiate a novel CCN for this category. We validate our system in a simulated environment, where it needs to learn to discern multiple objects from the YCB dataset. We show that classification accuracy improves as an embodied agent can gather more evidence, and that it is able to learn about novel, previously unseen objects. Finally, we show that an agent driven through active inference can choose its actions to reach a preferred observation.

“…In this setting, traditional method [7]-[9] or learning-based method [10]-[12] are firstly used to extract the 2D features, and then the pose can be calculated with Perspective-n-Point (PnP) algorithms [13]. There are also methods that combine the object detection and 6D pose estimation together and directly regress the 6D object pose from RGB images [14], [15]. Most of those methods are tested on common datasets such as YCB [16] and T-LESS [17] which have rich label information.…”

Section: A Methods For Target Pose Estimation and Model Transfer (mentioning)

confidence: 99%

Calibration-Free Monocular Vision-Based Robot Manipulations With Occlusion Awareness

et al. 2021

Vision-based manipulation has been widely used in various robot applications. Normally, in order to obtain the spatial information of the operated target, a carefully calibrated stereo vision system is required. However, this requirement limits the application of robots in unstructured environments, which constrain both the number and the pose of the cameras. In this study, a calibration-free monocular vision-based robot manipulation approach is proposed based on domain randomization and deep reinforcement learning (DRL). Firstly, a learning strategy combined with domain randomization is developed to estimate the spatial information of the target from a single monocular camera arbitrarily mounted in a large area of the manipulation environment. Secondly, to address the monocular occlusion problem, which regularly happens during robot manipulations, an occlusion-aware DRL policy is designed to control the robot to actively avoid occlusions during manipulation tasks. The performance of our method has been evaluated on two common manipulation tasks, reaching and lifting a target building block, and the results show the efficiency and effectiveness of the proposed approach.

“…YOLO and SSD (single shot multibox detector) networks can directly return to the target box position without extracting candidate boxes, so they run faster, but the accuracy is not as good as the former. With the continuous upgrading and optimization of the network, there are mainly four versions of the YOLO algorithm [18].…”

Section: Related Work (mentioning)

confidence: 99%

YOLOMask, an Instance Segmentation Algorithm Based on Complementary Fusion Network

et al. 2021

Object detection and segmentation can improve the accuracy of image recognition, but traditional methods can only extract the shallow information of the target, so the performance of these algorithms is subject to many limitations. With the development of neural network technology, semantic segmentation algorithms based on deep learning can obtain the category information of each pixel. However, such algorithms cannot effectively distinguish individual objects of the same category, so YOLOMask, an instance segmentation algorithm based on a complementary fusion network, is proposed in this paper. Experimental results on the public COCO2017 dataset show that the proposed fusion network can accurately obtain the category and location information of each instance and has good real-time performance.
