2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00230

3D-RelNet: Joint Object and Relational Network for 3D Prediction

Abstract: https://nileshkulkarni.github.io/relative3d/

[Figure 1: (a) Approach Overview: We study the problem of layout estimation in 3D by reasoning about relationships between objects. Given an image and object detection boxes, we first predict the 3D pose (translation, rotation, scale) of each object and the relative pose between… The figure shows instance bounding boxes passed to an object encoder (per-object translation, rotation, scale) and union (pair) bounding boxes passed to a relative encoder (per-pair), with the two combined into the final prediction. (b) Results.]
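The overview above describes predicting a pose per object and a relative pose per object pair, then fusing the two into a final estimate. As a rough illustration only (this is not the authors' exact formulation; the function name `combine_translations` and the simple averaging scheme are assumptions made for this sketch), the translation component of such a fusion might look like:

```python
import numpy as np

def combine_translations(per_object, relative):
    """Fuse per-object translation predictions with pairwise
    relative translations by simple averaging (illustrative sketch).

    per_object: (N, 3) array, predicted translation t_i per object
    relative:   dict mapping (i, j) -> predicted offset t_j - t_i
    """
    n = per_object.shape[0]
    fused = per_object.copy()
    counts = np.ones(n)
    for (i, j), t_rel in relative.items():
        # each relative estimate gives a second opinion on t_j ...
        fused[j] += per_object[i] + t_rel
        counts[j] += 1
        # ... and, reversed, on t_i
        fused[i] += per_object[j] - t_rel
        counts[i] += 1
    return fused / counts[:, None]
```

When the per-object and per-pair predictions are mutually consistent, the fused output simply reproduces the per-object estimates; when they disagree, averaging pulls each object toward the pairwise evidence.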

Cited by 48 publications (48 citation statements). References 39 publications.
“…These extensions have also been combined by Lin et al ( 2020 ). 3D-RelNet is also an object-centric model that predicts a pose for each object and their relation to the other objects in the scene (Kulkarni et al, 2019 ). While these approaches seem promising, in their current implementation they only consider video data from a fixed camera viewpoint.…”
Section: Discussion
confidence: 99%
“…While a lot of research on learning generative models of the environment has been performed, most of them only consider individual objects (Sitzmann et al, 2019b ; Häni et al, 2020 ), consider scenes with a fixed camera viewpoint (Kosiorek et al, 2018 ; Kulkarni et al, 2019 ; Lin et al, 2020 ) or train a separate neural network for each novel scene (Mildenhall et al, 2020 ; Sitzmann et al, 2020 ). We tackle the problem of an active agent that can control the extrinsic parameters of an RGB camera as an active vision system.…”
Section: Introduction
confidence: 99%
“…The most relevant works to us are [22,46,19,9], which take a single image as input and reconstruct multiple object shapes in a scene. However, the methods [22,46,19] are designed for voxel reconstruction with limited resolution. Mesh R-CNN [9] produces object meshes, but still treats objects as isolated geometries without considering the scene context (room layout, object locations, etc.).…”
Section: Related Work
confidence: 99%
“…In indoor environments, object poses generally follow a set of interior design principles, making it a latent pattern that can be learned. By parsing images, previous works either predict 3D boxes object-wisely [14,46] or only consider pair-wise relations [19]. In our work, we assume each object has a multi-lateral relation between its surroundings, and take all in-room objects into account in predicting its bounding box.…”
Section: 3D Object Detection and Layout Estimation
confidence: 99%