Semantic image segmentation is a fundamental yet challenging problem that can be viewed as an extension of conventional object detection and is closely related to image segmentation and classification. It aims to partition images into non-overlapping regions that are assigned predefined semantic labels. Most existing approaches integrate low-level local features and high-level contextual cues, which are fed into an inference framework such as the conditional random field (CRF). However, the primitives (i.e., pixels or superpixels) and the cues carry little semantic meaning and are rarely object-consistent, so they discriminate poorly between classes. Moreover, blindly combining heterogeneous features, and exploiting contextual cues only through the limited neighborhood relations of a CRF, tends to degrade labeling performance. This paper proposes an ontology-based semantic image segmentation (OBSIS) approach that jointly models image segmentation and object detection. In particular, a Dirichlet process mixture model transforms the low-level visual space into an intermediate semantic space, which drastically reduces the feature dimensionality. These features are then individually weighted and independently learned within the context, using multiple CRFs. The segmentation of images into object parts is hence reduced to a classification task, where object inference is passed to an ontology model. This model resembles the way humans understand images, through the combination of different cues, context models, and rule-based learning of the ontologies. Experimental evaluations using the MSRC-21 and PASCAL VOC'2010 data sets show promising results.
We propose a novel keypoint voting scheme based on intersecting spheres that is more accurate than existing schemes and allows for a smaller set of more dispersed keypoints. The scheme forms the basis of the proposed RCVPose method for 6 DoF pose estimation of 3D objects in RGB-D data, which is particularly effective at handling occlusions. A CNN is trained to estimate the distance between the 3D point corresponding to the depth mode of each RGB pixel and each of a set of 3 dispersed keypoints defined in the object frame. At inference, a sphere with radius equal to this estimated distance is generated, centered at each 3D point. The surfaces of these spheres vote to increment a 3D accumulator space, the peaks of which indicate keypoint locations. The proposed radial voting scheme is more accurate than previous vector or offset schemes, and is robust to dispersed keypoints. Experiments demonstrate RCVPose to be highly accurate and competitive, achieving state-of-the-art results on the LINEMOD (99.7%) and YCB-Video (97.2%) datasets, and notably scoring 7.9% higher than previous methods on the challenging Occlusion LINEMOD dataset (71.1%).
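The radial voting step described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `radial_vote`, the dense voxel grid, and the `shell_tol` surface thickness are all assumptions made for illustration, and exact synthetic distances stand in for the CNN's per-point distance estimates. Each 3D point casts one vote into every voxel whose center lies close to the surface of its sphere, and the accumulator peak is taken as the keypoint estimate.

```python
import numpy as np

def radial_vote(points, radii, grid_min, grid_shape, voxel_size, shell_tol=None):
    """Vote sphere surfaces into a voxel accumulator and return its peak.

    points:     (N, 3) array of 3D scene points (sphere centers)
    radii:      (N,) array of estimated point-to-keypoint distances
    grid_min:   (3,) lower corner of the accumulator volume (hypothetical layout)
    grid_shape: (X, Y, Z) number of voxels along each axis
    voxel_size: edge length of one cubic voxel
    shell_tol:  half-thickness of the voted sphere surface (assumed parameter)
    """
    if shell_tol is None:
        shell_tol = voxel_size / 2.0

    # Coordinates of every voxel center, shape (X, Y, Z, 3).
    axes = [grid_min[i] + (np.arange(grid_shape[i]) + 0.5) * voxel_size
            for i in range(3)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    centers = np.stack([gx, gy, gz], axis=-1)

    acc = np.zeros(grid_shape, dtype=np.int32)
    for p, r in zip(points, radii):
        # Distance from this sphere's center to every voxel center.
        dist = np.linalg.norm(centers - p, axis=-1)
        # A voxel receives one vote if it lies on the sphere's surface shell.
        acc += np.abs(dist - r) <= shell_tol

    # The accumulator peak is where the most sphere surfaces intersect.
    peak_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return centers[peak_idx], acc


# Synthetic usage: exact distances replace the network's estimates.
rng = np.random.default_rng(0)
true_keypoint = np.array([0.525, 0.475, 0.325])   # chosen to be a voxel center
scene_points = rng.uniform(0.0, 1.0, size=(50, 3))
distances = np.linalg.norm(scene_points - true_keypoint, axis=1)

estimate, accumulator = radial_vote(
    scene_points, distances,
    grid_min=np.zeros(3), grid_shape=(20, 20, 20), voxel_size=0.05,
)
```

Because each vote depends only on a scalar distance, the keypoint may lie far from the points that vote for it, which is consistent with the abstract's claim that the scheme tolerates dispersed keypoints. A practical system would avoid the dense per-point loop used here, which is kept only for readability.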