Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion

Kortylewski, Adam; Liu, Qing; Wang, Angtian; Sun, Yihong; Yuille, Alan

doi:10.1007/s11263-020-01401-3

Cited by 65 publications

(38 citation statements)

References 42 publications

“…In a broader context, our work builds on and extends a recent line of work that follows an approximate analysis-by-synthesis approach to computer vision [49], which formulates vision as an inverse rendering process on the level of neural network features. Several recent works demonstrate that approximate analysis-by-synthesis induces a largely enhanced generalization in out-of-distribution situations such as when objects are partially occluded in image classification [21][22][23][57] and object detection [50], when images are modified through adversarial patches [20], or when objects are viewed from unseen 3D poses [49]. Our work enables the learning of models for approximate analysis-by-synthesis with minimal supervision.…”

Section: Related Work (mentioning)

confidence: 99%

Neural View Synthesis and Matching for Semi-Supervised Few-Shot Learning of 3D Pose

Wang, Yuille, Kortylewski

2021

Preprint

Self Cite

We study the problem of learning to estimate the 3D object pose from a few labelled examples and a collection of unlabelled data. Our main contribution is a learning framework, neural view synthesis and matching, that can transfer the 3D pose annotation from the labelled to unlabelled images reliably, despite unseen 3D views and nuisance variations such as the object shape, texture, illumination or scene context. In our approach, objects are represented as 3D cuboid meshes composed of feature vectors at each mesh vertex. The model is initialized from a few labelled images and is subsequently used to synthesize feature representations of unseen 3D views. The synthesized views are matched with the feature representations of unlabelled images to generate pseudo-labels of the 3D pose. The pseudo-labelled data is, in turn, used to train the feature extractor such that the features at each mesh vertex are more invariant across varying 3D views of the object. Our model is trained in an EM-type manner alternating between increasing the 3D pose invariance of the feature extractor and annotating unlabelled data through neural view synthesis and matching. We demonstrate the effectiveness of the proposed semi-supervised learning framework for 3D pose estimation on the PASCAL3D+ and KITTI datasets. We find that our approach outperforms all baselines by a wide margin, particularly in an extreme few-shot setting where only 7 annotated images are given. Remarkably, we observe that our model also achieves an exceptional robustness in out-of-distribution scenarios that involve partial occlusion. The code is available at https://github.com/Angtian/NeuralVS.

“…In fact, the issue of object detection under the influence of occlusion is a challenging task which negatively affects the robustness of most detection algorithms [47]. While current approaches aim to tackle this problem by applying a compositional neural network structure in combination with an occluder model [47][48][49][50], the majority of approaches focus on the problem of partial occlusion and would, therefore, be of limited suitability for this study. Moreover, during the tracking process, the negative effect of object occlusion can be reduced to some extent, by applying a predictive model like the KF algorithm, which internally interprets the CNN detections as noisy measurement information.…”

Section: Pig Detection and Tracking (mentioning)

confidence: 99%

Detecting Animal Contacts—A Deep Learning-Based Pig Detection and Tracking Approach for the Quantification of Social Contacts

Wutke, Heinrich, Das et al.

2021

Sensors

The identification of social interactions is of fundamental importance for animal behavioral studies, addressing numerous problems like investigating the influence of social hierarchical structures or the drivers of agonistic behavioral disorders. However, the majority of previous studies often rely on manual determination of the number and types of social encounters by direct observation which requires a large amount of personnel and economical efforts. To overcome this limitation and increase research efficiency and, thus, contribute to animal welfare in the long term, we propose in this study a framework for the automated identification of social contacts. In this framework, we apply a convolutional neural network (CNN) to detect the location and orientation of pigs within a video and track their movement trajectories over a period of time using a Kalman filter (KF) algorithm. Based on the tracking information, we automatically identify social contacts in the form of head–head and head–tail contacts. Moreover, by using the individual animal IDs, we construct a network of social contacts as the final output. We evaluated the performance of our framework based on two distinct test sets for pig detection and tracking. Consequently, we achieved a Sensitivity, Precision, and F1-score of 94.2%, 95.4%, and 95.1%, respectively, and a MOTA score of 94.4%. The findings of this study demonstrate the effectiveness of our keypoint-based tracking-by-detection strategy and can be applied to enhance animal monitoring systems.
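The role the citing passage gives the Kalman filter, treating detections as noisy measurements and bridging frames where occlusion hides the animal, can be illustrated with a minimal sketch. This is not the authors' implementation: the class `Track1D`, the helper `track_through_occlusion`, the constant-velocity motion model, the single tracked coordinate, and the noise values `q` and `r` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track1D:
    """Constant-velocity Kalman filter for one coordinate (hypothetical, illustrative)."""
    pos: float = 0.0
    vel: float = 0.0
    # 2x2 state covariance, stored element-wise to avoid external dependencies.
    p00: float = 1.0
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 1.0

    def predict(self, dt: float = 1.0, q: float = 0.01) -> None:
        # State transition: pos <- pos + dt * vel, velocity unchanged.
        self.pos += dt * self.vel
        # Covariance: P <- F P F^T + q * I (assumed, simplified process noise).
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z: float, r: float = 0.25) -> None:
        # The detection z is a noisy measurement of position only (H = [1, 0]).
        innovation = z - self.pos
        s = self.p00 + r                      # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        # Covariance: P <- (I - K H) P.
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


def track_through_occlusion(detections: List[Optional[float]]) -> List[float]:
    """Return per-frame position estimates; None marks a missed detection."""
    track = Track1D()
    estimates = []
    for z in detections:
        track.predict()
        if z is not None:
            track.update(z)   # detection available: correct the prediction
        estimates.append(track.pos)  # otherwise coast on the motion model
    return estimates


if __name__ == "__main__":
    # Four detections of an object moving one unit per frame, then two occluded frames.
    print(track_through_occlusion([1.0, 2.0, 3.0, 4.0, None, None]))
```

In the two frames without a detection the filter only runs its prediction step, so the estimate keeps advancing along the learned velocity while the position variance `p00` grows. That is the sense in which a predictive model can reduce the effect of short occlusions "to some extent".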

“…[18] additionally predicts the segmentation masks of occluders. [21] integrates compositional models and deep convolutional neural networks into a unified model which is more robust to partial occlusions.…”

Section: Related Work (mentioning)

confidence: 99%

Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge

Qi, Gao, Hu et al.

2021

Preprint

Self Cite

Although deep learning methods have achieved advanced video object recognition performance in recent years, perceiving heavily occluded objects in a video is still a very challenging task. To promote the development of occlusion understanding, we collect a large-scale dataset called OVIS for video instance segmentation in the occluded scenario. OVIS consists of 296k high-quality instance masks and 901 occluded scenes. While our human vision systems can perceive those occluded objects by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, all baseline methods encounter a significant performance degradation of about 80% in the heavily occluded object group, which demonstrates that there is still a long way to go in understanding obscured objects and videos in a complex real-world scenario. To facilitate the research on new paradigms for video understanding systems, we launched a challenge based on the OVIS dataset. The submitted top-performing algorithms have achieved much higher performance than our baselines. In this paper, we will introduce the OVIS dataset and further dissect it by analyzing the results of baselines and submitted methods. The OVIS dataset and challenge information can be found at http://songbai.site/ovis.
