Semantically Coherent Co-Segmentation and Reconstruction of Dynamic Scenes

Mustafa, Armin; Hilton, Adrian

doi:10.1109/cvpr.2017.592

Cited by 45 publications

(44 citation statements)

References 58 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Previous studies on wide baseline human performance capture methods [24,33] either use initial sparse reconstruction or visual hulls generated from multi cameras to limit the stereo search space. Other methods for wide baseline semantic reconstruction exploit semantic segmentation constraints to improve the multi-view stereo [23]. In this study, we propose to exploit semantic masking in the stereo matching framework to limit the search region along the Epipolar line to decrease the number of wrong matches from only two camera views.…”

Section: Semantic Stereo Constraintsmentioning

confidence: 99%

Learning Dense Wide Baseline Stereo Matching for People

Caliskan

Mustafa

Imre

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Self Cite

View full text Add to dashboard Cite

Existing methods for stereo work on narrow baseline image pairs giving limited performance between wide baseline views. This paper proposes a framework to learn and estimate dense stereo for people from wide baseline image pairs. A synthetic people stereo patch dataset (S2P2) is introduced to learn wide baseline dense stereo matching for people. The proposed framework not only learns human specific features from synthetic data but also exploits pooling layer and data augmentation to adapt to real data. The network learns from the human specific stereo patches from the proposed dataset for wide-baseline stereo estimation. In addition to patch match learning, a stereo constraint is introduced in the framework to solve wide baseline stereo reconstruction of humans. Quantitative and qualitative performance evaluation against state-of-the-art methods of proposed method demonstrates improved wide baseline stereo reconstruction on challenging datasets. We show that it is possible to learn stereo matching from synthetic people dataset and improve performance on real datasets for stereo reconstruction of people from narrow and wide baseline stereo data.

show abstract

Section: Semantic Stereo Constraintsmentioning

confidence: 99%

Learning Dense Wide Baseline Stereo Matching for People

Caliskan

Mustafa

Imre

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Self Cite

View full text Add to dashboard Cite

show abstract

“…While correspondence estimation has been extensively studied, there has been a growing trend to extend the idea of matching the same objects across images to matching images covering different instances of an object category. This progress not only attracts substantial attention but also facilitates many real-world applications ranging from object recognition [26], object co-segmentation [2,12,35], to 3D reconstruction [29]. However, due to the presence of background clutter, ambiguity induced by large intra-class variations, and the limited scalability of obtaining large-scale datasets with manually annotated correspondences, semantic matching remains challenging.…”

Section: Introductionmentioning

confidence: 99%

Deep Semantic Matching with Foreground Detection and Cycle-Consistency

Chen

Huang

et al. 2019

Computer Vision – ACCV 2018

View full text Add to dashboard Cite

Establishing dense semantic correspondences between object instances remains a challenging problem due to background clutter, significant scale and pose differences, and large intra-class variations. In this paper, we address weakly supervised semantic matching based on a deep network where only image pairs without manual keypoint correspondence annotations are provided. To facilitate network training with this weaker form of supervision, we 1) explicitly estimate the foreground regions to suppress the effect of background clutter and 2) develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent. We train the proposed model using the PF-PASCAL dataset and evaluate the performance on the PF-PASCAL, PF-WILLOW, and TSS datasets. Extensive experimental results show that the proposed approach performs favorably against the state-of-the-art methods.

show abstract

“…Existing multi-task methods for scene understanding perform per frame joint reconstruction and semantic in- stance segmentation from a single image [25], showing that joint estimation can improve each task. Other methods have fused semantic segmentation with reconstruction [36] or flow estimation [42] demonstrating significant improvement in both semantic segmentation and reconstruction/scene flow. We exploit the joint estimation to understand dynamic scenes by simultaneous reconstruction, flow and segmentation estimation from multiple view video.…”

Section: Introductionmentioning

confidence: 99%

“…However methods in these three categories do not exploit semantic information of the scene. The fourth category of joint estimation methods exploit semantic information by introducing joint semantic segmentation and reconstruction for general dynamic scenes [19,56,27,49,36] and street scenes [13,50]. However these methods give perframe semantic segmentation and reconstruction with no motion estimate leading to unaligned geometry and pixel level incoherence in both segmentation and reconstruction for dynamic sequences.…”

Section: Introductionmentioning

confidence: 99%

“…Methods in the literature have exploited human-pose information to improve results in semantic segmentation [55] and reconstruction [22]. However existing joint methods for dynamic scenes (with multiple people) do not exploit human-pose information often detecting interacting people as a single object [36]. Table 1 shows a comparison between the tasks performed by state-of-the-art methods.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

U4D: Unsupervised 4D Dynamic Scene Understanding

Mustafa

Russell

Hilton

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

View full text Add to dashboard Cite

We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photoconsistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-theart methods on challenging indoor and outdoor sequences demonstrates a significant (≈ 40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.

show abstract

Semantically Coherent Co-Segmentation and Reconstruction of Dynamic Scenes

Cited by 45 publications

References 58 publications

Learning Dense Wide Baseline Stereo Matching for People

Learning Dense Wide Baseline Stereo Matching for People

Deep Semantic Matching with Foreground Detection and Cycle-Consistency

U4D: Unsupervised 4D Dynamic Scene Understanding

Contact Info

Product

Resources

About