2018
DOI: 10.1007/978-3-030-01234-2_44

SDC-Net: Video Prediction Using Spatially-Displaced Convolution

Fig. 1. Frame prediction on a YouTube video frame featuring a panning camera. Left to right: MCNet [34] result and our SDC-Net result. The SDC-Net predicted frame is sharper and preserves fine image details, while color distortion and blurriness are seen in the tree and text in MCNet's predicted frame.

Abstract: We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows. Previous approaches rely on resampling past frames, guided by a learned fu…
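The abstract refers to prior approaches that resample past frames under a learned optical flow. As a rough illustration only, the sketch below shows such flow-guided resampling (backward warping) in PyTorch; the function name, tensor shapes, and pixel-unit flow convention are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of flow-based frame resampling: warp the most recent
# frame with a predicted backward optical flow to synthesize the next frame.
# Shapes and the flow convention (pixel units, pointing from t+1 back to t)
# are assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def warp_frame(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """frame: (N, C, H, W); flow: (N, 2, H, W) in pixel units."""
    n, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=frame.dtype, device=frame.device),
        torch.arange(w, dtype=frame.dtype, device=frame.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # horizontal displacement
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # vertical displacement
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```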

Cited by 98 publications (58 citation statements)
References 36 publications
“…We hope our approach inspires other ways to perform data augmentation, such as GANs [26], to enable cheap dataset collection and achieve improved accuracy in target tasks. For future work, we would like to explore soft label relaxation using the learned kernels in [34] for better uncertainty reasoning. Our state-of-the-art implementation will be made publicly available to the research community.…”
Section: Results (mentioning)
confidence: 99%
“…In our implementation, we use the vector-based architecture as described in [34]. G is a fully convolutional U-net architecture, complete with an encoder and decoder and skip connections between encoder/decoder layers of the same output dimensions.…”
Section: A. Implementation Details of Our Video Prediction/Reconstruction (mentioning)
confidence: 99%
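The quoted statement describes the generator G as a fully convolutional U-Net with skip connections between encoder and decoder layers of matching resolution. A minimal sketch of such a generator follows; the channel widths, depth, activations, and up-sampling choices are illustrative assumptions, not the cited implementation.

```python
# Hypothetical U-Net-style generator G: encoder/decoder with skip connections
# between layers of the same spatial size. Assumes input H and W divisible by 8.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
    )

class UNetG(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, widths=(32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList()
        ch = in_ch
        for w in widths:                      # encoder path
            self.enc.append(conv_block(ch, w))
            ch = w
        self.pool = nn.AvgPool2d(2)
        self.bottleneck = conv_block(widths[-1], widths[-1] * 2)
        self.up = nn.ModuleList()
        self.dec = nn.ModuleList()
        ch = widths[-1] * 2
        for w in reversed(widths):            # decoder path
            self.up.append(nn.ConvTranspose2d(ch, w, 2, stride=2))
            self.dec.append(conv_block(2 * w, w))  # concat with matching skip
            ch = w
        self.head = nn.Conv2d(ch, out_ch, 1)

    def forward(self, x):
        skips = []
        for enc in self.enc:
            x = enc(x)
            skips.append(x)                   # keep feature map for skip
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)
```

Feeding such a generator the stacked past frames and flows, and reading out per-pixel flow and kernel maps, would follow this template; the exact inputs and outputs of the cited model are not specified here.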
“…On the contrary, motion-based methods [8,9] excel in making sharp predictions, yet fail in occlusion areas where motion predictions are erroneous or ill-defined. Meanwhile, Reda et al. [34] propose to model moving appearances with both convolutional kernels as in [10] and vectors as optical flow. Our closest prior work is [11], which also composes the pixel- and flow-based predictions through occlusion maps.…”
Section: High-fidelity Video Prediction (mentioning)
confidence: 99%
“…[28] untangles the memory of the past from the prediction of the future by learning to predict sampling kernels. [22] combines flow-based and kernel-based approaches to learn a model to predict a motion vector and a kernel simultaneously for each pixel.…”
Section: Related Work (mentioning)
confidence: 99%
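The two statements above describe the paper's hybrid of per-pixel motion vectors and per-pixel adaptive kernels. As an illustration only, the sketch below approximates that idea by first warping the previous frame with the predicted vectors (e.g. using the warp_frame helper sketched earlier) and then applying a predicted k x k kernel at every pixel; the warp-then-filter decomposition, kernel size, and softmax normalization are assumptions, not the paper's exact operator.

```python
# Hypothetical per-pixel adaptive filtering step applied after flow warping,
# approximating a vector-plus-kernel (spatially-displaced convolution) style
# synthesis. kernels holds one k x k weight map per output pixel.
import torch
import torch.nn.functional as F

def adaptive_kernel_filter(warped: torch.Tensor, kernels: torch.Tensor, k: int = 5):
    """warped: (N, C, H, W); kernels: (N, k*k, H, W)."""
    n, c, h, w = warped.shape
    kernels = torch.softmax(kernels, dim=1)                    # normalize weights per pixel
    patches = F.unfold(warped, kernel_size=k, padding=k // 2)  # (N, C*k*k, H*W)
    patches = patches.view(n, c, k * k, h, w)
    return (patches * kernels.unsqueeze(1)).sum(dim=2)         # (N, C, H, W)
```

In a full model, a network such as the U-Net sketched above could predict both the flow field and the k*k kernel maps from the stacked past frames and flows, with the warp and adaptive filter applied on top of the most recent frame.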