2020
DOI: 10.48550/arxiv.2003.10469
Preprint
Learning Object Permanence from Video

Abstract: Object Permanence allows people to reason about the location of non-visible objects, by understanding that they continue to exist even when not perceived directly. Object Permanence is critical for building a model of the world, since objects in natural visual scenes dynamically occlude and contain each other. Intensive studies in developmental psychology suggest that object permanence is a challenging task that is learned through extensive experience. Here we introduce the setup of learning Object Permanence f…

Cited by 1 publication (6 citation statements) | References 33 publications (39 reference statements)
“…The number of attention heads for the Transformer (and Hopper-transformer) was set to 2, the number of transformer layers was set to 5 to match the 5 hops in our Multi-hop Transformer, and the Transformer dropout rate was set to 0.1. For OPNet-related experiments, we used the implementation provided by the authors (Shamsian et al., 2020). We verified we could reproduce their results under 24 FPS on CATER by using their provided code and trained models.…”
Section: H3 Baselines (mentioning, confidence: 99%)
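The baseline hyperparameters quoted in this statement can be captured in a minimal sketch; the dataclass and field names below are illustrative assumptions, not identifiers from the cited code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerBaselineConfig:
    # Values quoted from the citing paper's baseline setup.
    num_attention_heads: int = 2  # attention heads per layer
    num_layers: int = 5           # matches the 5 hops in the Multi-hop Transformer
    dropout: float = 0.1          # transformer dropout rate


cfg = TransformerBaselineConfig()
```

A frozen dataclass keeps the reported settings immutable, which makes it easy to log or compare configurations across baseline runs.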
“…Outputs from DETR are transformed object representations that are used as inputs to a multilayer perceptron (MLP) to predict the bounding box and class label of every object. For Snitch Localization, DETR is trained on object annotations from LA-CATER (Shamsian et al., 2020).…”
Section: Object Detection and Representation (mentioning, confidence: 99%)
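The prediction heads this statement describes — an MLP mapping each object representation to a bounding box and class logits — can be sketched in outline. The layer sizes and helper names below are illustrative assumptions, not DETR's actual dimensions:

```python
import random

random.seed(0)


def linear(x, w, b):
    # y = W x + b for a single input vector x
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_layer(n_in, n_out):
    # Random weights stand in for trained parameters (assumption).
    w = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


D, H, NUM_CLASSES = 8, 16, 5  # illustrative sizes, not DETR's real dims

w1, b1 = make_layer(D, H)
w_box, b_box = make_layer(H, 4)                 # bounding box: (cx, cy, w, h)
w_cls, b_cls = make_layer(H, NUM_CLASSES + 1)   # class logits + "no object"


def predict(obj_repr):
    # One shared hidden layer, then separate box and class outputs.
    h = relu(linear(obj_repr, w1, b1))
    return linear(h, w_box, b_box), linear(h, w_cls, b_cls)


box, logits = predict([0.5] * D)
```

Each object query yields a 4-number box and a logit per class (plus a "no object" slot), which is the shape of output the quoted pipeline feeds into Snitch Localization.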