Mid-level Features Improve Recognition of Interactive Activities

Saenko, Kate; Packer, Ben; Chen, C; Bandla, Sunil; Lee, Y; Jia, Yangqing; Niebles, Juan Carlos; Koller, Daphne; Li, Feifei; Grauman, Kristen; Darrell, Trevor

doi:10.21236/ada570728

Cited by 8 publications

(4 citation statements)

References 44 publications


“…However, with the similarity constraints added, our full model dramatically improves on the novel target categories. This again validates our argument that the auxiliary similarity constraints can be used in conjunction with a domain adaptation algorithm to learn a more generalizable target model.…”

Section: Results and Analysis (supporting)

confidence: 87%


Semi-supervised Domain Adaptation with Instance Constraints

Donahue, Hoffman, Rodner, et al. 2013

2013 IEEE Conference on Computer Vision and Pattern Recognition

Self Cite

172


“…Our experiments focus on person detectors, due to the wide interest in and broad applications of pedestrian detection. The source domain has images from the PASCAL VOC 2007 dataset [10], and the target domain consists of frames of the videos from the VisInt dataset [25].…”

Section: Object Detection In Video (mentioning)

confidence: 99%

Semi-supervised Domain Adaptation with Instance Constraints

Donahue, Hoffman, Rodner, et al. 2013

2013 IEEE Conference on Computer Vision and Pattern Recognition

Self Cite

172


“…It is a challenge to obtain high-level descriptions from videos, or to combine empirical measurements with expert knowledge and bridge the gap between low-level features and high-level descriptions. Saenko [40] proposed a mid-level representations, that can bridge the gap between existing low-level models, which are incapable of capturing the structure of interactive verbs, and contemporary high-level schemes, which rely on the output of potentially brittle intermediate detectors and trackers. Sadanand [39] presented Action Bank, a high-level representation of video.…”

Section: Literature Survey (mentioning)

confidence: 99%

Recognition and localization of relevant human behavior in videos

Bouma, Burghouts, Penning, et al. 2013

SPIE Proceedings


Ground surveillance is normally performed by human assets, since it requires visual intelligence. However, especially for military operations, this can be dangerous and is very resource intensive. Therefore, unmanned autonomous visual-intelligence systems are desired. In this paper, we present an improved system that can recognize actions of a human and interactions between multiple humans. Central to the new system is our agent-based architecture. The system is trained on thousands of videos and evaluated on realistic persistent surveillance data in the DARPA Mind's Eye program, with hours of videos of challenging scenes. The results show that our system is able to track the people, detect and localize events, and discriminate between different behaviors, and it performs 3.4 times better than our previous system.


“…It seems intuitively clear that machine vision models should also capture the above reasoning structure, and indeed this has been explored in the past (Gupta & Davis, 2007; Saenko et al., 2012). However, current state-of-the-art video transformer models do not explicitly model objects.…”

Section: Introduction (mentioning)

confidence: 99%

Object-Region Video Transformers

Herzig, Ben-Avraham, Mangalam, et al. 2021

Preprint

Self Cite


Evidence from cognitive psychology suggests that understanding spatio-temporal object interactions and dynamics can be essential for recognizing actions in complex videos. Therefore, action recognition models are expected to benefit from explicit modeling of objects, including their appearance, interaction, and dynamics. Recently, video transformers have shown great success in video understanding, exceeding CNN performance. Yet, existing video transformer models do not explicitly model objects. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric spatio-temporal representations throughout multiple transformer layers. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" element applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on standard and compositional action recognition on Something-Something V2, standard action recognition on Epic-Kitchen100 and Diving48, and spatio-temporal action detection on AVA. We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/ORViT/.
