2016
DOI: 10.1007/978-3-319-46448-0_31
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

Abstract: Computer vision has great potential to help in our daily lives: searching for lost keys, watering flowers, or reminding us to take a pill. To succeed at such tasks, computer vision methods need to be trained on real and diverse examples of our daily dynamic scenes. Most of these scenes are not particularly exciting, so they typically do not appear on YouTube, in movies, or in TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood …

Cited by 675 publications (446 citation statements)
References 35 publications
“…Thus we believe that our problem is a natural reflection of the kinds of learning that people employ to learn to recognize newly named objects. Contemporary to our work, Sigurdsson et al. (2016) proposed an interesting new dataset, Charades, in which hundreds of people record videos in their homes acting out casual everyday activities. We leave the application of our method to this dataset for future work.…”
Section: Discussion (citation type: mentioning; confidence: 99%)
“…To evaluate the effectiveness of our temporal reasoning graph, we perform extensive experiments on three benchmark datasets for activity recognition: Something-Something V1 [9], Something-Something V2 [16], and Charades [29]. We first introduce these datasets and the implementation details.…”
Section: Methods (citation type: mentioning; confidence: 99%)
“…Charades [25] is an untrimmed, multi-action dataset containing 11,848 videos split into 7,985 for training, 1,863 for validation, and 2,000 for testing. It has 157 action categories, with several fine-grained categories.…”
Section: Datasets (citation type: mentioning; confidence: 99%)
“…In the classification task, we concatenate the two-stream features and apply a sliding-window pooling scheme to create multiple descriptors. Following the evaluation protocol in [25], we use the output probability of the classifier as the score of the sequence. In the detection task, we adopt the evaluation method with post-processing proposed in [81], which uses the averaged prediction score of a temporal window around each temporal pivot.…”
Section: Action Recognition/Detection in Untrimmed Videos (citation type: mentioning; confidence: 99%)
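The detection post-processing quoted above (averaging prediction scores in a temporal window around evenly spaced pivots) can be sketched roughly as follows. This is a hypothetical illustration, not the cited authors' code; the function name `pivot_scores`, the number of pivots, and the window size are assumptions for the sake of the example.

```python
import numpy as np

def pivot_scores(frame_scores, num_pivots=25, window=10):
    """Average per-frame class scores in a temporal window around
    equally spaced pivots (a sketch of the described post-processing).

    frame_scores: array of shape (T, C), per-frame class probabilities.
    Returns an array of shape (num_pivots, C), one averaged score
    vector per pivot.
    """
    T, C = frame_scores.shape
    # Equally spaced pivot frames over the video's length.
    pivots = np.linspace(0, T - 1, num_pivots).astype(int)
    out = np.empty((num_pivots, C))
    for i, p in enumerate(pivots):
        # Clamp the window to the video boundaries.
        lo, hi = max(0, p - window), min(T, p + window + 1)
        out[i] = frame_scores[lo:hi].mean(axis=0)
    return out

# Example: a 300-frame video scored over the 157 Charades classes.
scores = np.random.rand(300, 157)
print(pivot_scores(scores).shape)  # (25, 157)
```

Averaging over a window rather than taking the single frame at each pivot smooths out per-frame noise, which is the stated purpose of the post-processing step.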