2018
DOI: 10.1109/TCSVT.2017.2764624

Learning From Web Videos for Event Classification

Abstract: Traditional approaches for classifying event videos rely on a manually curated training dataset. While this paradigm has achieved excellent results on benchmarks such as the TRECVID multimedia event detection (MED) challenge datasets, it is restricted by the effort involved in careful annotation. Recent approaches have attempted to address the need for annotation by automatically extracting images from the web, or generating queries to retrieve videos. In the former case, they fail to exploit additional c…

Cited by 7 publications (3 citation statements)
References 40 publications (102 reference statements)
“…Visual concept learning: We find that works by Binder et al. [1], Zhou et al. [56], and Chesneau et al. [4] are closest to ours. While [1, 56] focus on recognizing more complex visual concepts, beyond objects in the image domain, we introduce win-fail recognition in the video domain for deeper human action understanding.…”
supporting
confidence: 73%
“…While [1, 56] focus on recognizing more complex visual concepts, beyond objects in the image domain, we introduce win-fail recognition in the video domain for deeper human action understanding. Chesneau et al. [4] address recognizing concepts like 'Birthday Party,' 'Grooming an Animal,' and 'Unstuck a Vehicle' in web videos. However, these concepts do not have large intra-class variance like ours, and are less complex and challenging.…”
mentioning
confidence: 99%
“…Self-supervised video representation learning methods utilize the correspondence between multiple data streams so that the generated video representation can take the correlation of various modalities of data into consideration. Chesneau et al. [37] automatically collect a training set from web videos according to a given textual description and establish a mapping between the textual description and the video representation, whereas our method does not require a lot of textual description. Mahendran et al. [22] design an auxiliary task based on the correlation verification of RGB video frames and optical flow.…”
Section: Learning From the Correspondence Between Multiple Data Streams
mentioning
confidence: 99%