2018
DOI: 10.1109/TCSVT.2017.2764624

Learning From Web Videos for Event Classification

Abstract: Traditional approaches for classifying event videos rely on a manually curated training dataset. While this paradigm has achieved excellent results on benchmarks such as the TRECVID multimedia event detection (MED) challenge datasets, it is restricted by the effort involved in careful annotation. Recent approaches have attempted to address the need for annotation by automatically extracting images from the web, or generating queries to retrieve videos. In the former case, they fail to exploit additional c…

Cited by 7 publications (3 citation statements)
References 40 publications (102 reference statements)
“…Visual concept learning: We find that works by Binder et al. [1], Zhou et al. [56], and Chesneau et al. [4] are closest to ours. While [1, 56] focus on recognizing more complex visual concepts, beyond objects in the image domain, we introduce win-fail recognition in the video domain for deeper human action understanding.…”
supporting
confidence: 73%
“…While [1, 56] focus on recognizing more complex visual concepts, beyond objects in the image domain, we introduce win-fail recognition in the video domain for deeper human action understanding. Chesneau et al. [4] address recognizing concepts like 'Birthday Party,' 'Grooming an Animal,' and 'Unstuck a Vehicle' in web videos. However, these concepts do not have large intra-class variance like ours, and are less complex and challenging.…”
mentioning
confidence: 99%
“…Self-supervised video representation learning methods utilize the correspondence between multiple data streams so that the generated video representation can take the correlation of various modalities of data into consideration. Chesneau et al. [37] automatically collect a training set from web videos according to a given textual description and establish a mapping between the textual description and the video representation, whereas our method does not require a lot of textual description. Mahendran et al. [22] design an auxiliary task based on the correlation verification of RGB video frames and optical flow.…”
Section: Learning From the Correspondence Between Multiple Data Streams
mentioning
confidence: 99%