2017
DOI: 10.1007/s11263-017-1013-y

Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos

Abstract: Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memo…
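
The labeling setup the abstract describes is concrete enough to sketch: every frame carries an independent binary target per action class, so multiple actions can co-occur at any moment. Below is a minimal, hypothetical illustration of that per-frame multi-label formulation in PyTorch; the linear classifier and all dimensions except the 65 MultiTHUMOS classes are placeholders, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes: 30 frames, 65 action classes (MultiTHUMOS), 512-d features.
T, C, D = 30, 65, 512

frame_feats = torch.randn(1, T, D)                 # per-frame CNN features
targets = torch.randint(0, 2, (1, T, C)).float()   # dense 0/1 label per (frame, class)

# A plain linear head stands in for the paper's LSTM variant: each frame
# gets an independent sigmoid score for each class, so labels can overlap.
classifier = nn.Linear(D, C)
logits = classifier(frame_feats)                   # (1, T, C)
loss = F.binary_cross_entropy_with_logits(logits, targets)
```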

Cited by 322 publications (302 citation statements). References 46 publications.
“…In [27] an autoencoder-like LSTM architecture is proposed such that either the current frame or the next frame is accurately reconstructed. Finally, the authors of [32] propose an LSTM with a temporal attention model for densely labelling video frames.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
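
The temporal attention model of [32] is only named in the statement above. As a rough, hypothetical illustration of the idea (not the authors' exact architecture), soft attention over a window of neighboring frame features, conditioned on the recurrent hidden state, could look like this:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Soft attention over a window of frame features (illustrative sketch)."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        # Assumed scorer: one scalar per frame from [feature; hidden state].
        self.score = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, window_feats, h_prev):
        # window_feats: (B, W, D) features of W neighboring frames
        # h_prev:       (B, H)    previous recurrent hidden state
        B, W, _ = window_feats.shape
        h_rep = h_prev.unsqueeze(1).expand(B, W, -1)
        scores = self.score(torch.cat([window_feats, h_rep], dim=-1))  # (B, W, 1)
        weights = torch.softmax(scores, dim=1)      # normalize over the window
        return (weights * window_feats).sum(dim=1)  # (B, D) attended input

attn = TemporalAttention(feat_dim=512, hidden_dim=256)
attended = attn(torch.randn(2, 9, 512), torch.randn(2, 256))  # (2, 512)
```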
“…In [3], the detector is trained on CNN features extracted from the action tubes in space-time; however, evaluation is on relatively short video clips (i.e., several hundred frames) of relatively short actions. In [27] an LSTM is trained that takes CNN features of multiple neighboring frames as input to detect actions at every frame; while their model is similar to ours, they focus on detecting simple actions such as "stand up" that last only for a few video frames, and the training loss accounts only for classification errors. In this work, we focus on accurately localizing activities that are long and complex by learning and enforcing activity progression as part of the LSTM learning objective.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
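
As a hedged sketch of the setup this statement attributes to [27], an LSTM can be fed the concatenated CNN features of each frame's temporal neighborhood and trained with a per-frame classification loss; the window size, dimensions, and multi-label loss below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D, C, W = 100, 512, 65, 5     # frames, feature dim, classes, window (assumed)

feats = torch.randn(1, T, D)                      # per-frame CNN features
labels = torch.randint(0, 2, (1, T, C)).float()   # per-frame action labels

# Give the LSTM local temporal context: stack each frame with its
# W-1 neighbors, padding the ends of the sequence with zeros.
pad = W // 2
padded = F.pad(feats, (0, 0, pad, pad))           # pad along the time axis
windows = torch.cat([padded[:, t:t + W].reshape(1, 1, W * D)
                     for t in range(T)], dim=1)   # (1, T, W*D)

lstm = nn.LSTM(W * D, 256, batch_first=True)
head = nn.Linear(256, C)
out, _ = lstm(windows)                            # (1, T, 256)
loss = F.binary_cross_entropy_with_logits(head(out), labels)
```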
“…We use the LSTM described in [16] that applies dropout on non-recurrent connections. A similar model has been used in [27] for detecting relatively short actions. Our key contributions are in exploring rank losses during training that encourage monotonicity in the detection score and margin produced by the model as a training activity progresses.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
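
The rank loss is only named in the statement above; one plausible instantiation of a loss that encourages detection scores to be non-decreasing while an activity progresses is a pairwise hinge over consecutive frames. The margin and pairing scheme in this sketch are assumptions, not necessarily the cited paper's formulation:

```python
import torch

def monotonicity_rank_loss(scores, margin=0.0):
    """Penalize drops in the ongoing activity's detection score.

    scores: (T,) detection scores for the ground-truth activity at each
    frame while the activity is in progress.
    """
    # Hinge on each consecutive pair: the score should not fall by more
    # than `margin` as the activity moves forward in time.
    diffs = scores[:-1] - scores[1:] + margin
    return torch.clamp(diffs, min=0).mean()

# A trajectory that dips in the middle incurs a penalty.
traj = torch.tensor([0.2, 0.4, 0.3, 0.6, 0.7])
print(monotonicity_rank_loss(traj))  # tensor(0.0250), from the 0.4 -> 0.3 drop
```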
“…Yeung et al uses recurrent neural network-based models for dense activity recognition [66] and action detection from frame glimpses [67]. Haque et al [24] use a Recurrent Attention Model (RAM) to re-identify humans.…”
Section: Related Work (citation type: mentioning)
confidence: 99%