2017 IEEE International Conference on Computer Vision (ICCV) 2017
DOI: 10.1109/iccv.2017.393
Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction

Abstract: Figure 1: Online spatio-temporal action localisation in a test 'fencing' video from UCF-101-24 [43]. Panels (a) to (c) show a 3D volumetric view of the video with detection boxes and selected frames at 40%, 80%, and 100% of the video observed. At any given time, a certain portion (%) of the entire video is observed by the system, and the detection boxes are linked up to incrementally build space-time action tubes. Note that the proposed method is able to detect multiple co-occurri…



Cited by 264 publications (399 citation statements). References 63 publications.
“…Action anticipation results for UCF101-24 considering 50% of frames from each video (Table 2):

Method                  Accuracy
Temporal Fusion [11]    86.0
ROAD [47]               92.0
ROAD + BroxFlow [47]    90.0
RBF-RNN [45]            98.0
Proposed                98.9…”
Section: Methods
confidence: 99%
“…UCF101-24 [47] is a subset of the UCF101 dataset. It is composed of 24 action classes in 3207 videos.…”
Section: Datasets
confidence: 99%
“…An extra dropout layer is further added with dropout ratio 0.5 before the softmax/sigmoid layer. Following [17,24,28,38], we also exploit a two-stream pipeline for utilizing multiple modalities, where the RGB frame and the stacked optical flow "image" are considered. To fuse the detection results, a late fusion scheme is adopted to average the classification scores.…”
Section: Implementations
confidence: 99%
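The late-fusion step described in the statement above can be sketched as follows. This is a minimal illustration only: the function name and the equal-weight averaging of the two streams are assumptions, not the cited papers' exact implementation.

```python
import numpy as np

def late_fuse(rgb_scores: np.ndarray, flow_scores: np.ndarray) -> np.ndarray:
    """Late fusion: average per-class classification scores from the
    RGB stream and the optical-flow stream (equal weights assumed)."""
    return (rgb_scores + flow_scores) / 2.0

# Hypothetical per-class scores for one detection from each stream.
rgb = np.array([0.7, 0.2, 0.1])
flow = np.array([0.5, 0.4, 0.1])
fused = late_fuse(rgb, flow)  # -> [0.6, 0.3, 0.1]
```

Because fusion happens after each stream produces its own scores (rather than by merging features), the two networks can be trained and run independently.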
“…For fair comparisons, we also utilize ResNet101 [11] as the backbone in our TPN. Following [17,24,28,38], we report the performance of LSTR with late fusion of RGB and optical flow inputs. Table 4 summarizes video-mAP performance on the UCF-Sports, J-HMDB (3 splits) and UCF-101 datasets with different IoU thresholds δ.…”
Section: Comparison With State-of-the-art
confidence: 99%
“…Singh et al [30] recently developed a method that generates candidate action bounding boxes in frames based on appearance and flow. These bounding boxes are incrementally grouped into action tubes, and those (partial) tubes are assigned a class probability with the Viterbi algorithm.…”
Section: Related Work
confidence: 99%
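The incremental box-linking idea described above can be sketched as a Viterbi-style dynamic program over per-frame detections. This is a simplified sketch under stated assumptions: the transition score (detection score plus IoU overlap with the previous box, with equal weighting) is a hypothetical choice, not the exact formulation of Singh et al.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def viterbi_link(frames):
    """Link one detection per frame into a tube.

    frames: list over time of lists of (box, score) candidates.
    Returns the index of the chosen detection in each frame, maximizing
    cumulative score + IoU continuity (an assumed scoring function).
    """
    prev_dp = [score for _, score in frames[0]]   # best value ending at each box
    backptr = []
    for t in range(1, len(frames)):
        cur_dp, cur_back = [], []
        for box, score in frames[t]:
            # Transition from every candidate in the previous frame.
            vals = [prev_dp[j] + score + iou(frames[t - 1][j][0], box)
                    for j in range(len(frames[t - 1]))]
            j_best = int(np.argmax(vals))
            cur_dp.append(vals[j_best])
            cur_back.append(j_best)
        backptr.append(cur_back)
        prev_dp = cur_dp
    # Backtrack from the best final detection.
    idx = int(np.argmax(prev_dp))
    path = [idx]
    for bk in reversed(backptr):
        idx = bk[idx]
        path.append(idx)
    return path[::-1]

# Hypothetical two-frame example: the high-scoring, overlapping boxes
# (index 0 in both frames) should be linked into one tube.
frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.1)],
    [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.2)],
]
tube = viterbi_link(frames)  # -> [0, 0]
```

Because each new frame only needs the previous frame's dynamic-programming values, linking can run incrementally as frames arrive, which is what makes the online setting of the paper feasible.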