Proceedings of the 27th ACM International Conference on Multimedia 2019
DOI: 10.1145/3343031.3350978
Long Short-Term Relation Networks for Video Action Detection

Abstract: It has been well recognized that modeling human-object or object-object relations is helpful for the detection task. Nevertheless, the problem is not trivial, especially when exploring the interactions between human actor, object and scene (collectively, human-context) to boost video action detectors. The difficulty originates from the fact that reliable relations in a video should depend not only on the short-term human-context relation in the present clip but also on the temporal dynamics distilled over a long-r…
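The abstract's core idea, fusing a short-term (per-clip) feature with temporal dynamics accumulated over a longer horizon, can be illustrated generically. The sketch below is not the paper's actual architecture; it is a hedged illustration in which each clip's feature is paired with a hypothetical long-term exponential moving average over the clips seen so far (function name and fusion scheme are assumptions).

```python
import numpy as np

def fuse_long_short(clip_feats, momentum=0.9):
    """Hedged illustration (not the paper's method): combine each
    clip's short-term feature with a long-term running average
    accumulated over all preceding clips."""
    long_term = np.zeros_like(clip_feats[0])
    fused = []
    for f in clip_feats:
        # long-term memory: exponential moving average over clips so far
        long_term = momentum * long_term + (1 - momentum) * f
        # simple fusion: concatenate short- and long-term views
        fused.append(np.concatenate([f, long_term]))
    return np.stack(fused)
```

For a sequence of T clips with d-dimensional features, this returns a (T, 2d) array where each row carries both the present clip and its long-range context.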

Cited by 22 publications (11 citation statements); References 48 publications.
“…For example, [4,23,44] adopts GCN to build a reasoning module to model the relations between disjoint and distant regions. [21,48] takes dense object proposals as graph nodes and learns the relations between them. [22] treats each object proposal detected in the sample frames as a graph node and then searches adaptive network structures to model the object interactions.…”
Section: Related Work (mentioning); confidence: 99%
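The citation statement above describes a common pattern in this line of work: treating object proposals as graph nodes and learning pairwise relations between them. As a hedged, minimal sketch (not any cited paper's exact module), the aggregation step can be written as dot-product attention over proposal features:

```python
import numpy as np

def relation_module(feats):
    """Minimal sketch of a graph-style relation module: each object
    proposal (a graph node) aggregates features from all proposals,
    weighted by dot-product affinities. Learned projections are
    omitted for brevity."""
    n, d = feats.shape
    # pairwise affinity between every pair of proposals
    logits = feats @ feats.T / np.sqrt(d)
    # softmax over neighbours so each row of weights sums to 1
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    # each node's updated feature is an affinity-weighted sum
    return attn @ feats, attn
```

In practice such modules add learned query/key/value projections and stack several relation layers; the sketch keeps only the relational aggregation itself.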
“…Multi-modal action indeed has extended its popularity to many applications including recognition [47], generative multi-view action [40], detection [16], prediction [50], egocentric action [14], video identification [57], emotion with concept selection [48], personalized recommendation [44], human-object contour [53] and electromyography-vision [41]. The development of a low-cost depth sensor (Microsoft Kinect) opens up a new dimension of tackling the tasks of human action recognition.…”
Section: Related Work, 2.1 Multi-modal Action Recognition (mentioning); confidence: 99%
“…Deep Neural Networks have shown superior and impressive performance on many tasks and even are the state-of-the-art methods in many real-world applications such as image classification [34], Poster Session D3: Multimedia Analysis and Description & Multimedia Fusion and Embedding MM '20, October 12-16, 2020, Seattle, WA, USA object detection [16,38], video action recognition [14,47,50,57], machine translation [3,25,35], and speech recognition [28]. However, these deep neural networks are notoriously well-known for their vulnerability [2,8].…”
Section: Introduction (mentioning); confidence: 99%
“…detection [2,12,22] etc., it is still difficult for the machine to understand video content in a fine-grained and structured level. To tackle this issue, visual relation is one of the most important and useful information that can help describe the dynamic interactions between the objects in a video.…”
Section: Introduction (mentioning); confidence: 99%