I Know the Relationships: Zero-Shot Action Recognition via Two-Stream Graph Convolutional Networks and Knowledge Graphs

Gao, Junyu; Zhang, Tianzhu; Xu, Changsheng

doi:10.1609/aaai.v33i01.33018303

Cited by 165 publications

(81 citation statements)

References 27 publications


“…Table 1 presents [flattened table omitted: Phrase Detection, Recall at 50 and 100 for rel=1, rel=10, rel=70; VTransE [34] row truncated] 2) Adopting a higher order structured output space ( = 40) outperforms lower order one ( = 150) which verifies the effectiveness of the HSA module.…”

Section: Ablation Study (mentioning)

confidence: 76%

“…In this way, the correlations between known classes and unknown classes can help to transfer the knowledge learned from the training classes to the unknown test classes by mapping the embeddings to visual classifiers. Knowledge Graphs (KGs) effectively capture explicit relational knowledge about individual entities hence many methods [4,6,9,25,33] use KGs to learn the class correlations. In scene graph generation, the relation classes are correlated by object classes as in the knowledge graph and the structural information is vital for a well-defined output space.…”

Section: Related Work (mentioning)

confidence: 99%


HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation

Meng, Yuan, Yue, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Scene graph generation aims to produce structured representations for images, which requires understanding the relations between objects. Due to the continuous nature of deep neural networks, the prediction of scene graphs is divided into object detection and relation classification. However, the independent relation classes cannot separate the visual features well. Although some methods organize the visual features into graph structures and use message passing to learn contextual information, they still suffer from drastic intra-class variations and unbalanced data distributions. One important factor is that they learn an unstructured output space that ignores the inherent structures of scene graphs. Accordingly, in this paper, we propose a Higher Order Structure Embedded Network (HOSE-Net) to mitigate this issue. First, we propose a novel structure-aware embedding-to-classifier (SEC) module to incorporate both local and global structural information of relationships into the output space. Specifically, a set of context embeddings are learned via local graph based message passing and then mapped to a global structure based classification space. Second, since learning too many context-specific classification subspaces can suffer from data sparsity issues, we propose a hierarchical semantic aggregation (HSA) module to reduce the number of subspaces by introducing higher order structural information. HSA is also a fast and flexible tool to automatically search a semantic object hierarchy based on relational knowledge graphs. Extensive experiments show that the proposed HOSE-Net achieves state-of-the-art performance on two popular benchmarks, Visual Genome and VRD.


“…Action recognition is the task of recognising the sequence of actions from the frames of a video. However, if the new actions are not available when training, Zero-shot learning can be a solution, such as in [45,107,124,149]. Zero-shot Style Transfer in an image is the problem of transferring the texture of source image to target image while the style is not pre-determined and it is arbitrary [151].…”

Section: Applications (mentioning)

confidence: 99%

Zero-Shot Learning and its Applications from Autonomous Vehicles to COVID-19 Diagnosis: A Review

Rezaei, Shahidi 2020

SSRN Journal


The challenge of learning a new concept, object, or a new medical disease recognition without receiving any examples beforehand is called Zero-Shot Learning (ZSL). One of the major issues in deep learning based methodologies such as in Medical Imaging and other real-world applications is the requirement of large annotated datasets prepared by clinicians or experts to train the model. ZSL is known for having minimal human intervention by relying only on previously known or trained concepts plus currently existing auxiliary information. This is ever-growing research for the cases where we have very limited or no annotated datasets available and the detection/recognition system has human-like characteristics in learning new concepts. This makes the ZSL applicable in many real-world scenarios, from unknown object detection in autonomous vehicles to medical imaging and unforeseen diseases such as COVID-19 Chest X-Ray (CXR) based diagnosis. In this review paper, we introduce a novel and broadened solution called Few/one-shot learning, and present the definition of the ZSL problem as an extreme case of the few-shot learning. We review the fundamentals and the challenging steps of Zero-Shot Learning, including state-of-the-art categories of solutions, as well as our recommended solution, motivations behind each approach, their advantages over each category to guide both clinicians and AI researchers to proceed with the best techniques and practices based on their applications. Inspired from different settings and extensions, we then review different datasets including medical and non-medical images, the variety of splits, and the evaluation protocols proposed so far. Finally, we discuss the recent applications and future directions of ZSL. We aim to convey a useful intuition through this paper towards the goal of handling complex learning tasks more similar to the way humans learn. We mainly focus on two applications in the current modern yet challenging era: coping with an early and fast diagnosis of COVID-19 cases, and also encouraging the readers to develop other similar AI-based automated detection/recognition systems using ZSL.


“…But these methods usually ignore the temporal information of videos, which take significant advantages for visual understanding [32]. [5] proposed a zero-shot action recognition framework using both the visual clues and external knowledge to show relations between objects and actions, also applied self-attention to model the temporal information of videos. Transfer learning: In many real-world applications, it is expensive to re-collect training data and re-model when task changes [33].…”

Section: Related Work (mentioning)

confidence: 99%

Action Recognition in Untrimmed Videos with Composite Self-attention Two-Stream Framework

Cao, Xu, Chen 2020

Lecture Notes in Computer Science


With the rapid development of deep learning algorithms, action recognition in video has achieved many important research results. One issue in action recognition, Zero-Shot Action Recognition (ZSAR), has recently attracted considerable attention; it classifies new categories without any positive examples. Another difficulty in action recognition is that untrimmed data may seriously affect model performance. We propose a composite two-stream framework with a pre-trained model. Our proposed framework includes a classifier branch and a composite feature branch. The graph network model is adopted in each of the two branches, which effectively improves the feature extraction and reasoning ability of the framework. In the composite feature branch, 3-channel self-attention models are constructed to weight each frame in the video and give more attention to the key frames. Each self-attention model channel outputs a set of attention weights to focus on a particular aspect of the video, and a set of attention weights corresponds to a one-dimensional vector. The 3-channel self-attention models can evaluate key frames from multiple aspects, and the output sets of attention weight vectors form an attention matrix, which effectively enhances the attention of key frames with strong correlation of action. This model can implement action recognition under zero-shot conditions, and has good recognition performance for untrimmed video data. Experimental results on relevant data sets confirm the validity of our model.

