2022
DOI: 10.1109/tcsvt.2022.3169842
Attention in Attention: Modeling Context Correlation for Efficient Video Classification

Cited by 32 publications (7 citation statements)
References 49 publications
“…Earlier approaches for video retrieval mainly revolve around code books [2,20,22] and hashing functions [32,33] for encoding a video into a low-dimensional representation. Fueled by the success of deep learning [6,10,25,27,28,41] in recent years, the predominant approaches are to decompose the video into frames and feed them into an image extraction backbone network, obtaining a sequence of image feature representations. One approach is to fuse all these image features into a single video-level representation and perform similar video pair detection on video-level representations [21,23,24].…”
Section: Video Retrieval
confidence: 99%
“…This method uses a global context pooling mechanism to enhance the spatially informative channels and was verified to be effective in image understanding tasks. A recent work by Hao et al [26] studied the insertion of channel context into the spatio-temporal attention learning block for element-wise feature refinement.…”
Section: Related Work
confidence: 99%
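The global context pooling described above follows the familiar squeeze-and-excitation pattern: spatially pool each channel into a descriptor, pass it through a small bottleneck, and use the resulting gate to reweight channels. The sketch below is a minimal NumPy illustration of that pattern, not the cited method's implementation; the function name, weight shapes, and bottleneck size are assumptions.

```python
import numpy as np

def global_context_pooling(x, w1, w2):
    """Illustrative SE-style channel reweighting.

    x  : (C, H, W) feature map
    w1 : (R, C) bottleneck projection (R < C)
    w2 : (C, R) expansion back to C channels
    """
    # Squeeze: global average pool over the spatial dims -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP, ReLU then sigmoid gate in (0, 1)
    h = np.maximum(w1 @ z, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))          # (C,)
    # Reweight: broadcast the per-channel gate over H and W
    return x * gate[:, None, None]
```

Because the gate lies in (0, 1), the module can only attenuate channels relative to the input, which is what lets it emphasize the "spatially informative" ones.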
“…The Stand-alone Inter-Frame Attention [62] is an attention mechanism that operates across multiple frames, computing local self-attention for every spatial position. Hao et al [63] propose an effective attention-in-attention technique for enhancing element-wise features, exploring the possibility of integrating channel context into the spatio-temporal attention learning module. Visual attention network [64] uses a large kernel attention to support the establishment of self-adaptive and extended-range correlations of self-attention.…”
Section: Attention Mechanism
confidence: 99%
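The idea of integrating channel context into spatio-temporal attention can be caricatured as combining two cheap attention maps, one over channels and one over space-time, into a single element-wise refinement. The sketch below is only an illustration of that combination under assumed shapes; it is not the attention-in-attention module of [63].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def channel_context_st_attention(x):
    """Illustrative element-wise refinement for video features.

    x : (T, C, H, W) clip features
    """
    # Channel context: pool over time and space -> per-channel gate (C,)
    chan = sigmoid(x.mean(axis=(0, 2, 3)))
    # Spatio-temporal map: pool over channels -> per-position gate (T, H, W)
    st = sigmoid(x.mean(axis=1))
    # Element-wise refinement: broadcast the two gates into a (T, C, H, W) map
    attn = chan[None, :, None, None] * st[:, None, :, :]
    return x * attn
```

The point of the factorization is cost: two pooled gates of sizes C and T·H·W replace a full T·C·H·W attention tensor computed from scratch.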