Explore Video Clip Order With Self-Supervised and Curriculum Learning for Video Applications

Wang

IEEE Trans. on Image Process.

et al. 2022

108

In an attempt to fully explore the fine-grained temporal structure and global-local chronological characteristics for self-supervised video representation learning, this work takes a closer look at exploiting the temporal structure of videos and proposes a novel self-supervised method named Temporal Contrastive Graph (TCG). In contrast to existing methods that randomly shuffle the video frames or video snippets within a video, our proposed TCG is rooted in a hybrid graph contrastive learning strategy that treats the inter-snippet and intra-snippet temporal relationships as self-supervision signals for temporal representation learning. To increase the temporal diversity of features more comprehensively and precisely, our proposed TCG integrates prior knowledge about the frame and snippet orders into temporal contrastive graph structures, i.e., the intra-/inter-snippet temporal contrastive graph modules. By randomly removing edges and masking node features of the intra-snippet graphs or inter-snippet graphs, our TCG can generate different correlated graph views. Then, specific contrastive losses are designed to maximize the agreement between node embeddings in different views. To learn the global context representation and recalibrate the channel-wise features adaptively, we introduce an adaptive video snippet order prediction module, which leverages the relational knowledge among video snippets to predict the actual snippet orders. Experimental results demonstrate the superiority of our TCG over state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
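The view-generation and contrastive steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_view`, `contrastive_loss`), the probabilities, the temperature, and the InfoNCE-style loss are all assumptions chosen for illustration, and the masked node features stand in for the embeddings that a graph encoder would normally produce from the kept edges.

```python
import math
import random


def make_view(edges, features, drop_prob, mask_prob, rng):
    """Build one augmented graph view (hypothetical helper).

    Each edge is dropped with probability drop_prob, and each feature
    dimension is zeroed for all nodes with probability mask_prob.
    """
    kept_edges = [e for e in edges if rng.random() >= drop_prob]
    dim = len(features[0])
    keep = [0.0 if rng.random() < mask_prob else 1.0 for _ in range(dim)]
    masked = [[x * k for x, k in zip(row, keep)] for row in features]
    return kept_edges, masked


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def contrastive_loss(z1, z2, temperature=0.5):
    """InfoNCE-style loss (an assumed stand-in for the paper's losses).

    Node i in view 1 is pulled toward node i in view 2; all other nodes
    in view 2 act as negatives.
    """
    total = 0.0
    for i, zi in enumerate(z1):
        logits = [cosine(zi, zj) / temperature for zj in z2]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(z1)


# Toy snippet graph: 4 nodes in a chain, 3-dimensional node features.
rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3)]
features = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0], [0.5, 0.0, 1.0]]

edges_a, feats_a = make_view(edges, features, 0.3, 0.3, rng)
edges_b, feats_b = make_view(edges, features, 0.3, 0.3, rng)
# A real model would encode (edges_a, feats_a) and (edges_b, feats_b)
# with a graph network; here the masked features are used directly.
loss = contrastive_loss(feats_a, feats_b)
```

Minimizing such a loss pushes the two views of the same node to agree while keeping different nodes apart, which is the general idea the abstract attributes to the intra- and inter-snippet graph modules.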

Section: Sample and Shuffle (mentioning)

confidence: 99%

TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning

Wang

IEEE Trans. on Image Process.

et al. 2022

108

“…3. First, we follow previous video representation learning works [48,47] and perform late feature fusion, where we extract features from the input shot clips and then perform a hierarchical fusion of features. The combined features are then passed to a classifier network to predict the order (see Fig.…”

Section: Shot Sequence Ordering (mentioning)

confidence: 99%

The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing

Argaw,

Heilbron,

Lee

et al. 2022

Preprint

Machine learning is transforming the video editing industry. Recent advances in computer vision have leveled up video editing tasks such as intelligent reframing, rotoscoping, color grading, and applying digital makeup. However, most of the solutions have focused on video manipulation and VFX. This work introduces the Anatomy of Video Editing, a dataset and benchmark to foster research in AI-assisted video editing. Our benchmark suite focuses on video editing tasks beyond visual effects, such as automatic footage organization and assisted video assembling. To enable research on these fronts, we annotate more than 1.5M tags, with concepts relevant to cinematography, from 196,176 shots sampled from movie scenes. We establish competitive baseline methods and detailed analyses for each of the tasks. We hope our work sparks innovative research in underexplored areas of AI-assisted video editing. Code is available at: https://github.com/dawitmureja/AVE.git.

Proceedings of the 29th ACM International Conference on Multimedia

“…Such a human-like learning strategy considerably improves the model's capability. Curriculum learning has been widely used in the computer vision area [13,23,48]. With the continuous expansion of transformer in computer vision, we propose to use a curriculum learning strategy aiming at improving our Dual-GCN model's performance on image captioning task.…”

Section: Related Work (mentioning)

confidence: 99%

Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning

Dong,

Long,

Xu

et al. 2021

Self Cite

Existing image captioning methods focus only on understanding the relationship between objects or instances in a single image, without exploring the contextual correlation that exists among contextual images. In this paper, we propose Dual Graph Convolutional Networks (Dual-GCN) with transformer and curriculum learning for image captioning. In particular, we not only use an object-level GCN to capture the object-to-object spatial relation within a single image, but also adopt an image-level GCN to capture the feature information provided by similar images. With the well-designed Dual-GCN, we can make the linguistic transformer better understand the relationship between different objects in a single image and make full use of similar images as auxiliary information to generate a reasonable caption description for a single image. Meanwhile, with a cross-review strategy introduced to determine difficulty levels, we adopt curriculum learning as the training strategy to increase the robustness and generalization of our proposed model. We conduct extensive experiments on the large-scale MS COCO dataset, and the experimental results demonstrate that our proposed method outperforms recent state-of-the-art approaches. It achieves a BLEU-1 score of 82.2 and a BLEU-2 score of 67.6. Our source code is available at https://github.com/Unbear430/DGCN-for-image-captioning.

CCS Concepts: Computing methodologies → Natural language processing; Scene understanding.