2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2019.00173
Attentive and Adversarial Learning for Video Summarization

Cited by 63 publications (55 citation statements)
References 21 publications
“…Experiments on three public datasets SumMe, TVSum and YouTube demonstrate the effectiveness of our proposed framework. In future work, we will continue to investigate this line of research by utilizing reinforcement learning algorithm (Fu et al, 2019), attention mechanism (Ji et al, 2019) and multi-stage learning (Huang et al, 2019) within the DTR-GAN framework to further improve generic video summarization.…”
Section: Results
confidence: 99%
“…[13] formulates video summarization as a sequence-to-sequence learning problem and proposes an LSTM-based encoder-decoder network with an intermediate attention layer. In [9], the typical encoder-decoder seq2seq model is replaced by a special attention-based seq2seq model that defines and ranks the different fragments of the video, and is combined with a 3D-CNN classifier which judges whether a fragment is from a ground-truth or a generated summary. [8] introduces an architecture with memory augmented networks for global attention modeling, and tackles video summarization by estimating the temporal dependency across the entire video.…”
Section: Related Work
confidence: 99%
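To make the encoder-decoder approach described in the excerpt above more concrete, here is a minimal, hypothetical PyTorch sketch of an attention-based frame scorer: a bidirectional LSTM encodes pre-extracted per-frame CNN features, additive attention produces a global context vector, and each frame's importance score is predicted from its encoder state combined with that context. Layer sizes, the feature dimension, and the module name are illustrative assumptions, not the exact architecture of any cited paper.

```python
# A minimal, illustrative sketch (not the cited papers' exact models) of an
# attention-based seq2seq frame scorer for video summarization. Layer sizes
# and the feature dimension are hypothetical choices.
import torch
import torch.nn as nn


class AttentiveFrameScorer(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        # Bidirectional LSTM encoder over per-frame CNN features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Additive (Bahdanau-style) attention over encoder states.
        self.attn_proj = nn.Linear(2 * hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        # Per-frame importance head over [encoder state; global context].
        self.out = nn.Linear(4 * hidden_dim, 1)

    def forward(self, frame_feats):                       # (B, T, feat_dim)
        enc, _ = self.encoder(frame_feats)                # (B, T, 2H)
        attn = torch.softmax(
            self.attn_score(torch.tanh(self.attn_proj(enc))), dim=1)  # (B, T, 1)
        context = (attn * enc).sum(dim=1, keepdim=True)   # (B, 1, 2H)
        fused = torch.cat([enc, context.expand_as(enc)], dim=-1)
        scores = torch.sigmoid(self.out(fused)).squeeze(-1)           # (B, T)
        return scores, attn.squeeze(-1)


if __name__ == "__main__":
    feats = torch.randn(2, 120, 1024)          # e.g. pre-extracted CNN features
    scores, attn = AttentiveFrameScorer()(feats)
    print(scores.shape, attn.shape)            # (2, 120) each
```

The per-frame scores can then be used either for supervised training against ground-truth importance annotations or, as in the unsupervised setting discussed next, to weight frames fed into an adversarial reconstruction objective.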
“…The contributions of our work are: i) the introduction of an attention mechanism in an unsupervised learning framework, whereas all previous attention-based summarization methods ([7-9, 13]) were supervised; ii) the investigation of integrating attention into a variational auto-encoder for video summarization purposes; and iii) the use of attention to guide the generative adversarial training of the model, rather than using it to rank the video fragments as in [9].…”
Section: Related Work
confidence: 99%
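As a rough illustration of the third contribution quoted above (attention guiding adversarial training rather than ranking fragments), the sketch below pairs an LSTM autoencoder, which reconstructs the video from attention-weighted frame features, with a discriminator trained to distinguish original from reconstructed sequences. The attention weights are assumed to come from a separate scorer such as the one sketched earlier; the variational component is omitted, and all module names, sizes, and losses are hypothetical simplifications, not the authors' exact design.

```python
# A rough, hypothetical sketch of attention-guided adversarial training for
# unsupervised summarization: an autoencoder ("generator") reconstructs the
# video from attention-weighted frame features, while a discriminator learns
# to tell original from reconstructed sequences. Illustration of the general
# idea only, not the authors' exact architecture or losses.
import torch
import torch.nn as nn


class Reconstructor(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.enc = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.dec = nn.LSTM(hidden_dim, feat_dim, batch_first=True)

    def forward(self, weighted_feats):
        h, _ = self.enc(weighted_feats)
        recon, _ = self.dec(h)
        return recon                                    # (B, T, feat_dim)


class Discriminator(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, feats):
        _, (h, _) = self.lstm(feats)
        return torch.sigmoid(self.fc(h[-1]))            # probability "original"


def adversarial_step(frame_feats, attn_weights, G, D, bce=nn.BCELoss()):
    """attn_weights: (B, T) frame scores from a separate attention module."""
    batch = frame_feats.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)
    # Generator tries to make the reconstruction look like the original video.
    recon = G(attn_weights.unsqueeze(-1) * frame_feats)
    g_loss = bce(D(recon), real)
    # Discriminator separates originals from (detached) reconstructions.
    d_loss = bce(D(frame_feats), real) + bce(D(recon.detach()), fake)
    return g_loss, d_loss
```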
“…Hyperlapse techniques sample frames adaptively by searching the optimal configuration (e.g., shortest path in a graph or dynamic programming) in a representation space where different features are combined to represent frames or frame transitions. Although recent works achieved better results applying a large number of features to represent the data [31]- [33], it increases both the computation time and memory usage since it leads to a high-dimensional space in optimization problems. We address this representation problem using a sparse frame sampling approach as depicted in Fig.…”
Section: B. Weighted Sparse Frame Sampling
confidence: 99%
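The shortest-path / dynamic-programming formulation mentioned in this excerpt can be sketched as follows: frames are graph nodes, edges connect a frame to the next few frames up to a maximum skip, and the edge cost combines a frame-transition cost with a penalty for deviating from a target speed-up. The transition-cost matrix, skip limits, and weighting below are placeholder assumptions; real hyperlapse methods combine several features per transition, which is exactly the high-dimensionality issue the quoted work addresses with sparse sampling.

```python
# A simplified, hypothetical sketch of shortest-path (dynamic-programming)
# adaptive frame sampling for hyperlapse. The transition-cost matrix and the
# weighting are placeholders, not any specific published method.
import numpy as np


def sample_frames(transition_cost, target_skip=8, max_skip=16, lam=0.5):
    """transition_cost[i, j]: cost of jumping from frame i to frame j (i < j)."""
    n = transition_cost.shape[0]
    best = np.full(n, np.inf)          # best[j]: cheapest path cost ending at j
    prev = np.full(n, -1, dtype=int)   # prev[j]: predecessor frame on that path
    best[0] = 0.0
    for i in range(n):
        if not np.isfinite(best[i]):
            continue
        for j in range(i + 1, min(i + max_skip, n - 1) + 1):
            cost = best[i] + transition_cost[i, j] \
                 + lam * (j - i - target_skip) ** 2   # speed-up deviation penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Backtrack from the last frame to recover the selected frame indices.
    path, k = [], n - 1
    while k != -1:
        path.append(k)
        k = prev[k]
    return path[::-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    costs = rng.random((200, 200))     # stand-in for a learned transition cost
    print(sample_frames(costs)[:10])
```

The quadratic penalty on (j - i - target_skip) is one simple way to bias the path toward the desired speed-up; sparse representations reduce the cost of evaluating transition_cost when many features are combined.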