Proceedings of the 27th ACM International Conference on Multimedia, 2019
DOI: 10.1145/3343031.3351056

Unsupervised Video Summarization with Attentive Conditional Generative Adversarial Networks

Abstract: With the rapid growth of video data, video summarization techniques play a key role in reducing the effort required to explore video content by generating concise but informative summaries. Although supervised video summarization approaches have been well studied and achieve state-of-the-art performance, unsupervised methods are still in high demand due to the intrinsic difficulty of obtaining high-quality annotations. In this paper, we propose a novel yet simple unsupervised video summarization method wi…

Cited by 73 publications (56 citation statements)
References 37 publications (57 reference statements)
“…[7] introduces a variation of [6] that replaces the VAE with an Attention Auto-Encoder for learning an attention-driven reconstruction of the original video, which subsequently improves the key-fragment selection process. Similarly, [40] presents a self-attention-based conditional GAN that simultaneously minimizes the distance between the generated and raw frame features and focuses on the most important fragments of the video. Finally, [41] learns video summarization from unpaired data based on an adversarial process and an FCSN, and defines a mapping function from a raw video to a human-like summary.…”
Section: B. Unsupervised Video Summarization
Mentioning, confidence: 99%
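To make the objective attributed to [40] concrete, below is a minimal PyTorch sketch of an attentive conditional-GAN summarizer: a self-attention generator scores frames and reconstructs the video from the attention-weighted features, while a discriminator distinguishes reconstructions from raw feature sequences; the generator loss combines the adversarial term with the generated-vs-raw feature distance and a sparsity penalty. All module names, dimensions, and loss weightings here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentiveGenerator(nn.Module):
    """Self-attention over frame features yields importance scores;
    an LSTM decoder reconstructs the video from the weighted frames."""
    def __init__(self, feat_dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.score = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.decoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, x):                       # x: (B, T, feat_dim) CNN features
        ctx, _ = self.attn(x, x, x)             # self-attention over all frames
        scores = self.score(ctx)                # (B, T, 1) frame importance
        recon, _ = self.decoder(scores * x)     # rebuild video from weighted frames
        return recon, scores.squeeze(-1)

class Discriminator(nn.Module):
    """Classifies a feature sequence as a raw video or a reconstruction."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.cls = nn.Linear(feat_dim, 1)

    def forward(self, x):
        _, (h, _) = self.rnn(x)
        return self.cls(h[-1])                  # one real/fake logit per video

def generator_loss(disc, raw, recon, scores, sigma=0.3):
    logit = disc(recon)
    adv = nn.functional.binary_cross_entropy_with_logits(
        logit, torch.ones_like(logit))          # fool the discriminator
    dist = nn.functional.mse_loss(recon, raw)   # generated vs. raw frame features
    sparsity = (scores.mean() - sigma).abs()    # keep the selected fraction small
    return adv + dist + sparsity
```

The discriminator would be trained with the usual real/fake objective on raw versus reconstructed sequences; the frame scores then serve as the importance signal for key-fragment selection.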
“…In addition to the findings reported above, we illustrate the quality of the summaries produced by the proposed AC-SUM-GAN method with an example. For this, we use video #15 of the TVSum dataset (titled "How to Clean Your Dog's Ears - Vetoquinol USA"), which is used for the same purpose in a few other SoA works (e.g., [4], [8], [9], [10], [39], [40]), and we compare the performance of the AC-SUM-GAN method against five other summarization methods with publicly-available implementations (to our knowledge, the only ones whose implementations are publicly available). Fig.…”
Section: Qualitative Analysis - A Summarization Example
Mentioning, confidence: 99%
“…On the other hand, [35], [36] use a combination of objectives such as interestingness, uniformity, and representativeness to identify the most appealing moments. Recent successes of Generative Adversarial Networks have led to several works based on unsupervised approaches for video summarization [37], [38]. Zhang et al. [31] were the first to use an LSTM for video summarization; their method is a bidirectional LSTM followed by a Multi-Layer Perceptron.…”
Section: Related Work
Mentioning, confidence: 99%
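The bidirectional-LSTM-plus-MLP frame scorer attributed to Zhang et al. [31] fits in a few lines of PyTorch; the layer sizes, feature dimension, and sigmoid output below are assumptions for illustration rather than the published configuration.

```python
import torch
import torch.nn as nn

class BiLSTMScorer(nn.Module):
    """Bidirectional LSTM over per-frame deep features, followed by an
    MLP that maps each frame to an importance score in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):               # x: (B, T, feat_dim)
        h, _ = self.lstm(x)             # (B, T, 2*hidden): both directions
        return self.mlp(h).squeeze(-1)  # (B, T) per-frame importance

# Score a batch of 5 videos, 120 frames each, with 1024-d features.
scorer = BiLSTMScorer()
scores = scorer(torch.randn(5, 120, 1024))   # shape (5, 120)
```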
“…The latter is estimated by quantifying the similarity between the original video and a version of it reconstructed from the set of selected key-frames. The general approach of using Generative Adversarial Networks to estimate this similarity was first proposed in [18] and has been further extended by several other SoA video summarization algorithms (e.g., [11,12,29]) as a means to assess the representativeness of a set of key-frames that will eventually be used to generate a static (a.k.a. video storyboard) or dynamic video summary (a.k.a.…”
Section: Overview
Mentioning, confidence: 99%
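The cited works learn this similarity adversarially; as a simple non-adversarial stand-in, the sketch below scores the representativeness of a key-frame set by reconstructing every frame from its most similar selected key-frame and averaging the cosine similarity. The function name and the nearest-key reconstruction rule are illustrative assumptions, not the method of [18].

```python
import torch
import torch.nn.functional as F

def reconstruction_similarity(video, keyframe_idx):
    """Reconstruct each frame by its most similar selected key-frame and
    return the mean cosine similarity to the original sequence. Higher
    values indicate a more representative key-frame set."""
    keys = video[keyframe_idx]                              # (K, D)
    sim = F.cosine_similarity(video.unsqueeze(1),           # (T, 1, D)
                              keys.unsqueeze(0), dim=-1)    # -> (T, K)
    recon = keys[sim.argmax(dim=1)]                         # nearest key per frame
    return F.cosine_similarity(video, recon, dim=-1).mean()

video = torch.randn(200, 1024)                   # T=200 frames of deep features
score = reconstruction_similarity(video, torch.tensor([10, 80, 150]))
```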