2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00778

Rethinking the Evaluation of Video Summaries

Abstract: Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There exists a substantial interest in automatizing this process due to the rapid growth of the available material. The recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently the established evaluation protocol is to compare the generated summary with respect to a set of reference summaries provided by the dataset. In …
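As background, the established protocol mentioned in the abstract is usually implemented as a frame-level overlap measure: the generated summary and each reference summary are represented as binary frame-selection vectors and compared via precision, recall, and F1. The snippet below is a minimal sketch of that idea, not the benchmark code itself; the function name and toy data are illustrative, and both the average and the maximum over references are returned since both conventions appear in the literature.

```python
import numpy as np

def f1_against_references(pred, refs):
    """Frame-level F1 of a predicted summary against a set of reference summaries.

    pred : binary vector of length n_frames (1 = frame selected).
    refs : list of binary vectors, one per human reference summary.
    Returns (average F1, maximum F1) over the references.
    """
    pred = np.asarray(pred, dtype=bool)
    f1s = []
    for ref in refs:
        ref = np.asarray(ref, dtype=bool)
        overlap = np.logical_and(pred, ref).sum()
        precision = overlap / max(pred.sum(), 1)
        recall = overlap / max(ref.sum(), 1)
        f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
        f1s.append(f1)
    return float(np.mean(f1s)), float(np.max(f1s))

# Toy example: a 10-frame video with a 3-frame summary and two reference summaries.
pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
refs = [[1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]]
print(f1_against_references(pred, refs))
```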

Cited by 115 publications (92 citation statements)
References 24 publications
“…Our experiments on two benchmark datasets showed that our hierarchical structure achieved the best performance among all methods known to us. In particular, evaluation using the rank order statistics recently proposed in [13] clearly showed the superiority of our proposed method. Also, our proposal requires only a smaller number of task-level annotations to train the Manager.…”
Section: Results (mentioning)
confidence: 64%
“…However, our proposal was not capable of the transfer task because subgoals may vary considerably between domains. We then evaluate performance using the rank order statistics proposed in [13] and introduced in Sec. 4.2, which is claimed to be a better evaluation metric because it removes the effect of post-processing.…”
Section: Discussion (mentioning)
confidence: 99%
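For context, the rank order statistics evaluation proposed in [13] correlates the predicted frame-importance scores directly with each human annotator's scores, using Kendall's tau and Spearman's rho, so the result does not depend on how scores are post-processed into a keyshot summary. Below is a minimal sketch of such an evaluation using SciPy; the function name and toy data are illustrative assumptions, not code from the cited works.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def rank_correlation(pred_scores, human_scores):
    """Average rank correlation between predicted frame-importance scores
    and each human annotator's score vector for the same video."""
    taus, rhos = [], []
    for ann in human_scores:
        tau, _ = kendalltau(pred_scores, ann)
        rho, _ = spearmanr(pred_scores, ann)
        taus.append(tau)
        rhos.append(rho)
    return float(np.mean(taus)), float(np.mean(rhos))

# Toy example with 6 frames and two annotators.
pred = [0.9, 0.1, 0.4, 0.8, 0.2, 0.6]
anns = [[0.8, 0.2, 0.5, 0.9, 0.1, 0.7],
        [0.7, 0.3, 0.4, 0.6, 0.2, 0.9]]
print(rank_correlation(pred, anns))
```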
“…A related problem is the fact that current supervised techniques are trained using a 'combined' ground truth summary, either in the form of combined scores from multiple ground truth summaries [4,12,37] or in the form of a set of ground truth selections, as in dP-PLSTM [37]. However, since there can be multiple correct answers (a reason for the low consistency between user summaries [13,23]), combining them into one misses out on the separate flavors captured by each of them. Combining many into one set of scores also runs the risk of giving more emphasis to 'importance' over and above other desirable characteristics of a summary such as continuity, diversity, etc.…”
Section: Introduction (mentioning)
confidence: 99%
“…Evaluation: With a desire to be comparable across techniques, almost all recent work evaluates results using the F1 score [4,12,39]. This approach of assessing a candidate summary vis-à-vis a ground truth summary sounds good, but it has the following limitations: 1) The user summaries are themselves inconsistent with each other, as already noted above [13,23]. As a workaround, the assessment is done with respect to the nearest neighbor [10,30].…”
Section: Introduction (mentioning)
confidence: 99%