Audiovisual Highlight Detection in Videos

Mundnich, Karel; Fenster, Alexandra; Khare, Aparna; Sundaram, Shiva

doi:10.1109/icassp39728.2021.9413394

Cited by 5 publications

(1 citation statement)

References 22 publications

“…We use the 2048 dimension scene ResNet50 representations pre-trained on the Places365 dataset [4]. Previous studies have shown the benefits of these representations [13,25]. These Places embeddings serve as input for the subsequent training process (keeping Module A frozen).…”

Section: Methods (mentioning)

confidence: 99%

Scene Representation Learning from Videos Using Self-Supervised and Weakly-Supervised Techniques

Peri; Parthasarathy; Sundaram

2022

2022 IEEE International Conference on Image Processing (ICIP)

Holistic understanding of videos requires the recognition of the overall scene beyond detecting foreground activity and objects. It provides valuable information for various video understanding tasks such as video summarization, scene change detection and content filtering. While significant effort has been put into developing models for scene classification in images (e.g. Places365), video-level scene recognition is relatively nascent. The scope of this paper is to address this problem of going from image representations to video for scene classification. In particular, we compare self-supervised deep learning methods on the video scene recognition task using the HVU dataset. Starting from strong image-level scene representations, with a triplet-based contrastive loss, we train a video-level scene classifier. We propose triplet sampling strategies that aid the self-supervision. We compare the self-supervised techniques against the image-level scene representations, as well as a weakly supervised classifier trained on image labels. We observe that the models learned using the self-supervised method outperform both baselines (with statistical significance), showing that we are able to retain the representative power of the video-level scene representations compared to a competitive image-level scene recognition model trained on Places365, while showing benefits over weakly supervised techniques.
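
The triplet-based contrastive loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean distance, the margin of 1.0, the function name triplet_margin_loss, and the random stand-in embeddings are all assumptions made for illustration; only the 2048-dimensional embedding size comes from the citation statement above. The paper's triplet sampling strategies are not reproduced here.

```python
import numpy as np


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge-style triplet loss over a batch of embeddings (one per row).

    Assumed form: max(d(a, p) - d(a, n) + margin, 0) with Euclidean d.
    The margin and distance are illustrative choices, not taken from the paper.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor-to-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor-to-negative distance
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


# Hand-checkable case: d_pos = 5, d_neg = 0, so the loss is 5 - 0 + 1 = 6.
toy_loss = triplet_margin_loss(
    np.array([[0.0, 0.0]]),
    np.array([[3.0, 4.0]]),
    np.array([[0.0, 0.0]]),
)

# Random stand-ins for 2048-dimensional scene embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim = 2048
anchor = rng.normal(size=(4, dim))
positive = anchor + 0.01 * rng.normal(size=(4, dim))  # close to the anchor
negative = rng.normal(size=(4, dim))                  # unrelated to the anchor

easy_loss = triplet_margin_loss(anchor, positive, negative)
hard_loss = triplet_margin_loss(anchor, negative, positive)  # roles swapped
```

With well-separated triplets the hinge term is inactive and the loss is zero; swapping the positive and negative roles makes it strictly positive, which is the signal such a loss would use to pull matching clips together and push mismatched ones apart.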
