2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022
DOI: 10.1109/wacv51458.2022.00255
Variational Stacked Local Attention Networks for Diverse Video Captioning

Cited by 11 publications (6 citation statements) | References 27 publications
“…To evaluate the effectiveness of SCG-SP, we compare our model with SOTA methods for DivVC, i.e., Div-BS (Vijayakumar et al. 2018), SeqCVAE (Aneja et al. 2019), COSCVAE, DML (Chen, Deng, and Wu 2022), STR (Liu et al. 2022), and VSLAN (Deb et al. 2022). Note that Div-BS, SeqCVAE, COSCVAE, and DML are re-implemented based on corresponding DivIC methods.…”
Section: Performance Comparison With SOTA
Confidence: 99%
“…Mainstream diverse captioning methods can be classified into two categories: conditional variational encoder (CVAE) based methods (Aneja et al. 2019; Chen et al. 2019; Deb et al. 2022; Jain, Zhang, and Schwing 2017; Liu et al. 2022) and control-based methods. [Figure 1: Difference between (a) existing CVAE/control-based diverse captioning methods and (b) our proposed SCG-SP.] There is no direct interaction among generated captions in CVAE-based or control-based methods, where the loss is calculated with independent training samples.…”
Section: Introduction
Confidence: 99%
“…Sudhakaran, Escalera & Lanz (2021) draw design inspiration from the bilinear processing of Lin, RoyChowdhury & Maji (2015) and MCB to propose ‘Class Activation Pooling’ for video action recognition. Deb et al. (2022) use MLB to process video features for video captioning.…”
Section: Related Work
Confidence: 99%
“…Furthermore, despite BLP’s history of success in text-image fusion in VQA, it has not yet gained such notoriety in video-QA. Though BLP methods have continued to perform well on video tasks when fusing vision and non-textual features (Hu et al. 2021; Zhou et al. 2021; Pang et al. 2021; Xu et al. 2021; Deng et al. 2021; Wang, Bao & Xu 2021; Deb et al. 2022; Sudhakaran, Escalera & Lanz 2021), BLP has recently been overshadowed by other vision and textual feature fusion techniques in video-QA (Kim et al. 2019; Li et al. 2019; Gao et al. 2019; Liu et al. 2021; Liang et al. 2019). In this paper, we aim to add a new perspective to the empirical and motivational drift in BLP.…”
Section: Introduction
Confidence: 99%