2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2022
DOI: 10.1109/wacv51458.2022.00255
Variational Stacked Local Attention Networks for Diverse Video Captioning

Cited by 11 publications (6 citation statements) | References 27 publications
“…To evaluate the effectiveness of SCG-SP, we compare our model with SOTA methods for DivVC, i.e., Div-BS (Vijayakumar et al. 2018), SeqCVAE (Aneja et al. 2019), COSCVAE, DML (Chen, Deng, and Wu 2022), STR (Liu et al. 2022), and VSLAN (Deb et al. 2022). Note that Div-BS, SeqCVAE, COSCVAE, and DML are re-implemented based on corresponding DivIC methods.…”
Section: Performance Comparison With SOTA
Confidence: 99%
“…Mainstream diverse captioning methods can be classified into two categories: conditional variational encoder (CVAE) based methods (Aneja et al. 2019; Chen et al. 2019; Deb et al. 2022; Jain, Zhang, and Schwing 2017; Liu et al. 2022) and control-based methods. [Figure 1: Difference between (a) existing CVAE/control-based diverse captioning methods and (b) our proposed SCG-SP.] There is no direct interaction among generated captions in CVAE-based or control-based methods, where the loss is calculated with independent training samples.…”
Section: Introduction
Confidence: 99%
“…Sudhakaran, Escalera & Lanz (2021) draw design inspiration from the bilinear processing of Lin, RoyChowdhury & Maji (2015) and MCB to propose ‘Class Activation Pooling’ for video action recognition. Deb et al. (2022) use MLB to process video features for video captioning.…”
Section: Related Work
Confidence: 99%
“…Furthermore, despite BLP’s history of success in text-image fusion in VQA, it has not yet gained such notoriety in video-QA. Though BLP methods have continued to perform well on video tasks when fusing vision and non-textual features (Hu et al. 2021; Zhou et al. 2021; Pang et al. 2021; Xu et al. 2021; Deng et al. 2021; Wang, Bao & Xu 2021; Deb et al. 2022; Sudhakaran, Escalera & Lanz 2021), BLP has recently been overshadowed by other vision and textual feature fusion techniques in video-QA (Kim et al. 2019; Li et al. 2019; Gao et al. 2019; Liu et al. 2021; Liang et al. 2019). In this paper, we aim to add a new perspective to the empirical and motivational drift in BLP.…”
Section: Introduction
Confidence: 99%