2020
DOI: 10.1016/j.neucom.2020.08.035
Video captioning with boundary-aware hierarchical language decoding and joint video prediction

Cited by 15 publications (3 citation statements)
References 23 publications
“…Hierarchical encoder structures are also proposed by [30] and [31] that give more attention to the temporal details of the video. Descriptions are made by employing attention in the decoder section [15], [16] as well as multimodal fusion mechanisms with aural features in the video [32]. A multimodal temporal attention mechanism incorporating image, motion, and audio features is given in [33].…”
Section: Literature Survey
Confidence: 99%
“…Later, attention mechanisms were included in the spatial as well as the temporal domain to achieve better performance [13], [14]. Video descriptions can also be generated by employing attention in the decoder section as well as using multimodal fusion mechanisms of visual, text, and audio features [15], [16].…”
Section: Introduction
Confidence: 99%
“…State-of-the-art and limitations: The pursuit of multimodal-input-based abstractive text summarization can be related to various other fields, such as image and video captioning [22,34,39,48,49], video story generation [16], video title generation [57], and multimodal sentence summarization [28]. However, these works generally produce summaries based on either images or short videos, and the target summaries are easier to predict due to the limited vocabulary diversity.…”
Section: Introduction
Confidence: 99%