Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning

Rahman, Tanzila; Xu, Bicheng; Sigal, Leonid

doi:10.1109/iccv.2019.00900

Cited by 73 publications

(52 citation statements)

References 43 publications


“…Table 2 presents the results of baseline methods and HCN. There are several baseline methods: (1) WLT (Rahman et al, 2019), a weakly supervised method with multi-modal input; (2) multi-modal video event captioning (MDVC) (Iashin and Rahtu, 2020b), a transformer-based model with multi-modal inputs;…”

Section: Compare With State-of-the-art Results (mentioning)

confidence: 99%

“…Multi-modal Video Captioning. Video naturally has multi-modal inputs including visual, speech text, and audio. Previous works explore visual RGB, motion, optical flow features, audio features (Hori et al, 2017; Wang et al, 2018b; Rahman et al, 2019) as well as speech text features (Shi et al, 2019; Hessel et al, 2019; Iashin and Rahtu, 2020b) for captioning. According to the work in (Shi et al, 2019; Hessel et al, 2019; Iashin and Rahtu, 2020b), although the speech text is noisy and informal, it can still capture better semantic features and improve performance, especially for instructional videos.…”

Section: Related Work (mentioning)

confidence: 99%


Hierarchical Context-aware Network for Dense Video Event Captioning

Ji, Guo, Huang et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer


Dense video event captioning aims to generate a sequence of descriptive captions for each event in a long untrimmed video. Video-level context provides important information and helps the model generate consistent and less redundant captions between events. In this paper, we introduce a novel Hierarchical Context-aware Network for dense video event captioning (HCN) to capture context from various aspects. In detail, the model leverages local and global context with different mechanisms to jointly learn to generate coherent captions. The local context module performs full interaction between neighbor frames and the global context module selectively attends to previous or future events. According to our extensive experiments on both Youcook2 and Activitynet Captioning datasets, the video-level HCN model outperforms the event-level context-agnostic model by a large margin. The code is available at https://github.com/KirkGuo/HCN.


“…
Method          B-4    M
DCEV [24]       1.60   8.88
DVC [28]        1.71   9.31
Bi-SST [51]     -      10.89
HACA [52]       2.71   11.16
MWSDEC [38]     1.46   7.23
MDVC [19]       1.46   7.23
BMT [18]        1.99   10.90
MV-GPT (Ours)   6.84   12.31
Table 7. Comparison to SOTA on ActivityNet-Captions for video captioning with ground-truth action proposals.…”

Section: Methods B-4 M (mentioning)

confidence: 99%

End-to-end Generative Pretraining for Multimodal Video Captioning

Seo, Nagrani, Arnab et al. 2022

Preprint


[Figure 1 panels: (a) Multimodal video captioning; (b) Pretraining using a future utterance]

Figure 1. Generative pretraining for Multimodal Video Captioning. Multimodal Video Captioning takes visual frames and speech transcribed by ASR as inputs and predicts a caption. The example on the left (a) demonstrates that using both modalities jointly is beneficial to generate an accurate caption, i.e., red words are present in the visual input whereas blue words correspond to the concepts in the ASR. Our new multimodal video generative pretraining (MV-GPT) uses a future utterance in time from the video stream as a captioning target (b). This objective can be applied to unlabeled data (e.g., HowTo100M), which comes with ASR but no captions, and results in effective joint-pretraining for both a multimodal encoder and decoder.


“…Nabati [203] … Rahman [62] … Wu [219] … LSTM … hierarchy memory decoder [24,57,59] … BLEU [226], ROUGE-L [227],…”

Section: [title not recoverable] (unclassified)