2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00900

Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning

Abstract: Multi-modal learning, particularly between imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take audio corresponding to video into account at all, or those that model the audiovisual correlations in service of sound or sound source localization. In this paper, we present evidence that audio signals …

Cited by 73 publications (52 citation statements)
References 43 publications
“…Table 2 presents the results of baseline methods and HCN. There are several baseline methods: (1) WLT (Rahman et al, 2019), a weakly supervised method with multi-modal input; (2) multi-modal video event captioning (MDVC) (Iashin and Rahtu, 2020b), a transformer-based model with multi-modal inputs;…”
Section: Compare With State-of-the-art Results
Citation type: mentioning
confidence: 99%
“…Multi-modal Video Captioning Video naturally has multi-modal inputs including visual, speech text, and audio. Previous works explore visual RGB, motion, optical flow features, audio features (Hori et al, 2017; Wang et al, 2018b; Rahman et al, 2019) as well as speech text features (Shi et al, 2019; Hessel et al, 2019; Iashin and Rahtu, 2020b) for captioning. According to the work in (Shi et al, 2019; Hessel et al, 2019; Iashin and Rahtu, 2020b), although the speech text is noisy and informal, it can still capture better semantic features and improve performance especially for instructional videos.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
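The related-work excerpt above describes feeding visual (RGB, motion, optical flow), audio, and speech-text features into a captioning model. The sketch below is only an illustration of that general idea, not code from any of the cited papers: the feature dimensions, the projection-plus-concatenation fusion, the two-layer decoder, and the omission of positional encodings are all assumptions made for brevity.

import torch
import torch.nn as nn

class MultiModalCaptioner(nn.Module):
    """Toy late-fusion captioner: project each modality, concatenate, decode."""
    def __init__(self, d_visual=2048, d_audio=128, d_speech=768,
                 d_model=512, vocab_size=10000):
        super().__init__()
        # Per-modality projections into a shared d_model space (dims are assumptions).
        self.proj_visual = nn.Linear(d_visual, d_model)
        self.proj_audio = nn.Linear(d_audio, d_model)
        self.proj_speech = nn.Linear(d_speech, d_model)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, visual, audio, speech, captions):
        # visual: (B, Tv, d_visual); audio: (B, Ta, d_audio); speech: (B, Ts, d_speech)
        # captions: (B, L) token ids. Positional encodings are omitted for brevity.
        memory = torch.cat([self.proj_visual(visual),
                            self.proj_audio(audio),
                            self.proj_speech(speech)], dim=1)   # (B, Tv+Ta+Ts, d_model)
        L = captions.size(1)
        causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
        hidden = self.decoder(self.word_embed(captions), memory, tgt_mask=causal)
        return self.out(hidden)                                 # (B, L, vocab_size) logits

# Random tensors stand in for extracted RGB/audio/ASR features.
model = MultiModalCaptioner()
logits = model(torch.randn(2, 16, 2048), torch.randn(2, 32, 128),
               torch.randn(2, 20, 768), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])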
“…Method         B-4    M
DCEV [24]        1.60   8.88
DVC [28]         1.71   9.31
Bi-SST [51]      -      10.89
HACA [52]        2.71   11.16
MWSDEC [38]      1.46   7.23
MDVC [19]        1.46   7.23
BMT [18]         1.99   10.90
MV-GPT (Ours)    6.84   12.31
Table 7. Comparison to SOTA on ActivityNet-Captions for video captioning with ground-truth action proposals.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Nabati [203] … . Rahman [62] …, Wu [219] … an LSTM-based … model, … a hierarchical memory decoder (hierarchy memory decoder) [24,57,59]. … evaluation metrics such as BLEU [226], ROUGE-L [227], …”
Citation type: unclassified