Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413908

Controllable Video Captioning with an Exemplar Sentence

Abstract: In this paper, we investigate a novel and challenging task, namely controllable video captioning with an exemplar sentence. Formally, given a video and a syntactically valid exemplar sentence, the task aims to generate one caption which not only describes the semantic contents of the video but also follows the syntactic form of the given exemplar sentence. To tackle this exemplar-based video captioning task, we propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder…
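To make the task interface concrete, the following is a minimal PyTorch sketch of an exemplar-conditioned caption decoder: a syntax feature extracted from the exemplar sentence rescales the decoder's hidden state at every step. The module names, dimensions, and the particular gate-style modulation are assumptions made for illustration, not the authors' released SMCG implementation.

    # Illustrative sketch only; all names, sizes, and the modulation scheme are assumptions.
    import torch
    import torch.nn as nn

    class ExemplarModulatedDecoder(nn.Module):
        """Decodes a caption from a video feature while a syntax feature
        extracted from the exemplar sentence rescales the hidden state."""
        def __init__(self, vocab_size, dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.cell = nn.LSTMCell(2 * dim, dim)   # input: word embedding + video context
            self.modulator = nn.Linear(dim, dim)    # exemplar syntax -> modulation gate
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, video_ctx, syntax_ctx, tokens):
            # video_ctx, syntax_ctx: (B, dim); tokens: (B, T) teacher-forced word ids
            B, T = tokens.shape
            h = video_ctx.new_zeros(B, self.cell.hidden_size)
            c = torch.zeros_like(h)
            gate = torch.sigmoid(self.modulator(syntax_ctx))   # per-dimension gate in (0, 1)
            logits = []
            for t in range(T):
                x = torch.cat([self.embed(tokens[:, t]), video_ctx], dim=-1)
                h, c = self.cell(x, (h, c))
                logits.append(self.out(gate * h))              # syntax-modulated readout
            return torch.stack(logits, dim=1)                  # (B, T, vocab_size)

    # Hypothetical usage: batch of 2 videos, 10-step captions, 1000-word vocabulary.
    dec = ExemplarModulatedDecoder(vocab_size=1000)
    logits = dec(torch.randn(2, 512), torch.randn(2, 512), torch.randint(0, 1000, (2, 10)))
    print(logits.shape)  # torch.Size([2, 10, 1000])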

Cited by 16 publications (6 citation statements)
References 45 publications
“…Convolutional Neural Networks for video understanding have been extensively studied and widely applied to video-text pre-training [36], cross-modal analysis [29], video detection [19], e-commerce [6-8], adversarial attack [4, 53-55], interactive search [37, 58], retrieval [17, 18, 26, 57, 62, 64, 66], hyperlinking [9, 20, 21, 39], and captioning [2, 47, 61] in the CNN era; we select and review representative 3D-CNNs as follows. C3D [48] is an early pure 3D-CNN built on a new 3D convolution operator, and it easily outperforms 2D counterparts on video tasks.…”
Section: Related Work
confidence: 99%
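For readers unfamiliar with the operator that C3D builds on, the snippet below shows a single 3D convolution sliding over time as well as space. The 16-frame 112×112 clip shape matches C3D's standard input; the layer width is an arbitrary choice for illustration.

    # A 3D convolution convolves over (frames, height, width) jointly.
    import torch
    import torch.nn as nn

    conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
    clip = torch.randn(1, 3, 16, 112, 112)   # (batch, RGB, frames, height, width)
    features = conv3d(clip)
    print(features.shape)                     # torch.Size([1, 64, 16, 112, 112])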
“…According to the requirement for parallel training samples, existing solutions can be divided into two types: models that use parallel stylized image-caption data [41, 11, 54, 1] and models that do not [22, 42]. Subsequently, the community gradually shifted its emphasis to controlling the described contents [16, 77, 27, 10, 78, 48, 35] or structures [20, 19, 75, 76], as well as to distinctive captioning [18, 60, 37, 36, 64], which aims to generate discriminative and unique captions for individual images. Unfortunately, due to the subjective nature of diverse and distinctive captions, effective evaluation remains an open problem, and several new metrics have been proposed, such as SPICE-U [67], CIDErBtw [64], self-CIDEr [66], word recall [58], and mBLEU [52].…”
Section: Related Work
confidence: 99%
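As a concrete reference point for the diversity metrics named above, here is an illustrative sketch of the mBLEU idea: each generated caption is scored with BLEU against the other captions produced for the same image, so a lower average indicates more diverse output. This helper is a simplified, assumption-laden sketch, not the metric's reference implementation.

    # Sketch of mBLEU-style diversity scoring; lower average BLEU = more diverse captions.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def mbleu(captions):
        smooth = SmoothingFunction().method1
        scores = []
        for i, cap in enumerate(captions):
            # Score each caption against all of the *other* generated captions.
            refs = [c.split() for j, c in enumerate(captions) if j != i]
            scores.append(sentence_bleu(refs, cap.split(), smoothing_function=smooth))
        return sum(scores) / len(scores)

    caps = ["a man rides a horse", "a person is riding a horse", "a dog runs on grass"]
    print(round(mbleu(caps), 3))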
“…Although retrieval-based methods can find human-like sentences whose semantics are similar to the video, generating an entirely correct description is challenging because the retrieval pool is limited. With the advent of the encoder-decoder framework, most current work studies how to better exploit visual features [35, 23, 1, 36, 22, 24] and designs elaborate models [34, 2, 28, 13] to generate sentences directly. However, the diversity and controllability of sentences generated this way remain unsatisfactory.…”
Section: Related Work
confidence: 99%
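For context on the encoder-decoder framework the statement above refers to, the following bare-bones sketch mean-pools per-frame CNN features into a video code and decodes words with a GRU. All names and sizes are illustrative assumptions rather than any cited paper's architecture.

    # Minimal encoder-decoder video captioner; purely illustrative.
    import torch
    import torch.nn as nn

    class VideoCaptioner(nn.Module):
        def __init__(self, vocab_size, feat_dim=2048, dim=512):
            super().__init__()
            self.enc = nn.Linear(feat_dim, dim)          # project per-frame features
            self.embed = nn.Embedding(vocab_size, dim)
            self.dec = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, frames, tokens):
            # frames: (B, N, feat_dim) per-frame CNN features; tokens: (B, T) word ids
            video = torch.tanh(self.enc(frames)).mean(dim=1)   # (B, dim) pooled video code
            h0 = video.unsqueeze(0)                            # seed decoder state with video
            out, _ = self.dec(self.embed(tokens), h0)
            return self.out(out)                               # (B, T, vocab_size)

    model = VideoCaptioner(vocab_size=1000)
    logits = model(torch.randn(2, 8, 2048), torch.randint(0, 1000, (2, 10)))
    print(logits.shape)  # torch.Size([2, 10, 1000])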
“…Video captioning is one of the most important vision-language tasks; it seeks to automatically describe what happens in a video according to its visual content. Recently, many promising methods [36, 22, 24, 34, 2] have been proposed to address this task. These methods mainly focus on learning spatial-temporal representations of videos to fully tap the visual information and on devising novel decoders that achieve visual-textual alignment or controllable decoding.…”
Section: Introduction
confidence: 99%