2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.662

Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description

Cited by 65 publications (34 citation statements) | References 24 publications
“…Video captioning is a widely studied problem in computer vision [22,23,24,25]. Most approaches use a CNN pre-trained on image classification or action recognition to generate features [25,24,23]. These methods, like the video understanding methods described above, utilize a frame-based feature aggregation (e.g.…”
Section: Related Work
confidence: 99%
“…The substantial difference between our model and the others assessed confirms that EtENet-IRv2 succeeds in achieving excellent results without requiring an overly complex structure, e.g., the addition of new layers as in RecNet (row 11, Table 2), or the adoption of new learning mechanisms such as reinforcement learning as in PickNet (row 3, Table 3). Moreover, this shows that it is possible to obtain excellent results even when using roughly half the frames used in other competing approaches [36,33,38,30]. Our framework sets a new standard in terms of top performances in video captioning and, we believe, can much contribute to further progress in the field.…”
Section: Discussion
confidence: 79%
“…Additionally, this is done without resorting to fancy 3D CNN architectures, thus leaving huge scope for further improvements. Moreover, unlike [38,30,15,36] which all use more than 25 frames per video clip, our model only uses 16 frames, a significant contribution in terms of memory and computational cost.…”
Section: Discussion
confidence: 99%
“…MSR-VTT [57] is a recently released dataset. We compare performance of our approach on this dataset with the latest published models such as Alto [42], RUC-UVA [15], TDDF [61], PickNet [13], M3-VC [54] and RecNet local [52]. The results are summarized in Table 4.…”
Section: Results on MSR-VTT Dataset
confidence: 99%