2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.347

End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering

Abstract: We propose a high-level concept word detector that can be integrated with any video-to-language model. It takes a video as input and generates a list of concept words as useful semantic priors for language generation models. The proposed word detector has two important properties. First, it does not require any external knowledge sources for training. Second, it is trainable in an end-to-end manner jointly with any video-to-language model. To effectively exploit the detected words, we…
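The abstract describes the detector only at a high level. As a rough illustration of the idea (not the paper's actual architecture), the sketch below assumes per-frame CNN features, a single-layer LSTM, and a multi-label classification head over a fixed concept vocabulary; the names ConceptWordDetector, feat_dim, and vocab_size are hypothetical.

import torch
import torch.nn as nn

class ConceptWordDetector(nn.Module):
    """Hypothetical concept word detector head: video frames in, word scores out."""
    def __init__(self, feat_dim: int = 2048, vocab_size: int = 300):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, 512, batch_first=True)
        self.classifier = nn.Linear(512, vocab_size)  # one logit per concept word

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim) per-frame CNN features
        _, (h, _) = self.rnn(frames)
        return self.classifier(h[-1])  # (batch, vocab_size) multi-label logits

detector = ConceptWordDetector()
video = torch.randn(2, 40, 2048)           # 2 clips, 40 frames each (toy input)
probs = torch.sigmoid(detector(video))     # multi-label word probabilities
top_words = probs.topk(k=10, dim=-1).indices  # detected concept words as priors
# End-to-end joint training, as the abstract claims, would add a multi-label
# loss (e.g. BCEWithLogitsLoss) on these logits to the downstream language
# model's loss and backpropagate through both.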

Cited by 202 publications (136 citation statements) | References 29 publications
“…against prior work that directly uses MSR-VTT for training (reproduced in [63]) in Table 6. Our off-the-shelf HowTo100M model outperforms [22,24,53,64,65] that are directly trained on MSR-VTT. Here again, after fine-tuning the HowTo100M pre-trained model on MSR-VTT, we observe a significant improvement over the state-of-the-art JSFusion [63] trained on MSR-VTT.…”
Section: Comparison With State-of-the-art (mentioning)
Confidence: 96%
“…VideoQA is considered to be a challenging problem as reasoning on video clip usually requires memorizing contextual information in temporal scale. Many models have been proposed to tackle this problem [5, 10, 27, 30-32]. Many work [5, 10, 30] utilized both motion (i.e.…”
Section: Related Work (mentioning)
Confidence: 99%
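For readers unfamiliar with the two-stream setup this quote alludes to, here is a minimal hedged sketch: appearance and motion features are fused per frame and summarized by a recurrent layer so temporal context is retained for the QA head. All names and dimensions (TwoStreamVideoEncoder, app_dim, mot_dim) are illustrative assumptions, not taken from any cited model.

import torch
import torch.nn as nn

class TwoStreamVideoEncoder(nn.Module):
    """Illustrative fusion of appearance and motion streams for VideoQA."""
    def __init__(self, app_dim: int = 2048, mot_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.temporal = nn.GRU(app_dim + mot_dim, hidden, batch_first=True)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # appearance: (B, T, app_dim), motion: (B, T, mot_dim), aligned per frame
        x = torch.cat([appearance, motion], dim=-1)
        _, h = self.temporal(x)    # final state carries the temporal context
        return h.squeeze(0)        # (B, hidden) clip representation for a QA head

enc = TwoStreamVideoEncoder()
clip_repr = enc(torch.randn(4, 30, 2048), torch.randn(4, 30, 1024))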
“…Recursive neural networks are investigated in [34] for vectorizing subject-verb-object triplets extracted from a given sentence. Variants of recurrent neural networks are being exploited, see the usage of LSTM, bidirectional LSTM, and Gated Recurrent Unit (GRU) in [37], [36], and [24], respectively. To the best of our knowledge, [7] is the only work looking to a joint use of multiple sentence encoding strategies including bag-of-words, word2vec and GRU.…”
Section: Related Work (mentioning)
Confidence: 99%
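To make the "joint use of multiple sentence encoding strategies" concrete, the following sketch concatenates a binary bag-of-words vector with the final state of a GRU run over learned word embeddings (the embeddings standing in for the word2vec component). This is an illustrative assumption about how such a combination could look, not the actual code of [7].

import torch
import torch.nn as nn

class MultiEncodingSentence(nn.Module):
    """Combines bag-of-words and GRU encodings of one sentence (illustrative)."""
    def __init__(self, vocab_size: int = 10000, emb_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.vocab_size = vocab_size

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, L) integer word indices
        bow = torch.zeros(token_ids.size(0), self.vocab_size,
                          device=token_ids.device)
        bow.scatter_(1, token_ids, 1.0)    # binary bag-of-words vector
        emb = self.embed(token_ids)        # (B, L, emb_dim) word embeddings
        _, h = self.gru(emb)               # (1, B, hidden) sequential encoding
        return torch.cat([bow, h.squeeze(0)], dim=-1)  # joint representation

enc = MultiEncodingSentence()
joint = enc(torch.randint(0, 10000, (2, 12)))  # -> shape (2, 10000 + 512)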