2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00957
Dual Encoding for Zero-Example Video Retrieval

Abstract: This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associati…
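The retrieval paradigm the abstract describes boils down to embedding the text query and every candidate video into one common space and ranking videos by similarity. As a rough sketch only (not the paper's actual dual-encoding model), with toy hand-written embeddings standing in for the encoder outputs:

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy common-space embeddings; in practice these would come from the
# trained text and video encoders.
query_emb = [0.2, 0.9, 0.1]
video_embs = [
    [0.1, 0.8, 0.2],  # video 0
    [0.9, 0.1, 0.0],  # video 1
    [0.3, 0.7, 0.3],  # video 2
]

scores = [cosine(query_emb, v) for v in video_embs]
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])  # best first
```

Zero-example retrieval then returns videos in `ranking` order; no visual example of the target is ever needed at query time.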

Cited by 249 publications (208 citation statements)
References 25 publications
“…These learning-to-rank approaches have been generalised to two or more modalities. Standard examples include building a joint embedding for images and text [11,36], videos and audio [33] and, more related to our work, for videos and action labels [15], videos and text [8,14,40] or some of those combined [25,24,22]. Representing text.…”
Section: Related Work
Confidence: 99%
“…Early works in image-to-text crossmodal retrieval [9,11,36] used TF-IDF as a weighted bagof-words model for text representations (either from a word embedding model or one-hot vectors) in order to aggregate variable length text captions into a single fixed sized representation. With the advent of neural networks, works shifted to use RNNs, Gated Recurrent Units (GRU) or Long Short-Term Memory (LSTM) units to extract textual features [8] or to use these models within the embedding network [15,18,24,25,34] for both modalities. Action embedding and retrieval.…”
Section: Related Work
Confidence: 99%
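The excerpt above mentions TF-IDF as a weighted bag-of-words that aggregates variable-length captions into a single fixed-size representation. A minimal sketch of that idea, assuming a toy caption corpus (the captions and helper names here are illustrative, not from any cited work):

```python
import math
from collections import Counter

captions = [
    "a dog runs on the beach",
    "a man plays the guitar",
    "a dog plays with a ball",
]

# Document frequency: in how many captions each word appears.
df = Counter()
for cap in captions:
    df.update(set(cap.split()))

def tfidf(caption, n_docs):
    # Weighted bag-of-words: term frequency times inverse document
    # frequency, giving rare words more weight than common ones.
    tf = Counter(caption.split())
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

vec = tfidf(captions[0], len(captions))
```

Note how a word occurring in every caption ("a") gets weight zero, while a caption-specific word ("beach") outweighs a moderately common one ("dog") — exactly the property that makes TF-IDF a useful fixed-size text feature before neural encoders took over.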
“…To retrieve a video with natural language queries, the main challenge is the gap between two different modalities. Visual Semantic Embedding (VSE) [9,7], a widely adopted approach in video retrieval [38,18,37,6,35], tries to tackle this problem by embedding multi-modal information into a common space. JSF proposed in [37] learns matching kernels based on feature sequence fusion.…”
Section: Related Work
Confidence: 99%
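VSE-style joint embeddings are typically trained with a margin-based ranking loss that pulls a video toward its matching caption and pushes it away from non-matching ones in the shared space. A minimal sketch of that loss, assuming toy 2-D embeddings (the vectors and the 0.2 margin are illustrative choices, not values from the cited works):

```python
import math

def cos(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def triplet_margin_loss(v, t_pos, t_neg, margin=0.2):
    # Hinge ranking loss: the matching caption should score higher
    # than a non-matching one by at least the margin; otherwise the
    # violation contributes to the loss.
    return max(0.0, margin - cos(v, t_pos) + cos(v, t_neg))

video = [1.0, 0.0]
caption_pos = [1.0, 0.1]   # matching caption embedding
caption_neg = [0.0, 1.0]   # clearly non-matching caption embedding

loss = triplet_margin_loss(video, caption_pos, caption_neg)
```

An easy negative like this yields zero loss, while a harder negative (e.g. `[1.0, 0.2]`, nearly parallel to the video) violates the margin and produces a positive loss — which is why hard-negative mining matters when training such embeddings.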