2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00957
Dual Encoding for Zero-Example Video Retrieval

Abstract: This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associati…
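The retrieval paradigm the abstract describes boils down to embedding the text query and every candidate video into one common space and ranking videos by similarity. As a rough sketch only (not the paper's actual dual-encoding model), with toy hand-written embeddings standing in for the encoder outputs:

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy common-space embeddings; in practice these would come from the
# trained text and video encoders.
query_emb = [0.2, 0.9, 0.1]
video_embs = [
    [0.1, 0.8, 0.2],  # video 0
    [0.9, 0.1, 0.0],  # video 1
    [0.3, 0.7, 0.3],  # video 2
]

scores = [cosine(query_emb, v) for v in video_embs]
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])  # best first
```

Zero-example retrieval then returns videos in `ranking` order; no visual example of the target is ever needed at query time.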

Cited by 249 publications (208 citation statements)
References 25 publications
“…These learning-to-rank approaches have been generalised to two or more modalities. Standard examples include building a joint embedding for images and text [11,36], videos and audio [33] and, more related to our work, for videos and action labels [15], videos and text [8,14,40] or some of those combined [25,24,22]. Representing text.…”
Section: Related Work
Confidence: 99%
“…Early works in image-to-text crossmodal retrieval [9,11,36] used TF-IDF as a weighted bagof-words model for text representations (either from a word embedding model or one-hot vectors) in order to aggregate variable length text captions into a single fixed sized representation. With the advent of neural networks, works shifted to use RNNs, Gated Recurrent Units (GRU) or Long Short-Term Memory (LSTM) units to extract textual features [8] or to use these models within the embedding network [15,18,24,25,34] for both modalities. Action embedding and retrieval.…”
Section: Related Work
Confidence: 99%
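The excerpt above mentions TF-IDF as a weighted bag-of-words that aggregates variable-length captions into a single fixed-size representation. A minimal sketch of that idea, assuming a toy caption corpus (the captions and helper names here are illustrative, not from any cited work):

```python
import math
from collections import Counter

captions = [
    "a dog runs on the beach",
    "a man plays the guitar",
    "a dog plays with a ball",
]

# Document frequency: in how many captions each word appears.
df = Counter()
for cap in captions:
    df.update(set(cap.split()))

def tfidf(caption, n_docs):
    # Weighted bag-of-words: term frequency times inverse document
    # frequency, giving rare words more weight than common ones.
    tf = Counter(caption.split())
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

vec = tfidf(captions[0], len(captions))
```

Note how a word occurring in every caption ("a") gets weight zero, while a caption-specific word ("beach") outweighs a moderately common one ("dog") — exactly the property that makes TF-IDF a useful fixed-size text feature before neural encoders took over.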
“…To retrieve a video with natural language queries, the main challenge is the gap between two different modalities. Visual Semantic Embedding (VSE) [9,7], a widely adopted approach in video retrieval [38,18,37,6,35], tries to tackle this problem by embedding multi-modal information into a common space. JSF proposed in [37] learns matching kernels based on feature sequence fusion.…”
Section: Related Work
Confidence: 99%
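VSE-style joint embeddings are typically trained with a margin-based ranking loss that pulls a video toward its matching caption and pushes it away from non-matching ones in the shared space. A minimal sketch of that loss, assuming toy 2-D embeddings (the vectors and the 0.2 margin are illustrative choices, not values from the cited works):

```python
import math

def cos(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def triplet_margin_loss(v, t_pos, t_neg, margin=0.2):
    # Hinge ranking loss: the matching caption should score higher
    # than a non-matching one by at least the margin; otherwise the
    # violation contributes to the loss.
    return max(0.0, margin - cos(v, t_pos) + cos(v, t_neg))

video = [1.0, 0.0]
caption_pos = [1.0, 0.1]   # matching caption embedding
caption_neg = [0.0, 1.0]   # clearly non-matching caption embedding

loss = triplet_margin_loss(video, caption_pos, caption_neg)
```

An easy negative like this yields zero loss, while a harder negative (e.g. `[1.0, 0.2]`, nearly parallel to the video) violates the margin and produces a positive loss — which is why hard-negative mining matters when training such embeddings.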