Proceedings of the 2017 ACM International Conference on Multimedia Retrieval
DOI: 10.1145/3078971.3079041

Query and Keyframe Representations for Ad-hoc Video Search

Abstract: This paper presents a fully-automatic method that combines video concept detection and textual query analysis in order to solve the problem of ad-hoc video search. We present a set of NLP steps that cleverly analyse different parts of the query in order to convert it to related semantic concepts, we propose a new method for transforming concept-based keyframe and query representations into a common semantic embedding space, and we show that our proposed combination of concept-based representations with their c…

Cited by 47 publications (30 citation statements)
References 11 publications
“…Concept-based methods [18,24,25,31,41] mainly rely on establishing cross-modal associations via concepts [12]. Markatopoulou et al. [24,25] first utilized relatively complex linguistic rules to extract relevant concepts from a given query and used pre-trained CNNs to detect the objects and scenes in video frames. Then the similarity between a given query and a specific video is measured by concept matching.…”
Section: Related Work
confidence: 99%
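The concept-matching step this citing paper describes can be sketched as a cosine similarity between the set of concepts extracted from the query and a keyframe's concept-detector scores. This is an illustrative sketch, not the paper's implementation; the function name and all concept labels and scores are invented:

```python
import math

def concept_match_score(query_concepts, frame_scores):
    """Cosine similarity between a binary query-concept vector and
    a keyframe's concept-detector scores (illustrative sketch)."""
    # query_concepts: set of concept names extracted from the query
    # frame_scores: dict mapping concept name -> detector confidence
    dot = sum(frame_scores.get(c, 0.0) for c in query_concepts)
    q_norm = math.sqrt(len(query_concepts))      # binary query vector
    f_norm = math.sqrt(sum(v * v for v in frame_scores.values()))
    if q_norm == 0 or f_norm == 0:
        return 0.0
    return dot / (q_norm * f_norm)

# Toy detector output for one keyframe.
scores = {"dog": 0.9, "outdoor": 0.8, "car": 0.1}
print(round(concept_match_score({"dog", "outdoor"}, scores), 3))
```

A keyframe whose high-confidence detections overlap the query's concepts scores close to 1; a keyframe with no overlap scores 0.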
“…Existing efforts on video retrieval with complex queries can be roughly categorized into two groups: 1) Concept-based paradigm [18,24,25,31,41,52,53], as shown in Figure 1 (a). It usually uses a large set of visual concepts to describe the video content, then transforms the text query into a set of primitive concepts, and finally performs video retrieval by aggregating the matching results from different concepts [53].…”
Section: Introduction
confidence: 99%
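The concept-based paradigm summarized above can be illustrated with a minimal ranking sketch, where each video is scored by aggregating (here, summing) its detector confidences over the query's concepts. All video IDs, concept names, and scores are hypothetical:

```python
def rank_videos(query_concepts, videos):
    """Rank (id, concept_scores) pairs by aggregating per-concept
    detector confidences over the query's concepts (sketch of the
    concept-based retrieval paradigm)."""
    def score(concept_scores):
        return sum(concept_scores.get(c, 0.0) for c in query_concepts)
    return sorted(videos, key=lambda v: score(v[1]), reverse=True)

# Toy collection: two videos with per-concept detector scores.
videos = [
    ("v1", {"beach": 0.2, "person": 0.9}),
    ("v2", {"beach": 0.8, "person": 0.7}),
]
print([vid for vid, _ in rank_videos({"beach", "person"}, videos)])
# v2 aggregates 1.5, v1 aggregates 1.1, so v2 ranks first
```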
“…The video is decomposed into elementary temporal segments (shots) with the method of Apostolidis et al. 13 Then, each shot is annotated with high-level visual concepts coming from the same pre-specified concept pool used for describing the lecture videos. This pool comprises the 346 concepts defined in the TRECVID SIN task (as in Markatopoulou et al. 14 ), but is easily extendible to additional concepts for which training data are available (e.g., ImageNet). We use state-of-the-art deep-learning techniques such as Deep Convolutional Neural Network (DCNN) architectures.…”
Section: Video Processing
confidence: 99%
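The shot-annotation step can be sketched as averaging per-frame concept scores within each shot and keeping the top-scoring concepts. The `detector` callable below is a toy stand-in for the DCNN concept detectors, and the concept names are invented:

```python
def annotate_shots(shot_frames, detector, concept_pool, top_k=3):
    """Annotate each shot with its top-k concepts by averaging the
    per-frame detector scores (illustrative sketch; `detector` stands
    in for a DCNN scoring a frame against the concept pool)."""
    annotations = []
    for frames in shot_frames:
        avg = {c: sum(detector(f)[c] for f in frames) / len(frames)
               for c in concept_pool}
        top = sorted(avg, key=avg.get, reverse=True)[:top_k]
        annotations.append(top)
    return annotations

POOL = ["classroom", "whiteboard", "beach"]

def toy_detector(frame):
    # In this toy example a "frame" is just its precomputed score dict.
    return frame

shots = [[{"classroom": 0.9, "whiteboard": 0.7, "beach": 0.1},
          {"classroom": 0.8, "whiteboard": 0.9, "beach": 0.0}]]
print(annotate_shots(shots, toy_detector, POOL, top_k=2))
```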
“…This means finding which non-lecture videos are most closely related to a given lecture video. This is realized in a direct analogy to how Markatopoulou et al. 14 use semantic word embeddings to match the concept-based representations of textual queries and videos for performing video retrieval.…”
Section: Video Processing
confidence: 99%
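Matching concept-based representations via semantic word embeddings can be sketched by embedding each side as the mean vector of its concept labels and comparing with cosine similarity. The 3-d vectors below are toy stand-ins for trained embeddings (e.g., word2vec), not real model output:

```python
import math

# Toy 3-d "word embeddings" standing in for trained ones;
# the values are invented for illustration only.
EMB = {
    "lecture":   [0.9, 0.1, 0.0],
    "classroom": [0.8, 0.2, 0.1],
    "beach":     [0.0, 0.9, 0.3],
}

def embed(concepts):
    """Mean embedding of a set of concept labels."""
    vecs = [EMB[c] for c in concepts if c in EMB]
    if not vecs:
        return [0.0] * 3
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

lec = embed({"lecture"})
# A lecture-like concept set should land nearer "classroom" than "beach".
print(cosine(lec, embed({"classroom"})) > cosine(lec, embed({"beach"})))
```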
“…This module formulates and expands an input query in order to translate it into a set of high-level concepts C Q , as proposed in [9]. First, we search for one or more high-level concepts that are semantically similar to the entire query, using the Explicit Semantic Analysis (ESA) measure [10].…”
Section: Automatic Query Formulation and Expansion Using High-level Concepts
confidence: 99%
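The query-to-concept translation can be sketched as selecting the pool concepts whose similarity to the full query clears a threshold. Here `toy_sim`, a simple word-overlap score, is a placeholder for the ESA measure, and the pool entries and parameters are invented:

```python
def formulate_query(query, concept_pool, sim, k=2, threshold=0.5):
    """Map a free-text query to its top-k most similar high-level
    concepts; `sim` stands in for the ESA similarity measure."""
    scored = [(c, sim(query, c)) for c in concept_pool]
    scored = [cs for cs in scored if cs[1] >= threshold]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in scored[:k]]

def toy_sim(query, concept):
    # Word-overlap placeholder for ESA: fraction of the concept's
    # words that also appear in the query.
    q = set(query.lower().split())
    c = set(concept.lower().split())
    return len(q & c) / len(c)

pool = ["person riding bicycle", "dog", "city street"]
print(formulate_query("a person riding a bicycle in the city", pool, toy_sim))
```

With a real ESA measure the same selection logic applies; only `sim` changes.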