Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2020
DOI: 10.1145/3397271.3401151

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Abstract: The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm for retrieval with simple queries, which is usually ineffective for complex queries that carry far richer semantics. Recently, the embedding-based paradigm has emerged as a popular alternative. It aims to map queries and videos into a shared embedding space where semantically similar texts and videos lie close to each other…
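As a rough sketch of the embedding-based paradigm the abstract describes (two independent encoders projecting into one shared space, scored by cosine similarity), the snippet below is an illustrative assumption, not the paper's actual tree-augmented architecture; the module names, layer choices, and dimensions are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingModel(nn.Module):
    """Minimal sketch of the embedding-based retrieval paradigm:
    project text and video features into one shared space and
    compare them with cosine similarity."""
    def __init__(self, text_dim=300, video_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)    # query branch (assumed)
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch (assumed)

    def forward(self, text_feat, video_feat):
        # L2-normalize so the dot product below is cosine similarity
        q = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        return q @ v.t()  # (num_queries, num_videos) similarity matrix
```

At retrieval time, videos are ranked for a query by sorting one row of this similarity matrix; semantically similar pairs should receive the highest scores after training.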

Cited by 111 publications (41 citation statements)
References 40 publications
“…Experiments on a large-scale product image retrieval dataset verify the viability of our model for language-based product image retrieval, and the ablation study shows the importance of both multilevel representations and multi-granularity similarities. In the future, we would like to explore our model for general cross-modal retrieval tasks, such as text-to-video retrieval [27,28,29].…”
Section: Discussion
confidence: 99%
“…Recently, significant advances have been made in learning joint representations (e.g. [33,68,69]), but applying these approaches, which typically rely on a set of concept labels, to a topically very broad [4] video collection such as V3C1 is still…”
Table 1: Selected search approaches integrated and frequently used in the participating systems, marked with a reference to the paper describing features/method or with ✓/○ for a common/custom feature; V3C1 means meta-data provided with the V3C1 dataset [57].
Section: Overview of Methods Integrated to Compared Systems
confidence: 99%
“…We briefly review representative methods for cross-modal text-video retrieval, which follow the trend of learning a joint embedding space to measure the distance between textual and video representations. These methods roughly fall into two categories: 1) cross-modal interaction-free methods [9,10,12,21,28,30-32,39] and 2) cross-modal interaction methods [7,13,24,34,40,41,44].…”
Section: Related Work 2.1 Text-Video Retrieval
confidence: 99%
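To make the two categories in the quote above concrete, the hedged sketch below contrasts an interaction-free scorer (independent encoders, one dot product) with an interaction scorer (query tokens attending over frame features before scoring). It is a generic illustration of the taxonomy, not any cited system; every dimension and module choice is an assumption.

```python
import torch
import torch.nn as nn

class InteractionFreeScorer(nn.Module):
    """Category 1: each modality is encoded independently,
    so video embeddings can be pre-computed and indexed."""
    def __init__(self, dim=512):
        super().__init__()
        self.text_enc = nn.Linear(300, dim)    # stand-in text encoder
        self.video_enc = nn.Linear(2048, dim)  # stand-in video encoder

    def forward(self, text, video):
        # one dot product per (text, video) pair
        return (self.text_enc(text) * self.video_enc(video)).sum(-1)

class InteractionScorer(nn.Module):
    """Category 2: query tokens attend over video frames,
    trading pre-computability for finer-grained matching."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, text_tokens, frame_feats):
        # text_tokens: (B, T, dim); frame_feats: (B, F, dim)
        fused, _ = self.attn(text_tokens, frame_feats, frame_feats)
        return self.score(fused.mean(dim=1)).squeeze(-1)
```

The practical trade-off is the one the taxonomy implies: interaction-free models scale to large collections via offline indexing, while interaction models pay a per-pair cost for richer cross-modal matching.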
“…They also find that leveraging the contrastive loss can address visually misaligned narrations from uncurated instructional videos and improve video-text representations. Yang et al. [39] present a tree-augmented cross-modal encoding model that pairs a tree-augmented query encoder, which derives structure-aware query representations, with a temporal attentive video encoder that models the temporal characteristics of videos. Dong et al. [9,10] adopt three branches, i.e., mean pooling, Bi-GRU and CNN, to encode sequential videos and texts and learn a hybrid common space for video-text similarity prediction.…”
Section: Related Work 2.1 Text-Video Retrieval
confidence: 99%
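The temporal attentive video encoder mentioned in the quote can be pictured as attention-weighted pooling over frame features. The sketch below is a minimal rendering of that idea under stated assumptions; the frame dimension, scoring layer, and optional query conditioning are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentivePooling(nn.Module):
    """Pool a sequence of frame features into one video vector,
    weighting frames by a learned (optionally query-conditioned) score."""
    def __init__(self, frame_dim=2048, query_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, query_dim)
        self.scorer = nn.Linear(query_dim, 1)

    def forward(self, frames, query=None):
        # frames: (B, T, frame_dim); query: (B, query_dim) or None
        h = torch.tanh(self.frame_proj(frames))        # (B, T, query_dim)
        if query is not None:
            h = h * query.unsqueeze(1)                 # condition weights on the query
        weights = F.softmax(self.scorer(h), dim=1)     # (B, T, 1), sums to 1 over time
        return (weights * frames).sum(dim=1)           # (B, frame_dim)
```

Compared with plain mean pooling (one of the branches attributed to Dong et al. above), the learned weights let informative frames dominate the pooled video representation instead of averaging all frames uniformly.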