Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval 2020
DOI: 10.1145/3397271.3401151

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Abstract: The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm for retrieval with simple queries, which is usually ineffective for complex queries that carry far richer semantics. Recently, the embedding-based paradigm has emerged as a popular alternative. It aims to map queries and videos into a shared embedding space where semantically similar texts and videos lie close to each other…
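As a rough sketch of the embedding-based paradigm the abstract describes (two independent encoders projecting into one shared space, scored by cosine similarity), the snippet below is an illustrative assumption, not the paper's actual tree-augmented architecture; the module names, layer choices, and dimensions are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingModel(nn.Module):
    """Minimal sketch of the embedding-based retrieval paradigm:
    project text and video features into one shared space and
    compare them with cosine similarity."""
    def __init__(self, text_dim=300, video_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)    # query branch (assumed)
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch (assumed)

    def forward(self, text_feat, video_feat):
        # L2-normalize so the dot product below is cosine similarity
        q = F.normalize(self.text_proj(text_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        return q @ v.t()  # (num_queries, num_videos) similarity matrix
```

At retrieval time, videos are ranked for a query by sorting one row of this similarity matrix; semantically similar pairs should receive the highest scores after training.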

Cited by 111 publications (41 citation statements)
References 40 publications
“…Experiments on a large-scale product image retrieval dataset verify the viability of our model for language-based product image retrieval, and the ablation study shows the importance of both multilevel representations and multi-granularity similarities. In the future, we would like to explore our model for general cross-modal retrieval tasks, such as text-to-video retrieval [27,28,29].…”
Section: Discussion
confidence: 99%
“…Recently, significant advances have been made in learning joint representations (e.g. [33,68,69]), but applying these approaches, which typically rely on a set of concept labels, to a topically very broad [4] video collection such as V3C1 is still…”
Table 1: Selected search approaches integrated and frequently used in the participating systems, marked with a reference to the paper describing features/method or with ✓/○ for a common/custom feature; V3C1 means meta-data provided with the V3C1 dataset [57].
Section: Overview of Methods Integrated to Compared Systems
confidence: 99%
“…We briefly review representative methods for cross-modal text-video retrieval, which follow the trend of learning a joint embedding space to measure the distance between textual and video representations. These methods roughly fall into two categories: 1) cross-modal interaction-free methods [9,10,12,21,28,30-32,39] and 2) cross-modal interaction methods [7,13,24,34,40,41,44].…”
Section: Related Work 2.1 Text-Video Retrieval
confidence: 99%
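To make the two categories in the quote above concrete, the hedged sketch below contrasts an interaction-free scorer (independent encoders, one dot product) with an interaction scorer (query tokens attending over frame features before scoring). It is a generic illustration of the taxonomy, not any cited system; every dimension and module choice is an assumption.

```python
import torch
import torch.nn as nn

class InteractionFreeScorer(nn.Module):
    """Category 1: each modality is encoded independently,
    so video embeddings can be pre-computed and indexed."""
    def __init__(self, dim=512):
        super().__init__()
        self.text_enc = nn.Linear(300, dim)    # stand-in text encoder
        self.video_enc = nn.Linear(2048, dim)  # stand-in video encoder

    def forward(self, text, video):
        # one dot product per (text, video) pair
        return (self.text_enc(text) * self.video_enc(video)).sum(-1)

class InteractionScorer(nn.Module):
    """Category 2: query tokens attend over video frames,
    trading pre-computability for finer-grained matching."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, text_tokens, frame_feats):
        # text_tokens: (B, T, dim); frame_feats: (B, F, dim)
        fused, _ = self.attn(text_tokens, frame_feats, frame_feats)
        return self.score(fused.mean(dim=1)).squeeze(-1)
```

The practical trade-off is the one the taxonomy implies: interaction-free models scale to large collections via offline indexing, while interaction models pay a per-pair cost for richer cross-modal matching.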
“…They also find that leveraging the contrastive loss can address visually misaligned narrations from uncurated instructional videos and improve video-text representations. Yang et al. [39] present a tree-augmented cross-modal encoding model that pairs a tree-augmented query encoder, which derives structure-aware query representations, with a temporal attentive video encoder that models the temporal characteristics of videos. Dong et al. [9,10] adopt three branches, i.e., mean pooling, Bi-GRU and CNN, to encode sequential videos and texts and learn a hybrid common space for video-text similarity prediction.…”
Section: Related Work 2.1 Text-Video Retrieval
confidence: 99%
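The temporal attentive video encoder mentioned in the quote can be pictured as attention-weighted pooling over frame features. The sketch below is a minimal rendering of that idea under stated assumptions; the frame dimension, scoring layer, and optional query conditioning are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentivePooling(nn.Module):
    """Pool a sequence of frame features into one video vector,
    weighting frames by a learned (optionally query-conditioned) score."""
    def __init__(self, frame_dim=2048, query_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, query_dim)
        self.scorer = nn.Linear(query_dim, 1)

    def forward(self, frames, query=None):
        # frames: (B, T, frame_dim); query: (B, query_dim) or None
        h = torch.tanh(self.frame_proj(frames))        # (B, T, query_dim)
        if query is not None:
            h = h * query.unsqueeze(1)                 # condition weights on the query
        weights = F.softmax(self.scorer(h), dim=1)     # (B, T, 1), sums to 1 over time
        return (weights * frames).sum(dim=1)           # (B, frame_dim)
```

Compared with plain mean pooling (one of the branches attributed to Dong et al. above), the learned weights let informative frames dominate the pooled video representation instead of averaging all frames uniformly.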