2017
DOI: 10.1016/j.cviu.2017.07.001

Simple to complex cross-modal learning to rank

Abstract: The heterogeneity-gap between different modalities brings a significant challenge to multimedia information retrieval. Some studies formalize the cross-modal retrieval tasks as a ranking problem and learn a shared multi-modal embedding space to measure the cross-modality similarity. However, previous methods often establish the shared embedding space based on linear mapping functions which might not be sophisticated enough to reveal more complicated inter-modal correspondences. Additionally, current studies as…
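
The formulation the abstract describes as prior work (linear maps into a shared embedding space, with retrieval treated as ranking by cross-modal similarity) can be illustrated with a minimal sketch. This is not the paper's method, which argues that linear maps may be too simple. The function names, the cosine similarity, the hinge margin, and the toy matrices below are all assumptions made for illustration.

```python
import math

def project(W, x):
    """Linear map into the shared space: multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors in the shared space."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

def pairwise_ranking_loss(query, relevant, irrelevant, margin=0.1):
    """Hinge loss: zero once the relevant item outscores the irrelevant one by `margin`."""
    return max(0.0, margin - cosine(query, relevant) + cosine(query, irrelevant))

def rank(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy data (made up): 3-d text features, 2-d image features, 2-d shared space.
W_TEXT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # hypothetical text-side mapping
W_IMAGE = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical image-side mapping

text_query = project(W_TEXT, [1.0, 0.0, 5.0])
images = [project(W_IMAGE, v) for v in ([0.0, 1.0], [0.9, 0.1], [-1.0, 0.0])]

order = rank(text_query, images)                                 # text-to-image retrieval
loss = pairwise_ranking_loss(text_query, images[1], images[0])   # correctly ordered pair
```

In a learned model the two matrices would be fitted by minimizing the ranking loss over many (query, relevant, irrelevant) triples; here they are fixed so the ranking step can be read in isolation.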

Cited by 73 publications (15 citation statements)
References 30 publications
“…Huang et al [19] integrated quadruplet ranking loss and semi-supervised contrastive loss for modeling cross-modal semantic similarity in the deep model. Luo et al [20] proposed learning to rank with nonlinear mapping functions for cross-modal data by using the self-paced learning with diversity. Peng et al [21] adopted a two-stage learning based on deep neural network, in which intra-modal and inter-modal correlation are simultaneously modeled for feature learning and common representation learning stages, respectively.…”
Section: Related Work (mentioning)
confidence: 99%
“…Along with deep learning, deep metric learning has been developed in many visual understanding tasks [17], such as face recognition, image classification, visual search, visual tracking, person re-identification, and multi-modal matching. In the multi-modal task, hierarchical nonlinear transformations of deep neural network (DNN) [18][19][20][21] are utilized to learn the common multi-modal representations, in which the parameters of DNN for different modalities are optimized by modeling the similar and dissimilar constraints to preserve cross-modal relative ranking information. These methods are still based on hand-crafted image features, which are not optimal for multi-modal retrieval task since the stage of feature extracting and the stage of common representation learning are separated.…”
Section: Introduction (mentioning)
confidence: 99%
“…α = 0.5), while the weights α_i are optimized dynamically in the proposed model. It is more sensible for the important modality to hold the dominant position in the optimization [13], [28].…”
Section: Regularized Binary Latent Model (mentioning)
confidence: 99%
“…Developing a reliable object tracker is very important for intelligent video analysis, and it plays the key role in motion perception in videos (Chang et al (2017b,a); Chang and Yang (2017); Li et al (2017b); Ma et al (2018); Wang et al (2017, 2016b); Luo et al (2017)). While significant progress in object tracking research has been made and many object tracking algorithms have been developed with promising performance (Ye et al (2015, 2016, 2017, 2018b); Zhou et al (2018b,a); Ye et al (2018a); Liu et al (2018); Lan et al (2018a); Zhang et al (2013b, 2017d,c, 2018c); Song et al (2017, 2018); Zhang et al (2017b, 2016, 2018a); Hou et al (2017); Yang et al (2016); Zhong et al (2014); Guo et al (2017); Ding et al (2018); Shao et al (2018); Yang et al (2018b,a); Pang et al (2017)), it is worth noting that most of these trackers are designed for tracking objects in RGB image sequences, in which they model the object's appearance via the visual features extracted from RGB video frames.…”
Section: Introduction (mentioning)
confidence: 99%