2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI)
DOI: 10.1109/cbmi.2016.7500252
A hybrid graph-based and non-linear late fusion approach for multimedia retrieval

Abstract: Nowadays, multimedia retrieval has become a task of high importance, due to the need for efficient and fast access to very large and heterogeneous multimedia collections. An interesting challenge within the aforementioned task is the efficient combination of different modalities in a multimedia object and especially the fusion between textual and visual information. The fusion of multiple modalities for retrieval in an unsupervised way has been mostly based on early, weighted linear, graph-based and diffusion-…

Cited by 13 publications (15 citation statements) | References 24 publications
“…Towards this direction, we provide a novel framework for multimodal fusion of visual and textual similarities, which are based on visual features, visual concepts and textual concepts. Our method extends our previous work [8] using Partial Least Squares (PLS) Regression to combine multiple views of the same modality, such as SIFT descriptors and visual features based on Deep Convolutional Neural Networks. The proposed method is motivated by the PLS approach [24], due to its effectiveness in multimodal hashing, and is compared to several baseline methods in unsupervised multimedia retrieval, such as weighted linear, non-linear, diffusion-based and advanced graph-based models.…”
Section: Introduction
confidence: 82%
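The PLS view-combination idea quoted above can be made concrete with a short sketch. This is a minimal example on toy random features: the names sift_feats and dcnn_feats are hypothetical, and averaging the two latent projections is one illustrative fusion choice, not the authors' exact formulation.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_images = 200
sift_feats = rng.standard_normal((n_images, 128))  # stand-in for aggregated SIFT descriptors
dcnn_feats = rng.standard_normal((n_images, 256))  # stand-in for DCNN-based visual features

# Learn a shared latent space between the two views of the visual modality.
pls = PLSRegression(n_components=32)
pls.fit(sift_feats, dcnn_feats)
x_scores, y_scores = pls.transform(sift_feats, dcnn_feats)

# One simple fused representation: the mean of the two latent projections.
fused = (x_scores + y_scores) / 2.0
print(fused.shape)  # (200, 32)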
“…The need to extend the model of Equation (10) to multiple modalities has been highlighted in [9], and the non-linear graph-based fusion approach of Equation (16) has been presented in [8] and integrated into multimedia search engines [20]. In this context, we further elaborate our non-linear graph-based fusion of M modalities, also combining Partial Least Squares (PLS) Regression in the overall multimedia retrieval framework.…”
Section: Multimedia Database
confidence: 99%
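Since the exact form of Equation (16) is not reproduced in this report, the following generic sketch only illustrates the flavor of non-linear graph-based fusion: per-modality similarity graphs are coupled through an element-wise product, and the query similarity vectors are propagated over the joint graph. The product coupling and the alpha weighting are assumptions for illustration, not the cited paper's formula.

import numpy as np

def fuse_similarities(sims, query_vecs, alpha=0.5):
    """sims: list of (n, n) similarity matrices, one per modality;
    query_vecs: list of (n,) query-to-item similarity vectors."""
    n = sims[0].shape[0]
    # Non-linear coupling: element-wise product of the modality graphs.
    joint = np.ones((n, n))
    for S in sims:
        joint *= S
    # Mix each query vector with its propagation over the joint graph.
    scores = np.zeros(n)
    for q in query_vecs:
        scores += alpha * q + (1.0 - alpha) * joint @ q
    return scores

# Toy usage with M = 2 modalities over 5 items.
rng = np.random.default_rng(0)
sims = [rng.random((5, 5)) for _ in range(2)]
qvecs = [rng.random(5) for _ in range(2)]
print(fuse_similarities(sims, qvecs))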
“…1, with multimodal fusion of low- and high-level visual and textual information and color-based clustering, served by the VERGE Graphical User Interface (GUI). The overall system is novel, since it integrates the fusion of multiple modalities [4], in a hybrid graph-based and non-linear way [5], with several functionalities (e.g., multimedia retrieval, image retrieval, search by visual or textual concept, etc.)…”
Section: Multimedia Retrieval System
confidence: 99%
“…In brief, the multimedia retrieval module [5] constructs one similarity matrix per modality and one query-based similarity vector per modality, given M modalities and a query, but only for the results of a text-based search, assuming that the text description is the main semantic source of information [6]. A graph-based fusion of multiple modalities [4] is combined with all similarity vectors in a non-linear way [5], which in general may fuse multiple modalities. In this context, we employ M = 3 modalities, namely visual features (RGB-SIFT) locally aggregated into one vector representation using VLAD encoding (Section II.B.1), text description (Section II.C), 346 high-level visual concepts (Section II.B.2), and textual high-level concepts, which are DBpedia entities.…”
Section: A Multimedia Retrieval Module Based On Multimodal Fusion
confidence: 99%
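The pipeline described in this statement, a text-based search first and then one similarity matrix plus one query-based similarity vector per modality on the filtered set, can be sketched as below. The feature matrices and the text scorer are toy placeholders, and the dictionary keys are hypothetical names rather than the actual VERGE components.

import numpy as np

def cosine_sim(A, B):
    # Row-normalize both matrices, then take pairwise dot products.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

rng = np.random.default_rng(1)
n_items, top_k = 1000, 50
features = {  # hypothetical features for the M = 3 modalities
    "vlad_rgb_sift": rng.standard_normal((n_items, 128)),
    "visual_concepts": rng.random((n_items, 346)),
    "textual_concepts": rng.random((n_items, 100)),
}
text_scores = rng.random(n_items)       # stand-in for the text-based search
top = np.argsort(-text_scores)[:top_k]  # keep only the top text results

query = {m: f[:1] for m, f in features.items()}  # toy query features
sim_matrices = {m: cosine_sim(f[top], f[top]) for m, f in features.items()}
sim_vectors = {m: cosine_sim(query[m], f[top])[0] for m, f in features.items()}
print({m: (sim_matrices[m].shape, sim_vectors[m].shape) for m in features})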
“…The novelty of this work is to propose a two-layer fusion method of visual descriptors, visual concepts and color features for combining multiple and diverse queries, where temporal information is also exploited. In contrast to [2], we use compound queries and also exploit the temporal order of video shots. We propose an integrated, unifying approach that combines graph-based fusion of similarities at the feature level with late fusion at the decision level.…”
Section: Introduction
confidence: 99%
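As an illustration of the two-layer idea, the sketch below fuses the per-sub-query scores of a compound query at the decision level and then exploits the temporal order of shots with a simple moving-average smoothing. The uniform weights and the smoothing kernel are assumptions, not the method of the cited work.

import numpy as np

def decision_level_fusion(subquery_scores, weights=None):
    """subquery_scores: list of (n_shots,) score arrays, one per sub-query."""
    S = np.stack(subquery_scores)
    w = np.full(len(S), 1.0 / len(S)) if weights is None else np.asarray(weights)
    return w @ S

def temporal_smooth(scores, window=3):
    # Exploit shot order: average each shot's score with its neighbours.
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

rng = np.random.default_rng(2)
per_subquery = [rng.random(20) for _ in range(3)]  # e.g. a 3-part compound query
fused = temporal_smooth(decision_level_fusion(per_subquery))
ranking = np.argsort(-fused)  # best shots first
print(ranking[:5])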