2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI)
DOI: 10.1109/cbmi.2016.7500252
A hybrid graph-based and non-linear late fusion approach for multimedia retrieval

Abstract: Nowadays, multimedia retrieval has become a task of high importance, due to the need for efficient and fast access to very large and heterogeneous multimedia collections. An interesting challenge within the aforementioned task is the efficient combination of different modalities in a multimedia object and especially the fusion between textual and visual information. The fusion of multiple modalities for retrieval in an unsupervised way has been mostly based on early, weighted linear, graph-based and diffusion-…

Cited by 13 publications (15 citation statements) | References 24 publications
“…Towards this direction, we provide a novel framework for multimodal fusion of visual and textual similarities, which are based on visual features, visual concepts and textual concepts. Our method extends our previous work [8] using Partial Least Squares (PLS) Regression to combine multiple views of the same modality, such as SIFT descriptors and visual features based on Deep Convolutional Neural Networks. The proposed method is motivated by the PLS approach [24], due to its effectiveness in multimodal hashing, and is compared to several baseline methods in unsupervised multimedia retrieval, such as weighted linear, non-linear, diffusion-based and advanced graph-based models.…”
Section: Introduction
confidence: 82%
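The PLS view-combination idea quoted above can be made concrete with a short sketch. This is a minimal example on toy random features: the names sift_feats and dcnn_feats are hypothetical, and averaging the two latent projections is one illustrative fusion choice, not the authors' exact formulation.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_images = 200
sift_feats = rng.standard_normal((n_images, 128))  # stand-in for aggregated SIFT descriptors
dcnn_feats = rng.standard_normal((n_images, 256))  # stand-in for DCNN-based visual features

# Learn a shared latent space between the two views of the visual modality.
pls = PLSRegression(n_components=32)
pls.fit(sift_feats, dcnn_feats)
x_scores, y_scores = pls.transform(sift_feats, dcnn_feats)

# One simple fused representation: the mean of the two latent projections.
fused = (x_scores + y_scores) / 2.0
print(fused.shape)  # (200, 32)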
“…The need to extend the model of Equation (10) to multiple modalities has been highlighted in [9], and the non-linear graph-based fusion approach of Equation (16) has been presented in [8] and integrated into multimedia search engines [20]. In this context, we further elaborate our non-linear graph-based fusion of M modalities, also combining Partial Least Squares (PLS) Regression in the overall multimedia retrieval framework.…”
Section: Multimedia Database
confidence: 99%
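Since the exact form of Equation (16) is not reproduced in this report, the following generic sketch only illustrates the flavor of non-linear graph-based fusion: per-modality similarity graphs are coupled through an element-wise product, and the query similarity vectors are propagated over the joint graph. The product coupling and the alpha weighting are assumptions for illustration, not the cited paper's formula.

import numpy as np

def fuse_similarities(sims, query_vecs, alpha=0.5):
    """sims: list of (n, n) similarity matrices, one per modality;
    query_vecs: list of (n,) query-to-item similarity vectors."""
    n = sims[0].shape[0]
    # Non-linear coupling: element-wise product of the modality graphs.
    joint = np.ones((n, n))
    for S in sims:
        joint *= S
    # Mix each query vector with its propagation over the joint graph.
    scores = np.zeros(n)
    for q in query_vecs:
        scores += alpha * q + (1.0 - alpha) * joint @ q
    return scores

# Toy usage with M = 2 modalities over 5 items.
rng = np.random.default_rng(0)
sims = [rng.random((5, 5)) for _ in range(2)]
qvecs = [rng.random(5) for _ in range(2)]
print(fuse_similarities(sims, qvecs))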
“…1, with multimodal fusion of low- and high-level visual and textual information and color-based clustering, served by the VERGE Graphical User Interface (GUI). The overall system is novel, since it integrates the fusion of multiple modalities [4], in a hybrid graph-based and non-linear way [5], with several functionalities (e.g., multimedia retrieval, image retrieval, search by visual or textual concept, etc.)…”
Section: Multimedia Retrieval System
confidence: 99%
“…In brief, the multimedia retrieval module [5] constructs one similarity matrix per modality and one query-based similarity vector per modality, given M modalities and a query, but only for the results of a text-based search, assuming that the text description is the main semantic source of information [6]. A graph-based fusion of multiple modalities [4] is combined with all similarity vectors in a non-linear way [5], which in general may fuse multiple modalities. In this context, we employ M = 3 modalities, namely visual features (RGB-SIFT) locally aggregated into one vector representation using VLAD encoding (Section II.B.1), text description (Section II.C), 346 high-level visual concepts (Section II.B.2), and textual high-level concepts, which are DBpedia entities.…”
Section: A Multimedia Retrieval Module Based On Multimodal Fusion
confidence: 99%
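The pipeline described in this statement, a text-based search first and then one similarity matrix plus one query-based similarity vector per modality on the filtered set, can be sketched as below. The feature matrices and the text scorer are toy placeholders, and the dictionary keys are hypothetical names rather than the actual VERGE components.

import numpy as np

def cosine_sim(A, B):
    # Row-normalize both matrices, then take pairwise dot products.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

rng = np.random.default_rng(1)
n_items, top_k = 1000, 50
features = {  # hypothetical features for the M = 3 modalities
    "vlad_rgb_sift": rng.standard_normal((n_items, 128)),
    "visual_concepts": rng.random((n_items, 346)),
    "textual_concepts": rng.random((n_items, 100)),
}
text_scores = rng.random(n_items)       # stand-in for the text-based search
top = np.argsort(-text_scores)[:top_k]  # keep only the top text results

query = {m: f[:1] for m, f in features.items()}  # toy query features
sim_matrices = {m: cosine_sim(f[top], f[top]) for m, f in features.items()}
sim_vectors = {m: cosine_sim(query[m], f[top])[0] for m, f in features.items()}
print({m: (sim_matrices[m].shape, sim_vectors[m].shape) for m in features})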
“…The novelty of this work is to propose a two-layer fusion method of visual descriptors, visual concepts and color features for combining multiple and diverse queries, where temporal information is also exploited. In contrast to [2], we use compound queries and also exploit the temporal order of video shots. We propose an integrated, unifying approach that combines graph-based fusion of similarities at the feature level with late fusion at the decision level.…”
Section: Introduction
confidence: 99%
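As an illustration of the two-layer idea, the sketch below fuses the per-sub-query scores of a compound query at the decision level and then exploits the temporal order of shots with a simple moving-average smoothing. The uniform weights and the smoothing kernel are assumptions, not the method of the cited work.

import numpy as np

def decision_level_fusion(subquery_scores, weights=None):
    """subquery_scores: list of (n_shots,) score arrays, one per sub-query."""
    S = np.stack(subquery_scores)
    w = np.full(len(S), 1.0 / len(S)) if weights is None else np.asarray(weights)
    return w @ S

def temporal_smooth(scores, window=3):
    # Exploit shot order: average each shot's score with its neighbours.
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

rng = np.random.default_rng(2)
per_subquery = [rng.random(20) for _ in range(3)]  # e.g. a 3-part compound query
fused = temporal_smooth(decision_level_fusion(per_subquery))
ranking = np.argsort(-fused)  # best shots first
print(ranking[:5])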