2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv48630.2021.00118

Compositional Learning of Image-Text Query for Image Retrieval

Cited by 58 publications (24 citation statements)
References 14 publications
“…The proposed methodology shares common bases with the studies in [1], [25], and [47]: the query inputs, the use of neural networks, and evaluation on the Fashion 200K dataset.…”
Section: Methods (mentioning)
confidence: 99%
“…In [47], an autoencoder called ComposeAE composes the multi-modal query features. Image features were extracted using a ResNet-17 CNN.…”
Section: A Comparative Study of Compositional Methods (mentioning)
confidence: 99%
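To make the ComposeAE mention above concrete, here is a minimal sketch of composing an image-text query into a single retrieval embedding. It is a simplification, not the paper's implementation: the actual ComposeAE maps the image-text pair into a complex space and trains with autoencoder-style reconstruction objectives. The `ComposeSketch` module, its dimensions, and the plain MLP composition below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class ComposeSketch(nn.Module):
    """Hypothetical, simplified ComposeAE-style composition module:
    encode the reference image, project the text embedding, and fuse
    both into one query embedding for nearest-neighbor retrieval."""

    def __init__(self, text_dim=768, embed_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)  # ResNet-family image encoder
        backbone.fc = nn.Identity()               # drop the classifier head
        self.image_encoder = backbone             # outputs 512-d features
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Stand-in for the learned composition (the real model uses a
        # complex-space mapping with reconstruction losses).
        self.compose = nn.Sequential(
            nn.Linear(512 + embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, image, text_emb):
        img_feat = self.image_encoder(image)           # (B, 512)
        txt_feat = self.text_proj(text_emb)            # (B, embed_dim)
        query = self.compose(torch.cat([img_feat, txt_feat], dim=1))
        return nn.functional.normalize(query, dim=1)   # unit norm for retrieval

# Toy usage: compose two image-text queries from random inputs.
model = ComposeSketch()
q = model(torch.randn(2, 3, 224, 224), torch.randn(2, 768))
print(q.shape)  # torch.Size([2, 512])
```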
“…PerVL arises in various scenarios. In image retrieval, a user may tag a few of their images and wish to retrieve other photos of that concept in a specific visual context (Chen et al., 2020; Anwaar et al., 2021); in human-robot interaction, a worker may show a specific tool to a robotic arm and instruct how to use it (Wang et al., 2022; Lynch & Sermanet, 2020); in video security applications, an operator may search for one specific known item in the context of other items or people described using language.…”
Section: Introduction (mentioning)
confidence: 99%
“…MIT-States [9] contains 63,440 images and 245 object types; each object type is described by an average of 9 adjectives. These adjectives emphasize the state of the object and the state transformations between images of similar objects, such as "old" and "new". According to the citations, most of the works exploiting this dataset are image retrieval models [28,29,30], which use it as a benchmark to compare against other models on R@k.…”
Section: Introduction (mentioning)
confidence: 99%
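The R@k (Recall at k) metric referenced in that excerpt is straightforward to state in code: the fraction of queries whose ground-truth target appears among the k nearest gallery items. Below is a minimal sketch assuming cosine similarity on unit-normalized embeddings; the function name `recall_at_k` and the tensor shapes are illustrative assumptions, not an API from any of the cited works.

```python
import torch

def recall_at_k(query_emb, gallery_emb, target_idx, k=10):
    """Fraction of queries whose correct gallery item ranks in the top k.

    query_emb:   (Q, D) composed query embeddings
    gallery_emb: (G, D) candidate image embeddings
    target_idx:  (Q,)   index of each query's correct gallery item
    """
    q = torch.nn.functional.normalize(query_emb, dim=1)
    g = torch.nn.functional.normalize(gallery_emb, dim=1)
    sims = q @ g.T                              # cosine similarities, (Q, G)
    topk = sims.topk(k, dim=1).indices          # k best candidates per query
    hits = (topk == target_idx.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy usage with random embeddings: 5 queries against a 100-item gallery.
r10 = recall_at_k(torch.randn(5, 512), torch.randn(100, 512),
                  torch.randint(0, 100, (5,)), k=10)
print(f"R@10 = {r10:.2f}")
```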