Abstract. Automatic image annotation aims at predicting a set of textual labels for an image that describe its semantics. These are usually taken from an annotation vocabulary of a few hundred labels. Because of the large vocabulary, there is a high variance in the number of images corresponding to different labels ("class-imbalance"). Additionally, due to the limitations of manual annotation, a significant number of available images are not annotated with all the relevant labels ("weak-labelling"). These two issues adversely affect the performance of most existing image annotation models. In this work, we propose 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, that addresses these two issues in the image annotation task. The first step of 2PKNN uses "image-to-label" similarities, while the second step uses "image-to-image" similarities, thus combining the benefits of both. Since the performance of nearest-neighbour based methods greatly depends on how features are compared, we also propose a metric learning framework over 2PKNN that learns weights for multiple features as well as distances together. This is done in a large-margin set-up by generalizing a well-known (single-label) classification metric learning algorithm for multi-label prediction. For scalability, we implement it by alternating between stochastic sub-gradient descent and projection steps. Extensive experiments demonstrate that, though conceptually simple, 2PKNN alone performs comparably to the current state-of-the-art on three challenging image annotation datasets, and shows significant improvements after metric learning.
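The two-step idea can be illustrated with a short sketch. This is a minimal illustration under our own assumptions (the function names, the choice of K1/K2, and the exponential distance weighting are hypothetical and not taken from the paper):

```python
import numpy as np

def two_pass_knn(query_feat, train_feats, train_labels, num_labels, K1=5, K2=20):
    """Hypothetical sketch of a two-step KNN annotator.

    Step 1 (image-to-label): for every label, keep only the K1 training
    images closest to the query, yielding a more balanced candidate pool.
    Step 2 (image-to-image): transfer labels from the K2 nearest
    candidates, weighting each neighbour by its distance to the query.
    """
    dists = np.linalg.norm(train_feats - query_feat, axis=1)

    # Step 1: per-label semantic neighbourhoods (mitigates class imbalance).
    candidate_idx = []
    for label in range(num_labels):
        has_label = np.where(train_labels[:, label] == 1)[0]
        nearest = has_label[np.argsort(dists[has_label])[:K1]]
        candidate_idx.extend(nearest.tolist())
    candidate_idx = np.unique(candidate_idx)

    # Step 2: weighted label transfer from the pooled candidates.
    cand_dists = dists[candidate_idx]
    top = candidate_idx[np.argsort(cand_dists)[:K2]]
    weights = np.exp(-dists[top])          # closer neighbours count more
    scores = weights @ train_labels[top]   # per-label relevance scores
    return scores                          # rank labels by these scores
```

In the paper, multiple feature distances are additionally combined with weights learned in a large-margin metric learning step; the sketch above uses a single unweighted Euclidean distance for brevity.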
The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as "smiling" for face images, "naturalness" for outdoor images, etc. For learning such attributes, a Ranking SVM based formulation was proposed that uses globally represented pairs of annotated images. In this paper, we extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, we introduce a part-based representation for a pair of images that specifically compares corresponding parts. Then, with each part we associate a locally adaptive "significance-coefficient" that represents its discriminative ability with respect to a particular attribute. For each attribute, the significance-coefficients are learned simultaneously with a max-margin ranking model in an iterative manner. Compared to the baseline method, the new method is shown to achieve a significant improvement in relative attribute prediction accuracy. Additionally, it is also shown to improve relative feedback based interactive image search.
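For intuition only, the following sketch shows how a part-weighted ranking score might be formed; the per-part significance coefficients and the max-margin ranking loss are written in a simplified form, and all names and shapes here are assumptions rather than the paper's actual implementation:

```python
import numpy as np

def part_ranking_margin(parts_i, parts_j, w, alpha):
    """Hypothetical sketch: score the ordering "image i shows more of the
    attribute than image j" using part-level features.

    parts_i, parts_j : (P, d) arrays of per-part descriptors
    w                : (d,) ranking weight vector for the attribute
    alpha            : (P,) significance coefficients, one per part
    """
    # Each part votes with the difference of its descriptors; votes are
    # scaled by how discriminative that part is for this attribute.
    per_part = (parts_i - parts_j) @ w          # (P,) part-wise differences
    return float(alpha @ per_part)              # aggregated ranking score

def ranking_hinge_loss(margin):
    """Max-margin ranking loss for an ordered pair (i preferred over j)."""
    return max(0.0, 1.0 - margin)
```

In an iterative scheme of this kind, one would alternate between updating the ranking weights w with the coefficients alpha fixed, and re-estimating alpha with w fixed.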
[Figure 1 (Training / Testing panels): While training, given a dataset consisting of pairs of images and corresponding texts (here captions), we learn models for the two tasks (Im2Text and Text2Im) using a joint image-text representation. While testing for Im2Text, given a query image, we perform retrieval on a collection of only textual samples using the learned model. Similarly, for Text2Im, given a query text, retrieval is performed on a database consisting only of images.]

Automatically describing image content using text is one of the challenging and interesting research problems in computer vision. A complementary problem is to automatically associate semantically relevant image(s) with a given piece of text, commonly referred to as the image retrieval task. In this work, we address the problem of learning bilateral associations between visual and textual data. We study two complementary tasks: (i) predicting text(s) given an image ("Im2Text"), and (ii) predicting image(s) given a piece of text ("Text2Im"). While several existing methods (e.g., [1]) assume the presence of data from both modalities during the testing phase, the motivation of this work is similar to the few known works (e.g., [2]) that do not make such an assumption. This means that for Im2Text, given a query image, our method retrieves a ranked list of semantically relevant texts from a plain text corpus that has no associated images. Similarly, for Text2Im, given a query text, it retrieves a ranked list of images from an independent image collection without any associated textual meta-data. The major contributions of this work are: (1) We propose a novel Structural SVM based unified framework for both these tasks. We use vector representations for both visual (image) and textual data that are based on probability distributions over latent topics. From these, we form a joint feature vector using the tensor product of input and output representations. Because the output data is represented in the form of a vector, we use Manhattan (M) and Euclidean (E) distances as our loss functions. As the proposed approach performs the two complementary tasks (Im2Text and Text2Im) under a single unified framework, we refer to it as Bilateral Image-Text Retrieval (or BITR). Figure 1 explains the gist of our framework. (2) We examine the generalization of different methods across datasets when textual data is in the form of captions. For this, we learn models from one dataset, and perform retrieval on the other. To the best of our knowledge, ours is the first such study in this domain. We conduct experiments on three datasets (UIUC Pascal Sentence dataset, IAPR TC-12 benchmark, and SBU-Captioned Photo dataset), and compare our approach with WSABIE [3] and CCA. These are two well-known methods that can scale to large datasets and have been shown to work well for learning cross-modal associations. While CCA based methods have been used previously under such settings [2], WSABIE was originally proposed for the task of label-ranking and hence cannot be directly applied to captions. We do ...
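As a rough sketch of the scoring side of such a framework, the snippet below forms the joint feature as an outer (tensor) product of topic vectors and ranks a text-only corpus for a query image; the function names, the linear scoring model, and the two loss functions are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def joint_feature(img_topics, txt_topics):
    """Hypothetical sketch: joint image-text feature as the tensor
    (outer) product of the two topic-distribution vectors, flattened."""
    return np.outer(img_topics, txt_topics).ravel()

def score(w, img_topics, txt_topics):
    """Compatibility of an (image, text) pair under model weights w."""
    return float(w @ joint_feature(img_topics, txt_topics))

def im2text_retrieve(w, query_img_topics, text_corpus_topics, top_k=5):
    """Rank a text-only corpus for a query image (no paired images needed)."""
    scores = [score(w, query_img_topics, t) for t in text_corpus_topics]
    return np.argsort(scores)[::-1][:top_k]

# Possible loss functions between output topic vectors, as mentioned above.
def manhattan_loss(y_true, y_pred):
    return float(np.abs(y_true - y_pred).sum())

def euclidean_loss(y_true, y_pred):
    return float(np.linalg.norm(y_true - y_pred))
```

The Text2Im direction would mirror `im2text_retrieve`, scoring an image-only collection against a query text's topic vector with the same learned weights.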
We address the problem of automatic image annotation in large vocabulary datasets. In such datasets, for a given label, there could be several other labels that act as its confusing labels. Three possible factors for this are (i) incomplete-labeling ("cars" vs. "vehicle"), (ii) label-ambiguity ("flowers" vs. "blooms"), and (iii) structural-overlap ("lion" vs. "tiger"). While previous studies in this domain have mostly focused on nearest-neighbour based models, we show that even the conventional one-vs-rest SVM significantly outperforms several benchmark models. We also demonstrate that a simple modification of the SVM hinge-loss can significantly improve its performance. In particular, we introduce a tolerance parameter in the hinge-loss, which makes the new model more tolerant of classification errors on samples tagged with confusing labels than on other samples. This tolerance parameter is determined automatically using visual similarity and dataset statistics. Experimental evaluations demonstrate that our method (referred to as SVM with Variable Tolerance, or SVM-VT) shows promising results on the task of image annotation on three challenging datasets, and establishes a baseline for such models in this domain.
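A minimal sketch of the modified loss, assuming the tolerance enters the hinge as a per-sample relaxation of the unit margin (the exact form used in the paper may differ; the names and shapes here are hypothetical):

```python
import numpy as np

def svm_vt_hinge(scores, y, tolerance):
    """Hypothetical sketch of a hinge loss with a per-sample tolerance.

    scores    : (N,) decision values f(x_i) of the current label's classifier
    y         : (N,) binary labels in {-1, +1}
    tolerance : (N,) non-negative slack; larger values for samples tagged
                with confusing labels, so their violations are penalised less
    """
    # Standard hinge is max(0, 1 - y * f(x)); the tolerance term relaxes the
    # required margin for samples likely to carry confusing labels.
    margins = 1.0 - tolerance - y * scores
    return float(np.maximum(0.0, margins).mean())
```

In this reading, setting the tolerance to zero for all samples recovers the ordinary one-vs-rest hinge loss.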