2020
DOI: 10.1007/978-3-030-66415-2_42
Using Sentences as Semantic Representations in Large Scale Zero-Shot Learning

Cited by 7 publications (3 citation statements)
References 5 publications
“…Incorporating language in the zero/fewshot setting has been widely explored. Embedding language from class names or descriptions to obtain class "prototypes" is common in zero-shot learning, when no visual samples of the class are available [7,8,17,31,32]. Several works also aim to learn classes using their semantic attributes for better knowledge transfer [16,25,45].…”
Section: Related Work (mentioning)
confidence: 99%
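
The statement above summarizes the prototype-based formulation of zero-shot learning. As a rough illustrative sketch only (not code from the cited works), the snippet below assigns each test image to the unseen class whose semantic prototype, e.g. a word embedding of its class name or description, is most similar; the projection of image features into the semantic space is assumed to be given, and all arrays are placeholders.

```python
# Minimal nearest-prototype zero-shot classification (illustrative sketch).
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_predict(image_features, class_prototypes):
    """Assign each image to the unseen class with the most similar prototype.

    image_features:   (n_images, d) visual embeddings already projected into
                      the semantic space (the projection is assumed given).
    class_prototypes: (n_classes, d) semantic embeddings of the unseen classes,
                      e.g. word embeddings of the class names or descriptions.
    """
    sims = l2_normalize(image_features) @ l2_normalize(class_prototypes).T
    return sims.argmax(axis=1)

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(5, 300))                 # 5 unseen classes
images = prototypes[[2, 0, 4]] + 0.1 * rng.normal(size=(3, 300))
print(zero_shot_predict(images, prototypes))           # expected: [2 0 4]
```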
“…Following [PG11], the authors of [LCPLB20] further suggested to address the problem of bulk tagging [OM13] -users attributing the exact same tags to numerous photos -by ensuring that a tuple of words (wi, wj) can only appear once for each user during training, thus preventing a single user from having a disproportionate weight on the final embedding. Also, [LCLBC20] suggested to exploit the sentence descriptions of WordNet concepts, in addition to the class name embedding, to produce semantic representations better reflecting visual relations. Any of these two proposals allow to reach an accuracy between 17.2 and 17.8 on the 500 test classes of the ImageNet ZSL benchmark with the linear model from the semantic to the visual space (Section 3.1), compared to 14.4 with semantic prototypes based on standard embeddings.…”
Section: Semantic Features for Large Scale ZSL (mentioning)
confidence: 99%
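
The accuracy figures quoted above come from a linear model mapping semantic prototypes to the visual space, with prototypes enriched by sentence embeddings of WordNet definitions. The sketch below is a hedged reconstruction of those two ingredients, not the code of [LCLBC20] or [LCPLB20]: the choice of sentence encoder, the averaging of name and definition embeddings, and the ridge-regression closed form are assumptions made for illustration.

```python
# Illustrative sketch: WordNet-gloss-enriched prototypes and a linear
# semantic-to-visual map. Requires: nltk.download('wordnet').
import numpy as np
from nltk.corpus import wordnet as wn
from sentence_transformers import SentenceTransformer  # assumed encoder choice

sent_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def class_prototype(synset_id, name_embedding):
    """Average the class-name embedding with an embedding of the WordNet gloss.

    Assumes name_embedding has already been projected to the same dimension
    as the sentence encoder's output.
    """
    gloss = wn.synset(synset_id).definition()     # e.g. "dog.n.01"
    gloss_emb = sent_encoder.encode(gloss)
    return (name_embedding + gloss_emb) / 2.0

def fit_semantic_to_visual(S_seen, V_seen, lam=1.0):
    """Ridge regression: find W such that S_seen @ W approximates V_seen.

    S_seen: (n_seen_classes, d_sem) semantic prototypes of seen classes.
    V_seen: (n_seen_classes, d_vis) mean visual features of seen classes.
    """
    d = S_seen.shape[1]
    return np.linalg.solve(S_seen.T @ S_seen + lam * np.eye(d), S_seen.T @ V_seen)
```

Ridge regression keeps the semantic-to-visual map linear, matching the model referred to in Section 3.1 of the quoted passage.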
“…Ablation III: Effect of sentence embeddings for semantic matching. Sentence embeddings have recently been shown to be beneficial for zero-shot recognition in the image domain [22]. Here, we investigate their potential in the video domain.…”
Section: Ablation Studies on Object-Scene Compositions (mentioning)
confidence: 99%
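
As a minimal illustration of semantic matching with sentence embeddings in the video domain (an assumption-laden sketch, not the method of [22]): each class is represented by the mean sentence embedding of a few textual descriptions, and video-level features, assumed to be already projected into the same space, are scored against every class by cosine similarity.

```python
# Illustrative sketch of sentence-embedding-based semantic matching for video.
import numpy as np

def class_embeddings(descriptions_per_class, encode):
    """Mean sentence embedding per class.

    descriptions_per_class: list of lists of description sentences.
    encode: any sentence encoder mapping a string to a 1-D vector.
    """
    return np.stack([
        np.mean([encode(s) for s in sentences], axis=0)
        for sentences in descriptions_per_class
    ])

def match_scores(video_features, class_embs):
    """Cosine-similarity matrix of shape (n_videos, n_classes)."""
    v = video_features / np.linalg.norm(video_features, axis=1, keepdims=True)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return v @ c.T

# Toy usage with a stand-in "encoder" (deterministic within one interpreter run).
def toy_encode(sentence, dim=64):
    rng = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
    return rng.normal(size=dim)

class_embs = class_embeddings(
    [["a person kicking a ball on a field"], ["a dog running through water"]],
    toy_encode)
videos = class_embs + 0.05 * np.random.default_rng(1).normal(size=class_embs.shape)
print(match_scores(videos, class_embs).argmax(axis=1))  # expected: [0 1]
```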