Goal-Oriented Gaze Estimation for Zero-Shot Learning

Liu, Yang; Zhou, Lei; Bai, Xiao; Huang, Yifei; Gu, Lin; Zhou, Jun; Harada, Tatsuya

doi:10.1109/cvpr46437.2021.00379

Cited by 81 publications

(45 citation statements)

References 54 publications

“…Furthermore, the holistic visual features are limited to poor transferable from one domain to another domain (e.g., from seen to unseen classes) [45], [46]. More relevant to this work are the recent attention-based ZSL methods [28], [30], [31], [32], [47] that utilize attribute descriptions as guidance to discover the more discriminative region (or part) features. Unfortunately, they simply learn region embeddings (e.g., the whole bird body) neglecting the importance of discriminative attribute localization (e.g., the distinctive bird body parts).…”

Section: Zero-shot Learning (mentioning)

confidence: 99%

TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning

Chen, Hong, Xie et al. 2021

Preprint

Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is typically represented by attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant and sufficient visual-semantic interaction for advancing ZSL. Existing attention-based models have struggled to learn inferior region features in a single image by solely using unidirectional attention, which ignore the transferability and discriminative attribute localization of visual features. In this paper, we propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations in ZSL. TransZero++ consists of an attribute→visual Transformer sub-net (AVT) and a visual→attribute Transformer sub-net (VAT). Specifically, AVT first takes a feature augmentation encoder to alleviate the cross-dataset bias between ImageNet and ZSL benchmarks, and improves the transferability of visual features by reducing the entangled relative geometry relationships among region features. Then, an attribute→visual decoder is employed to localize the image regions most relevant to each attribute in a given image for attribute-based visual feature representations. Analogously, VAT uses the similar feature augmentation encoder to refine the visual features, which are further applied in visual→attribute decoder to learn visual-based attribute features. By further introducing feature-level and prediction-level semantical collaborative losses, the two attribute-guided transformers teach each other to learn semantic-augmented visual embeddings via semantical collaborative learning. 
Finally, the semantic-augmented visual embeddings learned by AVT and VAT are fused to conduct desirable visual-semantic interaction cooperated with semantic vectors for ZSL classification. Extensive experiments show that TransZero++ achieves the new state-of-the-art results on three golden and challenging ZSL benchmarks. The codes are available at: https://github.com/shiming-chen/TransZero_pp.

“…However, these methods still usually yield relatively undesirable results, since they cannot efficiently capture the subtle differences between seen and unseen classes. More relevant to this work are the recent attention-based ZSL methods (Xie et al. 2019, 2020; Zhu et al. 2019; Xu et al. 2020; Liu et al. 2021) that utilize attribute descriptions as guidance to discover the more discriminative region (or part) features. Unfortunately, they simply learn region embeddings (e.g., the whole bird body) neglecting the importance of discriminative attribute localization (e.g., the distinctive bird body parts).…”

Section: Related Work (mentioning)

confidence: 99%

TransZero: Attribute-guided Transformer for Zero-Shot Learning

Chen, Hong, Liu et al. 2021

Preprint

Self Cite

Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is learned from attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant visual-semantic interaction. Although some attention-based models have attempted to learn such region features in a single image, the transferability and discriminative attribute localization of visual features are typically neglected. In this paper, we propose an attribute-guided Transformer network, termed TransZero, to refine visual features and learn attribute localization for discriminative visual embedding representations in ZSL. Specifically, TransZero takes a feature augmentation encoder to alleviate the cross-dataset bias between ImageNet and ZSL benchmarks, and improves the transferability of visual features by reducing the entangled relative geometry relationships among region features. To learn locality-augmented visual features, TransZero employs a visual-semantic decoder to localize the image regions most relevant to each attribute in a given image, under the guidance of semantic attribute information. Then, the locality-augmented visual features and semantic vectors are used to conduct effective visual-semantic interaction in a visual-semantic embedding network. Extensive experiments show that TransZero achieves the new state of the art on three ZSL benchmarks. The codes are available at: https://github.com/shiming-chen/TransZero.

“…Papers [9][10][11] are representatives of the second group. Their models use the class definition vectors as a fixed classification layer, and add modules that help the network localize and implicitly detect attributes in the visual space.…”

Section: Related Work (mentioning)

confidence: 99%

“…Their models use the class definition vectors as a fixed classification layer, and add modules that help the network localize and implicitly detect attributes in the visual space. We focus on and extend [9] in our Method section, since it does not require extra knowledge such as human gaze points as leveraged by [11], or too many added loss terms to fine-tune as proposed by [10].…”

Section: Related Work (mentioning)

confidence: 99%

“…Such approaches suffer from the shortcomings of whichever GANs and VAEs are used, e.g., mode collapse, and the difficulty in training and converging. The second group includes discriminative approaches [9][10][11], which try to learn a compatibility function that measures the compatibility of a sample's embedding with a class definition vector (typically an inner product or a nearest-neighbor search is used to estimate compatibility), and classify an input sample as the class with the highest compatibility score. The objective in these frameworks is to learn a good mapping from the visual space (where image features reside) to the semantic space (where class definitions reside).…”

Section: Introduction (mentioning)

confidence: 99%
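
The compatibility-based classification described in the statement above can be illustrated with a minimal sketch. It assumes the inner-product variant of the compatibility function; the class names, attribute vectors, and embedding values are hypothetical and are not taken from any of the cited papers, and the sketch is not any author's implementation.

```python
from typing import Dict, Sequence


def compatibility(embedding: Sequence[float], class_vector: Sequence[float]) -> float:
    """Inner-product compatibility between a sample embedding and a class definition vector."""
    if len(embedding) != len(class_vector):
        raise ValueError("embedding and class vector must have the same length")
    return sum(e * c for e, c in zip(embedding, class_vector))


def classify(embedding: Sequence[float], class_vectors: Dict[str, Sequence[float]]) -> str:
    """Return the class whose definition vector has the highest compatibility score."""
    scores = {name: compatibility(embedding, vec) for name, vec in class_vectors.items()}
    return max(scores, key=scores.get)


# Hypothetical class definition vectors in a 3-attribute semantic space.
class_vectors = {
    "sparrow": [1.0, 0.0, 1.0],
    "cardinal": [0.0, 1.0, 1.0],
}

# Hypothetical embedding of one image, already mapped into the semantic space.
embedding = [0.1, 0.9, 0.8]

predicted = classify(embedding, class_vectors)
print(predicted)
```

Because the class definition vectors act as fixed weights, a class never seen during training can in principle be scored by adding its vector to `class_vectors`, provided the learned mapping into the semantic space generalizes.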

Using Fictitious Class Representations to Boost Discriminative Zero-Shot Learners

Dabbah, El‐Yaniv 2021

Preprint

Focusing on discriminative zero-shot learning, in this work we introduce a novel mechanism that dynamically augments during training the set of seen classes to produce additional fictitious classes. These fictitious classes diminish the model's tendency to fixate during training on attribute correlations that appear in the training set but will not appear in newly exposed classes. The proposed model is tested within the two formulations of the zero-shot learning framework; namely, generalized zero-shot learning (GZSL) and classical zero-shot learning (CZSL). Our model improves the state-of-the-art performance on the CUB dataset and reaches comparable results on the other common datasets, AWA2 and SUN. We investigate the strengths and weaknesses of our method, including the effects of catastrophic forgetting when training an end-to-end zero-shot model.
