Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

Luo, Chunjie; Zhan, Jianfeng; Xue, Xiaohe; Wang, Lei; Ren, Rui; Yang, Qiang

doi:10.1007/978-3-030-01418-6_38

Cited by 139 publications

(80 citation statements)

References 3 publications


“…Otherwise, a smaller weight is assigned. Here, we use the cosine similarity metric [27] to measure the similarity between the warped features and the features extracted from the reference frame. Moreover, we do not directly use the convolutional features obtained from N_feat(I).…”

Section: Model Design (mentioning)

confidence: 99%

Flow-Guided Feature Aggregation for Video Object Detection

Zhu

Wang

Dai

et al. 2017

2017 IEEE International Conference on Computer Vision (ICCV)

576

614


Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection. It leverages temporal coherence on feature level instead. It improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy. Our method significantly improves upon strong single-frame baselines in ImageNet VID [33], especially for more challenging fast moving objects. Our framework is principled, and on par with the best engineered systems winning the ImageNet VID challenges 2016, without additional bells-and-whistles. The proposed method, together with Deep Feature Flow [49], powered the winning entry of ImageNet VID challenges 2017 (http://image-net.org/challenges/LSVRC/2017/results). The code is available at https://github.com/msracver/Flow-Guided-Feature-Aggregation. (This work was done when Xizhou Zhu and Yujie Wang were interns at Microsoft Research Asia.)
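
The citation statement for this paper describes weighting warped features by their cosine similarity to the reference-frame features. A minimal NumPy sketch of that idea follows. It is not the authors' implementation: the function names, the array shapes, and the choice to normalize the weights with a softmax over frames are assumptions made for illustration, and the separate embedding step hinted at in the statement is omitted.

```python
import numpy as np

def cosine_weights(warped, reference, eps=1e-8):
    """Per-position weights for K warped feature maps.

    warped:    (K, C, H, W) features of nearby frames, already warped to the reference
    reference: (C, H, W) features of the reference frame
    returns:   (K, H, W) weights that sum to 1 over the K frames (assumed softmax)
    """
    dot = np.sum(warped * reference[None], axis=1)                     # (K, H, W)
    norms = np.linalg.norm(warped, axis=1) * np.linalg.norm(reference, axis=0)[None]
    cos = dot / (norms + eps)                                          # in [-1, 1]
    exp = np.exp(cos - cos.max(axis=0, keepdims=True))                 # stable softmax
    return exp / exp.sum(axis=0, keepdims=True)

def aggregate(warped, reference):
    """Weighted sum of the warped features at every spatial position."""
    weights = cosine_weights(warped, reference)
    return np.sum(weights[:, None] * warped, axis=0)                   # (C, H, W)

# Hypothetical data: one frame identical to the reference, one opposite, one random.
rng = np.random.default_rng(0)
reference = rng.normal(size=(4, 3, 3))
warped = np.stack([reference, -reference, rng.normal(size=(4, 3, 3))])
weights = cosine_weights(warped, reference)
aggregated = aggregate(warped, reference)
```

In this toy input the frame that matches the reference receives the largest weight at every position and the opposite frame the smallest, which is the behaviour the statement describes ("Otherwise, a smaller weight is assigned").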


“…In our model, as a result of normalizing embeddings and columns of the weight matrix, the magnitude differences do not affect the prediction as long as the angle between the normalized vectors remains the same, since the inner product w_i^T φ(x) ∈ [−1, 1] now measures cosine similarity. Recent work in cosine normalization [9] discusses a similar idea of replacing the inner product with a cosine similarity for bounded activations and stable training, while we arrive at this design from a different direction. In particular, this establishes a symmetric relationship between normalized embeddings and weights, which enables us to treat them interchangeably.…”

Section: Model Architecture (mentioning)

confidence: 99%

Low-Shot Learning with Imprinted Weights

Brown

Lowe

2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

499

413


Human vision is able to immediately recognize novel visual categories after seeing just one or a few training examples. We describe how to add a similar capability to ConvNet classifiers by directly setting the final layer weights from novel training examples during low-shot learning. We call this process weight imprinting as it directly sets weights for a new category based on an appropriately scaled copy of the embedding layer activations for that training example. The imprinting process provides a valuable complement to training with stochastic gradient descent, as it provides immediate good classification performance and an initialization for any further fine-tuning in the future. We show how this imprinting process is related to proxy-based embeddings. However, it differs in that only a single imprinted weight vector is learned for each novel category, rather than relying on a nearest-neighbor distance to training instances as typically used with embedding methods. Our experiments show that using averaging of imprinted weights provides better generalization than using nearest-neighbor instance embeddings.
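
The statement and abstract above describe a classifier whose scores are cosine similarities between normalized embeddings and normalized weight columns, and a new class whose weight is set directly from embeddings. The sketch below illustrates that under stated assumptions: the helper names (`l2_normalize`, `cosine_scores`, `imprint`), the shapes, and the use of a re-normalized mean as the imprinted weight are illustrative choices, not the paper's code, and any scaling factor applied before the softmax is left out.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_scores(embeddings, weights):
    """embeddings: (N, D); weights: (D, num_classes); returns (N, num_classes) in [-1, 1]."""
    return l2_normalize(embeddings, axis=1) @ l2_normalize(weights, axis=0)

def imprint(weights, novel_embeddings):
    """Append a column for a new class: the re-normalized mean of the
    normalized embeddings of its few training examples."""
    proto = l2_normalize(l2_normalize(novel_embeddings, axis=1).mean(axis=0), axis=0)
    return np.concatenate([l2_normalize(weights, axis=0), proto[:, None]], axis=1)

# Hypothetical data: 3 base classes, 5 examples of a novel class in 8 dimensions.
rng = np.random.default_rng(0)
base_weights = rng.normal(size=(8, 3))
direction = rng.normal(size=8)
novel = direction + 0.01 * rng.normal(size=(5, 8))

weights = imprint(base_weights, novel)
scores = cosine_scores(novel, weights)
predicted = scores.argmax(axis=1)   # index 3 is the imprinted class
```

Because both sides are normalized, rescaling the embeddings leaves the scores unchanged, and an embedding can be dropped in as a weight column without further adjustment, which is the symmetry the statement points to.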


“…Traditional multi-layer neural networks use dot product between the output vector of previous layer and the incoming weight vector as the input to activation function. [23,11] recently showed that replacing the dot product with cosine similarity can bound and reduce the variance of the neurons and thus result in models of better generalization. Considering that we are trying to calculate the correlation between data from two dramatically different domains, especially for the attribute domain in which the features are discontinuous and have high variances.…”

Section: Zero-shot Learning (mentioning)

confidence: 99%

“…We speculate the reason is that values of class attribute are not continuous such that there are large variance among the attribute vectors of different classes. Consequently, classifier weights derived from them also possess large variance, which might cause high variances of inputs to the Softmax activation function [23]. Unlike dot product, our cosine similarity based score function normalizes the classifier weights before calculating its dot product with visual embeddings.…”

Section: Ablation Studies (mentioning)

confidence: 99%

Rethinking Zero-Shot Learning: A Conditional Visual Classification Perspective

Min

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

116


Zero-shot learning (ZSL) aims to recognize instances of unseen classes solely based on the semantic descriptions of the classes. Existing algorithms usually formulate it as a semantic-visual correspondence problem, by learning mappings from one feature space to the other. Despite being reasonable, previous approaches essentially discard the highly precious discriminative power of visual features in an implicit way, and thus produce undesirable results. We instead reformulate ZSL as a conditioned visual classification problem, i.e., classifying visual features based on the classifiers learned from the semantic descriptions. With this reformulation, we develop algorithms targeting various ZSL settings: For the conventional setting, we propose to train a deep neural network that directly generates visual feature classifiers from the semantic attributes with an episode-based training scheme; For the generalized setting, we concatenate the learned highly discriminative classifiers for seen classes and the generated classifiers for unseen classes to classify visual features of all classes; For the transductive setting, we exploit unlabeled data to effectively calibrate the classifier generator using a novel learning-without-forgetting self-training mechanism and guide the process by a robust generalized cross-entropy loss. Extensive experiments show that our proposed algorithms significantly outperform state-of-the-art methods by large margins on most benchmark datasets in all the ZSL settings. Our code is available at https://github.com/kailigo/cvcZSL
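
The two statements quoted for this paper argue that classifier weights with large variance produce high-variance inputs to the softmax, and that a cosine-based score avoids this. The toy comparison below is only a sketch of that argument with made-up data: the embeddings, the per-class scales, and the variable names are all hypothetical, and both the weights and the embeddings are normalized here, which may differ in detail from the paper's score function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical visual embeddings: 256 samples, 16 dimensions.
x = rng.normal(size=(256, 16))

# Hypothetical classifier weights whose norms differ widely between classes.
scales = rng.uniform(0.1, 50.0, size=(1, 10))
W = rng.normal(size=(16, 10)) * scales

# Dot-product scores grow with the norm of each weight column.
dot_scores = x @ W

# Cosine scores divide out both norms, so every entry lies in [-1, 1].
cos_scores = dot_scores / (
    np.linalg.norm(x, axis=1, keepdims=True)
    * np.linalg.norm(W, axis=0, keepdims=True)
)

print("variance of dot-product scores:", dot_scores.var())
print("variance of cosine scores:     ", cos_scores.var())
```

The cosine scores cannot have variance above 1 because each score is bounded in [-1, 1], whereas the dot-product variance scales with the squared weight norms.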
