2021
DOI: 10.48550/arxiv.2111.10017
Preprint

Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints

Abstract: A vision transformer (ViT) is the dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, there remains the problem of finding better transformer networks. For example, conventional transformer-based models usually use a projection layer for each query (Q), key (K), and value (V) embedding before multi-head self-attention. Insufficient consideration of semantic Q, K, and V embedding may lead to a performance drop. In this paper, we …
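For context, the projection scheme the abstract refers to is the standard one in ViT-style attention: each token embedding is passed through a separate linear layer to produce Q, K, and V before multi-head self-attention. Below is a minimal PyTorch sketch of that conventional baseline (not the paper's proposed embedding; the class and parameter names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Conventional ViT-style multi-head self-attention with a separate
    linear projection for each of Q, K, and V (the design the paper revisits)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # One projection layer per role: query, key, value.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape  # (batch, tokens, embedding dim)
        # Embed tokens into Q, K, V and split across heads.
        q = self.q_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention over the token axis.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)
```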

Cited by 2 publications (2 citation statements). References 27 publications.
“…This technique is very effective in terms of computational complexity and latency in the real world [21]. Figure 2 illustrates the architecture in its smallest configuration.…”
Section: Architecture (mentioning)
confidence: 99%
“…This dataset was collected using the same methods as CIFAR-10. CIFAR-100 classes are mutually exclusive of CIFAR-10 classes; CIFAR-10 and CIFAR-100 are subsets of the 80 million annotated tiny images dataset [21]. The CIFAR-10 and CIFAR-100 datasets consist of 50,000 training and 10,000 test images of 32×32 resolution, with 10 and 100 classes in total, respectively [22], [21].…”
Section: Dataset Application of Swin Transformer (mentioning)
confidence: 99%
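As a quick sanity check on the dataset statistics quoted above, here is a small torchvision sketch; it assumes torchvision is installed, and the "data" root path is illustrative:

```python
from torchvision import datasets

# Download CIFAR-10 and CIFAR-100 and confirm the quoted split sizes.
for cls, name in [(datasets.CIFAR10, "CIFAR-10"), (datasets.CIFAR100, "CIFAR-100")]:
    train = cls(root="data", train=True, download=True)
    test = cls(root="data", train=False, download=True)
    # Each sample is a 32x32 RGB PIL image paired with an integer class label.
    print(f"{name}: {len(train)} train / {len(test)} test, "
          f"{len(train.classes)} classes, image size {train[0][0].size}")
# Expected output: 50000 train / 10000 test; 10 and 100 classes; size (32, 32)
```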