2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw54120.2021.00314
ViT-YOLO: Transformer-Based YOLO for Object Detection

Cited by 143 publications (61 citation statements); references 21 publications.
“…Second, the FGAM will be optimized to further improve its performance and robustness. State-of-the-art neural networks, such as ViT based YOLO [103], will be analyzed and compared with both the current attention model and the fine-grained localization model. We will actively search for detection approaches that are capable of addressing the jittering issues.…”
Section: F. Future Research Directions
Confidence: 99%
“…etc., which surpassed CNN-based ResNet and showed excellent performance in downstream tasks such as classification [44], segmentation [47], and object detection [48]. Although there is a trend toward a grand unification of transformers across NLP and vision, the development of transformers for point clouds has been slow.…”
Section: Related Work
Confidence: 99%
“…Vision transformers have shown tremendous performance in sequence-based problems, particularly in image recognition and detection tasks [22,98]. Similarly, TimeSformer was introduced for precise video classification tasks such as action and activity recognition and video understanding [6].…”
Section: Vision Transformers in VD Domain
Confidence: 99%