2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw54120.2021.00314
ViT-YOLO: Transformer-Based YOLO for Object Detection

Cited by 143 publications (61 citation statements); references 21 publications.
“…Second, the FGAM will be optimized to further improve its performance and robustness. State-of-the-art neural networks, such as ViT based YOLO [103], will be analyzed and compared with both the current attention model and the fine-grained localization model. We will actively search for detection approaches that are capable of addressing the jittering issues.…”
Section: F. Future Research Directions
Confidence: 99%
“…etc., which surpassed CNN-based ResNet and showed excellent performance in downstream tasks such as classification [44], segmentation [47], and object detection [48]. Although there is a trend toward a grand unification of transformers across NLP and vision, the development of transformers for point clouds has been slow.…”
Section: Related Work
Confidence: 99%
“…Vision transformers have shown tremendous performance in sequence-based problems, particularly in image recognition and detection tasks [22,98]. Similarly, TimeSformer was introduced for precise video classification tasks such as action and activity recognition and video understanding [6].…”
Section: Vision Transformers in VD Domain
Confidence: 99%