2021
DOI: 10.1109/access.2021.3109102

TransAnomaly: Video Anomaly Detection Using Video Vision Transformer

Abstract: Video anomaly detection is challenging because abnormal events are unbounded, rare, equivocal, and irregular in real scenes. In recent years, transformers have demonstrated powerful modelling abilities for sequence data. Thus, we attempt to apply transformers to video anomaly detection. In this paper, we propose a prediction-based video anomaly detection approach named TransAnomaly. Our model combines the U-Net and the Video Vision Transformer (ViViT) to capture richer temporal information and more global contexts…
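The abstract describes a prediction-based approach: a model predicts the next frame, and frames the model predicts poorly are flagged as anomalous. The paper's exact scoring is not reproduced here, but prediction-based methods of this family commonly score frames by PSNR between the predicted and actual frame, min-max normalised over the video. A minimal sketch of that generic scoring (function names are illustrative, not from the paper):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and a ground-truth frame."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def anomaly_scores(psnr_values):
    """Min-max normalise per-frame PSNR over a video clip.

    Low PSNR (poorly predicted frame) maps to a score near 1 (likely anomaly);
    high PSNR maps to a score near 0 (likely normal).
    """
    p = np.asarray(psnr_values, dtype=float)
    normalised = (p - p.min()) / (p.max() - p.min() + 1e-8)
    return 1.0 - normalised
```

A frame sequence with PSNR values `[30.0, 10.0, 30.0]`, for example, yields a score near 1 for the middle frame, singling it out as anomalous.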

Cited by 46 publications (25 citation statements)
References 22 publications
“…Hence, one may assume that applying ViT to the VAD task is not appropriate, as VAD datasets are small compared to those for other tasks such as image classification or object detection. Several approaches applying Transformers to anomaly detection [3], [14] are likewise based only on Transformers with convolutional layers, not ViT. However, our approach proves that ViT can be successfully trained to detect anomalies in video even without a huge amount of data.…”
Section: B. Vision Transformer
confidence: 83%
“…As shown in TABLE IV, we compare our model with other previous VAD approaches, which can be classified into three main categories: reconstruction-based, prediction-based, and hybrid methods. First, reconstruction-based methods include Conv-AE [4], 3D-Conv [35], MemAE [1], and MNAD-R [6], while MNAD-P [6], AMMC-Net [8], Frame-Pred [5], VEC [7], C2-D2GAN [3], and TransAnomaly [14] are prediction-based methods, and HF2-VAD [2] is the hybrid method. We observe that our model achieves better results than other state-of-the-art methods on Ped2, except for HF2-VAD [2].…”
Section: F. Results
confidence: 99%
“…Yuan et al. in [119] proposed TransAnomaly, a video ViT and U-Net-based framework for detecting anomalies in videos. They used three datasets: Ped1, Ped2, and Avenue.…”
Section: ViTs for Anomaly Detection
confidence: 99%
“…Another hybrid model for SC classification was proposed by Sharma et al. [48], who fused the features of a cascaded ensemble of CNNs and a handcrafted-features-based DL model and achieved state-of-the-art performance. There is no doubt that vision transformers play an important role in several challenging vision-based applications, such as fire detection [49], [50], anomaly detection [51], and medical image classification [52], [53]. It is well documented in the recent literature that multiclass SC classification is not an easy task because of the large amount of similarity in the dermoscopic images.…”
Section: Introduction
confidence: 99%