TEVAD: Improved video anomaly detection with captions

Chen, Weiling; Ma, Keng Teck; Yew, Zi Jian; Hur, Minhoe; Khoo, David Aik-Aun

doi:10.1109/cvprw59228.2023.00587

Cited by 5 publications

(14 citation statements)

References 52 publications

“…weakly supervised models mostly use single-domain video data. Recent works reported that the single domain data is not sufficient for complex scene understanding where we have complex backgrounds and a high number of object interactions [9,10,11]. Next, recent VAD models first extract video features using I3D/C3D networks [12,13].…”

Section: Swinbert (mentioning)

confidence: 99%

“…Next, recent VAD models first extract video features using I3D/C3D networks [12,13]. In the feature extraction process, all previous work relies on fixed-scale frame segmentation, where video snippet bags are created at fixed frame intervals [9,14,3]. The problem with a fixed frame rate is that all anomalous events are not the same in the temporal dimension; hence, as illustrated in Figure 2, the short anomalous events are not accurately captured with a long-term fixed segmentation rate.…”

Section: Swinbert (mentioning)

confidence: 99%

“…An accurate fusion process is essential in order to aggregate rich semantic information. Finally, in the last few years, magnitude-based feature learning [3,9] has been widely used for learning normal and abnormal scene features. However, the idea of calculating a single value to represent normality and abnormality is not always accurate [15,1].…”

Section: Swinbert (mentioning)

confidence: 99%

“…To address the issues mentioned above, we propose multimodal video anomaly detection (MMVAD) (See Figure 3). Inspired by the work [9], we use text captions generated from video snippet bags. Since text features are semantically rich, we use text features as the second domain.…”

Section: Swinbert (mentioning)

confidence: 99%

“…The normal frame creates noise/ambiguity in the learning process. To fix this issue, several of the latest works use magnitude-based feature learning [3,9,18]. Although magnitude-based feature learning is not always accurate, the high magnitude value from the feature can be due to the high number of objects or intense object interaction in the scene [15,1].…”

Section: Weakly Supervised Anomaly Detection (mentioning)

confidence: 99%