“…weakly supervised models mostly use single-domain video data. Recent works reported that the single domain data is not sufficient for complex scene understanding where we have complex backgrounds and a high number of object interactions [9,10,11]. Next, recent VAD models first extract video features using I3D/C3D networks [12,13].…”
Section: Swinbert (mentioning)
confidence: 99%
“…Next, recent VAD models first extract video features using I3D/C3D networks [12,13]. In the feature extraction process, all previous work relies on fixed-scale frame segmentation, where video snippet bags are created at fixed frame intervals [9,14,3]. The problem with a fixed frame rate is that all anomalous events are not the same in the temporal dimension; hence, as illustrated in Figure 2, the short anomalous events are not accurately captured with a long-term fixed segmentation rate.…”
Section: Swinbert (mentioning)
confidence: 99%
“…An accurate fusion process is essential in order to aggregate rich semantic information. Finally, in the last few years, magnitude-based feature learning [3,9] has been widely used for learning normal and abnormal scene features. However, the idea of calculating a single value to represent normality and abnormality is not always accurate [15,1].…”
Section: Swinbert (mentioning)
confidence: 99%
“…To address the issues mentioned above, we propose multimodal video anomaly detection (MMVAD) (See Figure 3). Inspired by the work [9], we use text captions generated from video snippet bags. Since text features are semantically rich, we use text features as the second domain.…”
Section: Swinbert (mentioning)
confidence: 99%
“…The normal frame creates noise/ambiguity in the learning process. To fix this issue, several of the latest works use magnitude-based feature learning [3,9,18]. Although magnitude-based feature learning is not always accurate, the high magnitude value from the feature can be due to the high number of objects or intense object interaction in the scene [15,1].…”