Security surveillance is critical to social harmony and people's peaceful life. It has a great impact on strengthening social stability and safeguarding life. Detecting anomalies in video surveillance in a timely, effective and efficient manner remains challenging. This paper proposes a new approach, called S²-VAE, for anomaly detection from video data. The S²-VAE consists of two proposed neural networks: a Stacked Fully Connected Variational AutoEncoder (SF-VAE) and a Skip Convolutional VAE (SC-VAE). The SF-VAE is a shallow generative network that obtains a Gaussian-mixture-like model to fit the distribution of the actual data. The SC-VAE, as the key component of S²-VAE, is a deep generative network that takes advantage of CNNs, VAEs and skip connections. Both SF-VAE and SC-VAE are efficient and effective generative networks, and they achieve better performance in detecting both local and global abnormal events. The proposed S²-VAE is evaluated on four public datasets. The experimental results show that the S²-VAE outperforms the state-of-the-art algorithms. The code will be made publicly available at https://github.com/tianwangbuaa/.
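To make the VAE-based anomaly scoring concrete, here is a minimal numpy sketch of the two ingredients the abstract names: the VAE reparameterization trick and a decoder skip connection in the style of SC-VAE. All weights, shapes, and function names are illustrative assumptions, not the authors' implementation; a real model would learn the weights and operate on video frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def encoder(x, w_mu, w_logvar):
    """Toy fully connected encoder producing latent Gaussian parameters."""
    return x @ w_mu, x @ w_logvar

def decoder_with_skip(z, x_skip, w_dec):
    """Toy decoder; the skip connection adds encoder-side features to the
    reconstruction, mimicking the skip connections in SC-VAE."""
    return z @ w_dec + x_skip

# Tiny demo: a 4-dim "frame feature" with a 2-dim latent space.
x = rng.standard_normal((1, 4))
w_mu = rng.standard_normal((4, 2))
w_logvar = rng.standard_normal((4, 2))
w_dec = rng.standard_normal((2, 4))

mu, log_var = encoder(x, w_mu, w_logvar)
z = reparameterize(mu, log_var)
x_hat = decoder_with_skip(z, x, w_dec)

# Anomaly score: reconstruction error; frames the trained model
# reconstructs poorly are flagged as abnormal events.
score = float(np.mean((x - x_hat) ** 2))
```

The scoring rule at the end reflects the standard reconstruction-based detection criterion used by generative anomaly detectors: normal data lies on the learned distribution and reconstructs well, while anomalies produce large errors.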
In this paper, we present a model pretraining technique, named MaskOCR, for text recognition. Our text recognition architecture is an encoder-decoder transformer: the encoder extracts patch-level representations, and the decoder recognizes the text from those representations. Our approach pretrains both the encoder and the decoder in a sequential manner. (i) We pretrain the encoder in a self-supervised manner over a large set of unlabeled real text images. We adopt the masked image modeling approach, which has shown its effectiveness for general images, expecting the representations to take on semantics. (ii) We pretrain the decoder over a large set of synthesized text images in a supervised manner, and enhance the language modeling capability of the decoder by randomly masking some character-occupied text image patches input to the encoder and, accordingly, the corresponding representations input to the decoder. Experiments show that the proposed MaskOCR approach achieves superior results on the benchmark datasets, including Chinese and English text images.
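The patch-masking step that both pretraining stages rely on can be sketched as follows. This is a minimal illustration of masked image modeling on a patch sequence, assuming a flattened-patch representation; the function name, mask ratio, and zero-fill are assumptions for demonstration (a real model typically substitutes a learned [MASK] embedding rather than zeros).

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.5):
    """Randomly mask a fraction of image patches (masked-image-modeling style).

    Returns the masked patch sequence and the boolean mask indicating
    which positions were hidden from the encoder.
    """
    n = patches.shape[0]
    n_mask = int(n * mask_ratio)
    idx = rng.permutation(n)[:n_mask]
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    masked = patches.copy()
    masked[mask] = 0.0  # stand-in for a learned [MASK] embedding
    return masked, mask

# A text-line image split into 8 patches of 16 features each.
patches = rng.standard_normal((8, 16))
masked, mask = mask_patches(patches, mask_ratio=0.5)
```

In stage (i) the pretraining objective would reconstruct the hidden patches from the visible ones; in stage (ii) the same masking forces the decoder to predict characters whose image evidence is missing, which is what strengthens its language modeling capability.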