“…We feed this last feature map into Atrous Spatial Pyramid Pooling (ASPP), implemented as a combination of depthwise and pointwise convolution layers to reduce the computational complexity. To control the receptive fields of the depthwise convolution operations, we use different dilation rates (2, 6, 12, 18, 24) in the filters, which helps us extract semantic features at multiple scales. This dilation technique in depthwise convolution filters, also called atrous separable convolution, is a powerful tool that is further described in the semantic high-level feature operations.…”
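The atrous separable convolution described in the quote above can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the function names and shapes are hypothetical, and a real network would use a framework's optimized convolution (e.g. a grouped `Conv2d` with a `dilation` argument) rather than explicit loops.

```python
import numpy as np

def depthwise_dilated_conv(x, kernels, rate):
    """Depthwise 2-D convolution with a dilation (atrous) rate.

    x       : (H, W, C) input feature map
    kernels : (k, k, C) one k x k filter per channel (no channel mixing)
    rate    : dilation rate; the effective kernel span is (k-1)*rate + 1
    """
    k = kernels.shape[0]
    span = (k - 1) * rate + 1                 # dilated receptive field
    H, W, C = x.shape
    out = np.zeros((H - span + 1, W - span + 1, C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # sample the input on a dilated grid, channel by channel
            patch = x[i:i + span:rate, j:j + span:rate, :]   # (k, k, C)
            out[i, j, :] = np.sum(patch * kernels, axis=(0, 1))
    return out

def pointwise_conv(x, weights):
    """1x1 convolution mixing channels: weights has shape (C_in, C_out)."""
    return x @ weights
```

Increasing `rate` widens the receptive field without adding parameters, which is why running the same depthwise filters at several rates (2, 6, 12, 18, 24) yields multi-scale features; the pointwise step then fuses the per-channel responses.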
“…However, the CAMHID model only detected simple camera movements and could not identify complex camera movements. Therefore, Sandula et al. [12] proposed a CNN-based camera motion classification model that classifies 11 camera movement direction patterns using HSI (hue, saturation, intensity) color features. This method achieved an accuracy of 98.37% but did not capture object motion descriptors.…”
Semantic video scene-understanding applications rely on object-camera motion recognition techniques for scene contextual movement representation. While existing machine learning-based methods perform efficiently, their primary limitation is that they analyze motion patterns from normal frames only, neglecting scene transition frames. This causes significant false alarms due to undetected object-camera motion patterns during scene transitions. In this paper, we propose a novel method for recognizing the object and camera motion of two consecutive scenes from their transition frames. First, our method detects cut transitions using principal component analysis (PCA) to segment the video into shots. It also eliminates large text transitions, which are often falsely detected as cut transitions, using structural similarity index measure (SSIM) properties. Second, it selects candidate segments to localize normal and wipe transition frames using slope angle characteristics obtained from linear regression. Third, it extracts dense semantic spatial features at multiple scales using a modified DeepLabv3+ network to segment the selected candidate frames into foreground, background, and wipe pixels. Finally, an optical flow-based temporal trajectory tracking model is applied to each segmented pixel to recognize object, camera pan, zoom-in, and zoom-out motion patterns. We further remove falsely detected non-transition motion frames to improve wipe transition detection. Experimental results are obtained on the benchmark TRECVID and multimedia datasets. Using pixel-level classification and temporal trajectory analysis, the proposed method achieved average accuracy improvements of 9.28% for object-camera motion recognition, 3.75% for cut transition detection, and 3.01% for wipe transition detection.
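The PCA-based cut detection mentioned in the abstract can be sketched as follows. This is a hypothetical illustration under simple assumptions (grayscale frames, a mean-plus-three-sigma threshold); the paper's actual formulation is not reproduced here, only the general idea of projecting frames onto principal components and flagging large jumps between consecutive projections.

```python
import numpy as np

def detect_cuts(frames, n_components=2, thresh=None):
    """Sketch of PCA-based cut transition detection (illustrative only).

    frames : (T, H, W) grayscale video as a float array.
    Projects each flattened frame onto the top principal components and
    flags a cut wherever consecutive projections are unusually far apart.
    """
    T = frames.shape[0]
    X = frames.reshape(T, -1).astype(float)
    X -= X.mean(axis=0)                        # center the frame matrix
    # top principal directions via SVD of the centered frame matrix
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    proj = X @ Vt[:n_components].T             # (T, n_components) scores
    dist = np.linalg.norm(np.diff(proj, axis=0), axis=1)
    if thresh is None:                         # simple adaptive threshold
        thresh = dist.mean() + 3 * dist.std()
    return np.flatnonzero(dist > thresh) + 1   # index of frame after a cut
```

Within a shot, consecutive projections stay close, so the inter-frame distance spikes only at abrupt content changes; gradual effects such as wipes produce smaller, sustained distances, which is why the method handles them separately.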
Anonymous proxies are used by criminals for illegal network activities, such as data theft and cyber attacks, because of the anonymity they provide. Anonymous proxy traffic detection is therefore essential for network security. In recent years, detection based on deep learning has become a hot research topic, since deep learning can automatically extract and select traffic features. To adapt heterogeneous network traffic to the homogeneous input expected by typical deep learning algorithms, a major branch of existing studies converts network traffic into images for detection. However, such studies are commonly limited by the large image representations of network traffic, which incur very large storage and computational overhead. To address this limitation, a novel method for anonymous proxy traffic detection is proposed that reduces storage and computational resource overhead. Specifically, it converts the sequences of sizes and inter-arrival times of the first N packets of a flow into images, and then categorizes the converted images using a one-dimensional convolutional neural network. Both proprietary and public datasets are used to validate the proposed method. The experimental results show that the images produced by the method are at least 90% smaller than those of existing image-based deep learning methods. With substantially smaller image sizes, the method still achieves F1 scores of up to 98.51% in Shadowsocks traffic detection and 99.8% in VPN traffic detection.
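The flow-to-image conversion described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact encoding: the function name, the 2-row layout, the 1500-byte MTU cap, and the 1-second inter-arrival cap are all assumptions made here for demonstration; the paper's own scaling and image shape may differ.

```python
import numpy as np

def flow_to_image(sizes, iats, N=16):
    """Encode the first N packets of a flow as a tiny 2 x N grayscale
    image (row 0: packet sizes, row 1: inter-arrival times), zero-padded
    when the flow has fewer than N packets. Values are clipped and scaled
    to the 0-255 pixel range. Layout and caps are illustrative assumptions.
    """
    img = np.zeros((2, N), dtype=np.uint8)
    n = min(len(sizes), N)
    # packet sizes: clip at an assumed 1500-byte MTU, scale to 0-255
    img[0, :n] = np.clip(np.asarray(sizes[:n]) / 1500.0, 0, 1) * 255
    # inter-arrival times: clip at an assumed 1-second cap, scale to 0-255
    img[1, :n] = np.clip(np.asarray(iats[:n]) / 1.0, 0, 1) * 255
    return img
```

A 2 x N uint8 image for, say, N = 16 occupies 32 bytes per flow, which illustrates why this representation is far smaller than rendering whole packet payloads as images; the rows can then be fed to a one-dimensional CNN as two input channels.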
“…Finally, a time-series feature vector was used to train a support vector machine (SVM) for classification. Sandula et al. (2021) constructed a new camera motion classification framework based on the hue-saturation-intensity (HSI) model for compressed block motion vectors. The framework decodes the inter-frame block motion vectors from the compressed stream, estimates their magnitude and direction, and assigns the motion vector direction to hue and the motion vector magnitude to saturation under a fixed intensity.…”
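The direction-to-hue and magnitude-to-saturation mapping described in the quote can be sketched as below. This is an illustrative reading of the scheme, not the cited implementation: the normalization constant `max_mag` and the fixed intensity value are assumptions introduced here.

```python
import numpy as np

def motion_vectors_to_hsi(mv_x, mv_y, intensity=0.5, max_mag=16.0):
    """Sketch of the HSI encoding of block motion vectors:
    direction -> hue, magnitude -> saturation, intensity held fixed.

    mv_x, mv_y : arrays of motion-vector components (one per block)
    max_mag    : assumed magnitude cap used to normalize saturation
    """
    mv_x = np.asarray(mv_x, dtype=float)
    mv_y = np.asarray(mv_y, dtype=float)
    angle = np.arctan2(mv_y, mv_x)                 # direction in radians
    hue = (np.degrees(angle) + 360.0) % 360.0      # map to [0, 360) degrees
    mag = np.hypot(mv_x, mv_y)
    sat = np.clip(mag / max_mag, 0.0, 1.0)         # magnitude -> [0, 1]
    i = np.full_like(hue, intensity)               # fixed intensity plane
    return hue, sat, i
```

Encoding direction as hue makes each dominant camera movement (pan left, pan right, zoom) appear as a characteristic color pattern in the resulting HSI image, which is what the downstream CNN classifies.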
Section: Introduction
“…After that, a CNN was used for supervised learning to identify 11 camera motion modes, comprising seven pure camera motion modes and four hybrid camera modes. The results showed that the recognition accuracy of this method for the 11 camera modes reached over 98% (Sandula et al., 2021). Rajesh and Muralidhara (2021) designed a new driving-based reconstruction loss and used an implicit multivariate Markov random field regularization method to enhance local details.…”
Sports videos are proliferating on the internet as people's material life is enriched and their pursuit of spiritual life grows. Thus, automatically identifying and detecting useful information in videos has arisen as a relatively novel research direction. Accordingly, the present work proposes a Human Pose Estimation (HPE) model to automatically classify sports videos and detect hot spots in videos, addressing the deficiencies of traditional algorithms. First, Deep Learning (DL) is introduced. Then, large numbers of human motion features are extracted by a Region Proposal Network (RPN). Next, an HPE model is implemented based on a Deep Convolutional Neural Network (DCNN). Finally, the HPE model is applied to motion recognition and video classification in sports videos. The research findings corroborate that an effective and accurate HPE model can be implemented using the DCNN to recognize and classify videos effectively. Meanwhile, Big Data Technology (BDT) is applied to count the play counts of various sports videos. The results indicate that the DCNN-based HPE model can effectively and accurately classify sports videos and thereby provide a basis for subsequent statistics on various sports videos by BDT. Finally, a new outlook is proposed for applying new technology in the entertainment industry.