Vision Transformer and Deep Sequence Learning for Human Activity Recognition in Surveillance Videos

Hussain, Altaf; Hussain, Tanveer; Ullah, Waseem; Baik, Sung Wook

doi:10.1155/2022/3454167

Cited by 41 publications

(29 citation statements)

References 43 publications


“…Video classification [94] is an interesting domain with main focus on effective temporal contents representation that significantly contributes to precise video label prediction. Although the early approaches in video classification are based on simple CNNs [39], but the recent methods employ various temporal [8] and spatio-temporal [34] strategies for video classification. Video classification is further divided into several major domains such as activity recognition, anomaly detection and recognition, and violence detection and recognition.…”

Section: Deep Learning For VD (mentioning)

confidence: 99%

An overview of violence detection techniques: current challenges and future directions

et al. 2022


The Big Video Data generated in today's smart cities has raised concerns from its purposeful usage perspective, where surveillance cameras, among many others are the most prominent resources to contribute to the huge volumes of data, making its automated analysis a difficult task in terms of computation and preciseness. Violence Detection (VD), broadly plunging under Action and Activity recognition domain, is used to analyze Big Video data for anomalous actions incurred due to humans. The VD literature is traditionally based on manually engineered features, though advancements to deep learning based standalone models are developed for


“…The captioning system developed by [36], [37], [38], [39], [40], [41], [42] demonstrated the employment of visual, local, global, adaptive, spatial, temporal, and channel attention for coherent and diverse caption generation. [44], [45], [46], [47] long term dependency handling is not an issue anymore for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes.…”

Section: Encoder-decoder (ED) Based Approaches (mentioning)

confidence: 99%

DeepRide: Dashcam Video Description Dataset for Autonomous Vehicle Location-Aware Trip Description

Rafiq

et al. 2022

IEEE Access


Video description is one of the most challenging tasks in the combined domain of computer vision and natural language processing. Captions for various open and constrained domain videos have been generated in the recent past but descriptions for driving dashcam videos have never been explored to the best of our knowledge. With the aim to explore dashcam video description generation for autonomous driving, this study presents DeepRide: a large-scale dashcam driving video description dataset for location-aware dense video description generation. The human-described dataset comprises visual scenes and actions with diverse weather, people, objects, and geographical paradigms. It bridges the autonomous driving domain with video description by textual description generation of the visual information as seen by a dashcam. We describe 16,000 videos (40 seconds each) in English employing 2,700 man-hours by two highly qualified teams with domain knowledge. The descriptions consist of eight to ten sentences covering each dashcam video's global features and event features in 60 to 90 words. The dataset consists of more than 130K sentences, totaling approximately one million words. We evaluate the dataset by employing a location-aware vision-language recurrent transformer framework to elaborate on the efficacy and significance of the visio-linguistics research for autonomous vehicles. We provided baseline results to evaluate the dataset by employing three existing state-of-the-art recurrent models. The memory augmented transformer performed best due to its highly summarized memory state for visual information and the sentence history while generating the trip description. Our proposed dataset opens a new dimension of diverse and exciting applications, such as self-driving vehicle reporting, driver and vehicle safety, inter-vehicle road intelligence sharing, and travel occurrence reports.
INDEX TERMS: dashcam video description, video captioning, autonomous trip description
Comprehending the localized events of a video appropriately and then transforming the attained visual understand-


“…The task of nonlinear mapping and feature extraction is extremely challenging; therefore, the best way to tackle these challenges is to employ deep learning models with the ability to extract the discriminative features end-to-end [29,30]. In recent years, the application of deep learning models has significantly improved for image classification [31,32], video classification [33][34][35][36][37], and power forecasting in TS data [38][39][40][41][42]. For instance, Khan et al. [43] proposed a hybrid model for electricity forecasting in residential and commercial buildings.…”

Section: Introduction (mentioning)

confidence: 99%

A Hybrid Deep Learning‐Based Network for Photovoltaic Power Forecasting

et al. 2022

Self Cite


For efficient energy distribution, microgrids (MG) provide significant assistance to main grids and act as a bridge between the power generation and consumption. Renewable energy generation resources, particularly photovoltaics (PVs), are considered as a clean source of energy but are highly complex, volatile, and intermittent in nature making their forecasting challenging. Thus, a reliable, optimized, and a robust forecasting method deployed at MG objectifies these challenges by providing accurate renewable energy production forecasting and establishing a precise power generation and consumption matching at MG. Furthermore, it ensures effective planning, operation, and acquisition from the main grid in the case of superior or inferior amounts of energy, respectively. Therefore, in this work, we develop an end-to-end hybrid network for automatic PV power forecasting, comprising three basic steps. Firstly, data preprocessing is performed to normalize, remove the outliers, and deal with the missing values prominently. Next, the temporal features are extracted using deep sequential modelling schemes, followed by the extraction of spatial features via convolutional neural networks. These features are then fed to fully connected layers for optimal PV power forecasting. In the third step, the proposed model is evaluated on publicly available PV power generation datasets, where its performance reveals lower error rates when compared to state-of-the-art methods.
