We propose a novel end-to-end trainable, deep, encoder-decoder architecture for single-pass semantic segmentation. Our approach is based on a cascaded architecture with feature-level long-range skip connections. The encoder incorporates the structure of ResNeXt's residual building blocks and adopts the strategy of repeating a building block that aggregates a set of transformations with the same topology. The decoder features a novel architecture consisting of blocks that (i) capture context information, (ii) generate semantic features, and (iii) enable fusion between different output resolutions. Crucially, we introduce dense decoder shortcut connections that allow decoder blocks to use semantic feature maps from all previous decoder levels, i.e., from all higher-level feature maps. These dense decoder connections enable effective information propagation from one decoder block to another, as well as multi-level feature fusion that significantly improves accuracy. Importantly, these connections allow our method to achieve state-of-the-art performance on several challenging datasets, without the time-consuming multi-scale averaging required by previous works.
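The dense decoder shortcut idea can be illustrated with a toy sketch. This is not the authors' implementation: the real decoder blocks are convolutional and operate on spatial feature maps at several resolutions, whereas here a single random linear projection with a ReLU (the `decoder_block` helper, an assumption for illustration) stands in for each block. The key structural point is preserved: every block consumes the concatenation of *all* earlier decoder outputs, not just its immediate predecessor.

```python
import numpy as np

def decoder_block(features, out_dim, rng):
    # Hypothetical stand-in for a decoder block: a random linear
    # projection plus ReLU over the channel-wise concatenation of
    # all incoming feature maps.
    x = np.concatenate(features, axis=-1)
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.01
    return np.maximum(x @ w, 0.0)

def dense_decoder(encoder_feat, num_blocks=3, dim=8, seed=0):
    rng = np.random.default_rng(seed)
    outputs = [encoder_feat]
    for _ in range(num_blocks):
        # Dense shortcut: each block sees the feature maps of ALL
        # previous decoder levels, not only the last one.
        outputs.append(decoder_block(outputs, dim, rng))
    return outputs[-1]

feat = np.ones((4, 4, 16))   # toy "encoder" feature map
out = dense_decoder(feat)
print(out.shape)             # (4, 4, 8)
```

Because every block's input width grows with the number of earlier outputs, information from high-level features reaches every later block directly, which is the property the abstract credits for the accuracy gain.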
In this paper, we focus on the important topic of violence recognition and detection in surveillance videos. Our goal is to determine whether violence occurs in a video (recognition) and when it happens (detection). First, we propose an extension of Improved Fisher Vectors (IFV) for videos, which allows a video to be represented using both local features and their spatio-temporal positions. Then, we study the popular sliding-window approach for violence detection, re-formulating the Improved Fisher Vectors and using the summed area table data structure to speed up the approach. We present an extensive evaluation, comparison, and analysis of the proposed improvements on four state-of-the-art datasets. We show that the proposed improvements make violence recognition more accurate (compared to the standard IFV, IFV with a spatio-temporal grid, and other state-of-the-art methods) and make violence detection significantly faster.
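The summed-area-table speed-up can be sketched in one dimension (time). The sketch below is an assumption-laden toy, not the paper's code: random vectors stand in for the per-frame IFV sufficient statistics. A single prefix-sum pass lets any temporal window's aggregated statistics be read off with one subtraction, so evaluating many overlapping sliding windows no longer costs time proportional to the window length.

```python
import numpy as np

# Toy per-frame statistics; in the paper these would be the per-frame
# IFV sufficient statistics accumulated from local descriptors.
T, D = 100, 16
frame_stats = np.random.default_rng(0).random((T, D))

# 1-D summed area table: prefix sums over time, with a leading zero
# row so every window sum becomes a single subtraction.
sat = np.concatenate([np.zeros((1, D)), np.cumsum(frame_stats, axis=0)])

def window_sum(start, end):
    # Aggregated statistics of frames [start, end) in O(D) time,
    # independent of the window length.
    return sat[end] - sat[start]

naive = frame_stats[10:40].sum(axis=0)   # O(window length)
fast = window_sum(10, 40)                # O(1) lookups
print(np.allclose(naive, fast))          # True
```

In the 2-D image setting the same trick uses four table lookups per rectangle; along a single time axis it reduces to the two-term subtraction above.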
We propose a novel approach for rapid segmentation of flooded buildings by fusing multiresolution, multisensor, and multitemporal satellite imagery in a convolutional neural network. Our model significantly expedites the generation of satellite imagery-based flood maps, crucial for first responders and local authorities in the early stages of flood events. By incorporating multitemporal satellite imagery, our model allows for rapid and accurate post-disaster damage assessment and can be used by governments to better coordinate medium- and long-term financial assistance programs for affected areas. The network consists of multiple streams of encoder-decoder architectures that extract spatiotemporal information from medium-resolution images and spatial information from high-resolution images before fusing the resulting representations into a single medium-resolution segmentation map of flooded buildings. We compare our model to state-of-the-art methods for building footprint segmentation as well as to alternative fusion approaches for the segmentation of flooded buildings and find that our model performs best on both tasks. We also demonstrate that our model produces highly accurate segmentation maps of flooded buildings using only publicly available medium-resolution data instead of significantly more detailed but sparsely available very high-resolution data. We release the first open-source dataset of fully preprocessed and labeled multiresolution, multispectral, and multitemporal satellite images of disaster sites, along with our source code.
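The multi-stream fusion design can be sketched as follows. This is a minimal toy, not the paper's network: each `stream_encode` (a hypothetical helper) replaces a full encoder-decoder stream with average pooling to a common medium resolution plus a random channel projection, and the final classifier is a single linear layer with a sigmoid. The structural idea shown is the one the abstract describes: separate streams per input (two medium-resolution dates, one high-resolution image), concatenated at medium resolution and fused into one segmentation map.

```python
import numpy as np

def stream_encode(img, dim, rng):
    # Hypothetical stand-in for one encoder-decoder stream: pool the
    # input to a common 32x32 medium resolution, then project channels.
    h, w, c = img.shape
    f = h // 32
    pooled = img.reshape(32, f, 32, f, c).mean(axis=(1, 3))
    proj = rng.standard_normal((c, dim)) * 0.1
    return pooled @ proj

rng = np.random.default_rng(0)
medium_t0 = rng.random((32, 32, 4))    # medium-res, pre-event
medium_t1 = rng.random((32, 32, 4))    # medium-res, post-event
high_res  = rng.random((128, 128, 3))  # high-res, single date

# One stream per input; fuse by concatenating per-stream features and
# applying a per-pixel classifier (a toy linear layer here).
feats = [stream_encode(x, 8, rng) for x in (medium_t0, medium_t1, high_res)]
fused = np.concatenate(feats, axis=-1)        # (32, 32, 24)
w_cls = rng.standard_normal((24, 1)) * 0.1
seg = 1 / (1 + np.exp(-(fused @ w_cls)))      # per-pixel flood probability
print(seg.shape)                              # (32, 32, 1)
```

Note how the high-resolution stream is downsampled into the shared medium-resolution grid before fusion, matching the abstract's single medium-resolution output map.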
Body height, weight, as well as the associated composite body mass index (BMI), are human attributes of interest owing to their use in a number of applications, including surveillance, re-identification, image retrieval systems, and healthcare. Previous work on automated estimation of height, weight, and BMI has predominantly focused on 2D and 3D full-body images and videos; little attention has been given to the use of the face for estimating such traits. Motivated by the above, we explore the possibility of estimating height, weight, and BMI from single-shot facial images by proposing a regression method based on the 50-layer ResNet architecture. In addition, we present a novel dataset consisting of 1026 subjects and show results suggesting that facial images contain discriminatory information pertaining to height, weight, and BMI, comparable to that of body images and videos. Finally, we perform a gender-based analysis of the prediction of height, weight, and BMI.
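The regression setup can be sketched in miniature. This is not the paper's model: a random projection of small toy images (the `backbone_features` helper, an assumption) stands in for the ResNet-50 backbone, whose global-average-pooled output would in practice be a 2048-dimensional feature vector. The point of the sketch is the head: the usual classification layer is replaced by a linear layer with three regression outputs, one each for height, weight, and BMI.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the ResNet-50 backbone: a fixed random projection of
# small images to a feature vector (128-d here; 2048-d in ResNet-50).
FEAT_DIM = 128
w_feat = rng.standard_normal((32 * 32 * 3, FEAT_DIM)) * 1e-2

def backbone_features(face_batch):
    return face_batch.reshape(face_batch.shape[0], -1) @ w_feat

# Regression head: three outputs per face -- height, weight, and BMI --
# in place of a softmax classification layer.
w_head = rng.standard_normal((FEAT_DIM, 3)) * 1e-2

faces = rng.random((2, 32, 32, 3))    # two toy "facial images"
preds = backbone_features(faces) @ w_head
print(preds.shape)                    # (2, 3): one triple per face
```

Training such a head would typically minimize a mean-squared-error loss over the three targets, with the backbone fine-tuned end to end.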