Abstract: The low resolution of objects of interest in aerial images makes pedestrian detection and action detection extremely challenging tasks. Furthermore, using deep convolutional neural networks to process large images can be demanding in terms of computational requirements. To alleviate these challenges, we propose a two-step, yes-or-no question answering framework for finding specific individuals performing one or more specific actions in aerial images. First, a deep object detector, the Single Shot Multibox Detector (SSD), …
“…We propose a framework in which, first, an SSD [3], which has shown promising performance in the aerial image object detection literature [2] and [17], generates a number of object-of-interest proposals for an input aerial image. These proposals might contain vehicles, background, or other objects.…”
Section: Proposed Methods
“…In order to alleviate the challenge of objects occupying a small number of pixels, we split the problem into two sub-problems [2]. We first assume that a deep detector like the Single Shot Multibox Detector (SSD) [3] extracts objects or areas of interest, and second, we use a deep convolutional network to recognize which of the extracted objects of interest are also the vehicles we wish to detect.…”
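To make the quoted two-stage design concrete, here is a minimal sketch of such a pipeline, assuming a NumPy-style H×W×C image. The callables `ssd_detect` and `vehicle_classifier` are hypothetical stand-ins for the trained SSD and the second-stage CNN, not interfaces from the paper.

```python
def detect_target_vehicles(aerial_image, ssd_detect, vehicle_classifier):
    """Return the proposal boxes that the second-stage CNN accepts as vehicles.

    aerial_image: NumPy-style array of shape (H, W, C).
    ssd_detect: stage-1 detector returning candidate (x1, y1, x2, y2) boxes.
    vehicle_classifier: stage-2 CNN giving a yes/no decision per crop.
    """
    # Stage 1: the SSD proposes regions of interest; these may contain
    # vehicles, background, or other objects.
    proposals = ssd_detect(aerial_image)

    accepted = []
    for (x1, y1, x2, y2) in proposals:
        crop = aerial_image[y1:y2, x1:x2]
        # Stage 2: a deep CNN decides whether this crop is one of the
        # vehicles we wish to detect.
        if vehicle_classifier(crop):
            accepted.append((x1, y1, x2, y2))
    return accepted
```

Because the second network only ever sees small crops, it can be applied at full resolution to each proposal rather than to the entire large image, which is how the split addresses both the small-object and the computational-cost problems.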
This paper investigates the problem of aerial vehicle recognition using a text-guided deep convolutional neural network classifier. The network receives an aerial image and a desired class, and produces a yes or no output by matching the image against the textual description of the desired class. We train and test our model on a synthetic aerial dataset, and our desired classes consist of combinations of the vehicles' types and colors. This strategy helps when more classes are considered in testing than in training.
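As a rough illustration of such a text-guided yes/no classifier, the PyTorch sketch below matches an image crop against an embedded textual class description (e.g. "red car"). The layer sizes, the mean-pooled word embeddings, and the concatenation-based fusion are assumptions made for this sketch; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class TextGuidedClassifier(nn.Module):
    """Illustrative yes/no matcher between an image crop and a class text."""

    def __init__(self, vocab_size, embed_dim=64, feat_dim=128):
        super().__init__()
        # Image branch: a small CNN over RGB crops.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Text branch: average the word embeddings of the class
        # description, e.g. ["red", "car"] -> one vector.
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.text_proj = nn.Linear(embed_dim, feat_dim)
        # Fusion: concatenate both features and emit one yes/no logit.
        self.head = nn.Linear(2 * feat_dim, 1)

    def forward(self, image, text_ids):
        img_feat = self.cnn(image)                       # (B, feat_dim)
        txt_feat = self.text_proj(self.embed(text_ids))  # (B, feat_dim)
        logit = self.head(torch.cat([img_feat, txt_feat], dim=1))
        return torch.sigmoid(logit)  # probability that image matches text
```

Because the class description is an input rather than a fixed output layer, an unseen type/color combination at test time needs no new parameters, which is what allows more classes in testing than in training.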
“…Their work is limited to single-label HAR, since their detection algorithm, i.e., the Single Shot Multibox Detector (SSD) [18], cannot handle multiple labels. In [12], the authors use a VGG neural network to extract visual features from objects of interest. They subsequently concatenate these features with a bag-of-words representation by using the Visual Question Answering technique [19].…”
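The fusion step described in this excerpt can be sketched as follows: visual features from a CNN (e.g. the 4096-d fc7 output of VGG-16) are concatenated with a bag-of-words vector of the text and fed to a small answer classifier. All dimensions and layer sizes here are illustrative assumptions, not values from [12] or [19].

```python
import torch
import torch.nn as nn

VIS_DIM, VOCAB_SIZE, N_ANSWERS = 4096, 1000, 2  # e.g. yes/no answers

# Small classifier over the concatenated visual + text representation.
fusion_head = nn.Sequential(
    nn.Linear(VIS_DIM + VOCAB_SIZE, 512), nn.ReLU(),
    nn.Linear(512, N_ANSWERS),
)

def answer(visual_feat, bow_vec):
    """visual_feat: (B, 4096) CNN features; bow_vec: (B, 1000) word counts."""
    fused = torch.cat([visual_feat, bow_vec], dim=1)
    return fusion_head(fused)  # logits over the answer set
```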
Section: Related Work
“…The 3D-Conv layer outputs twelve 3D feature maps, C = {C^(1), C^(2), …, C^(12)}, one for each fixed 3D filter. Each feature map in C has L 2D feature maps of spatial dimensions W × H, where L is the number of frames of the input action tube, as defined before.…”
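A quick shape check of this description, assuming single-channel input frames, a 3×3×3 kernel, and "same" padding so that L, W, and H are preserved; the fixed filter weights themselves are not reproduced here.

```python
import torch
import torch.nn as nn

L, H, W = 16, 64, 64               # frames and spatial size of the action tube
tube = torch.randn(1, 1, L, H, W)  # (batch, channels, depth, height, width)

conv3d = nn.Conv3d(in_channels=1, out_channels=12,
                   kernel_size=3, padding=1, bias=False)
conv3d.weight.requires_grad_(False)  # "fixed" filters: not trained

C = conv3d(tube)
print(C.shape)  # torch.Size([1, 12, 16, 64, 64]) -> twelve maps of L x H x W
```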
“…However, working with drone videos is also a challenging task due to multiple changes in the pose and size of objects, occlusions, and camera motion. The recent introduction of the Okutama-Action dataset [9] has facilitated the development of solutions for HAR using drone videos [12, 13]. This dataset has succeeded in integrating videos depicting real-world, aerial-view scenes of multiple human actions.…”