Detecting vehicles in aerial imagery plays an important role in a wide range of applications. Current vehicle detection methods are mostly based on sliding-window search with handcrafted or shallow-learning features, which have limited descriptive power and heavy computational costs. Recently, owing to their powerful feature representations, region-based convolutional neural network (CNN) detectors, especially Faster R-CNN, have achieved state-of-the-art performance in computer vision. However, applying Faster R-CNN directly to vehicle detection in aerial images has two limitations: (1) the region proposal network (RPN) performs poorly at accurately locating small vehicles because its feature maps are relatively coarse; and (2) the classifier after the RPN cannot distinguish vehicles from complex backgrounds well. In this study, we propose an improved detection method based on Faster R-CNN to address these two challenges. First, to improve recall, we employ a hyper region proposal network (HRPN) that extracts vehicle-like targets from a combination of hierarchical feature maps. Then, we replace the classifier after the RPN with a cascade of boosted classifiers that verifies the candidate regions, reducing false detections through negative example mining. We evaluate our method on the Munich vehicle dataset and a collected vehicle dataset, showing improvements in accuracy and robustness over existing methods.
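The negative example mining step behind the cascade of boosted classifiers can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the 3:1 negative-to-positive ratio, and the scored-region interface are all assumptions.

```python
def mine_hard_negatives(regions, labels, scores, neg_pos_ratio=3):
    """One mining round for a cascade stage: keep all positive regions,
    plus the highest-scoring false positives ("hard" negatives), capped
    at a fixed negative:positive ratio (3:1 here, an assumed default)."""
    positives = [r for r, y in zip(regions, labels) if y == 1]
    # Negatives that the current stage scores highly are the hard ones.
    negatives = [(s, r) for r, y, s in zip(regions, labels, scores) if y == 0]
    negatives.sort(key=lambda t: t[0], reverse=True)
    keep = max(1, neg_pos_ratio * len(positives))
    return positives, [r for _, r in negatives[:keep]]
```

The retained hard negatives would then be used to train the next boosted stage, so each stage focuses on the backgrounds the previous one confused with vehicles.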
Vehicle detection with orientation estimation in aerial images has received widespread interest because of its importance for intelligent traffic management. The task is challenging, not only because of complex backgrounds and the relatively small size of the targets, but also because vehicles in aerial images captured from the top view appear at arbitrary orientations. Existing methods for oriented vehicle detection require several post-processing steps to generate final detection results with orientation, which is inefficient, and they can only produce discrete orientation estimates for each target. In this paper, we present an end-to-end single convolutional neural network that generates arbitrarily oriented detection results directly. Our approach, named Oriented_SSD (based on the Single Shot MultiBox Detector, SSD), uses a set of default boxes with various scales at each feature-map location to produce detection bounding boxes. Offsets are predicted for each default box to better match the object shape, and these offsets include an angle parameter for generating oriented bounding boxes. Evaluation on the public DLR Vehicle Aerial dataset and the Vehicle Detection in Aerial Imagery (VEDAI) dataset demonstrates that our method detects both the location and the orientation of vehicles with high accuracy and at high speed. For test images in the DLR Vehicle Aerial dataset with a size of 5616 × 3744, our method achieves 76.1% average precision (AP) and 78.7% correct direction classification in 5.17 s on an NVIDIA GTX-1060.
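Decoding an oriented box from a default box plus predicted offsets can be sketched as below. The parameterisation (SSD-style centre/size offsets plus a directly predicted angle in radians) is an illustrative assumption; the paper's exact encoding may differ.

```python
import math

def decode_oriented_box(default_box, offsets):
    """Decode predicted offsets, including an angle term, against an
    axis-aligned default box into the four corners of an oriented box.
    default_box: (cx, cy, w, h); offsets: (dx, dy, dw, dh, dtheta)."""
    cx, cy, w, h = default_box
    dx, dy, dw, dh, dtheta = offsets
    # Standard SSD-style decoding of centre and size.
    gx = cx + dx * w
    gy = cy + dy * h
    gw = w * math.exp(dw)
    gh = h * math.exp(dh)
    theta = dtheta  # assumed: angle predicted directly, in radians
    # Rotate the rectangle's corners around its centre (gx, gy).
    corners = []
    for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        px, py = sx * gw / 2, sy * gh / 2
        rx = gx + px * math.cos(theta) - py * math.sin(theta)
        ry = gy + px * math.sin(theta) + py * math.cos(theta)
        corners.append((rx, ry))
    return corners
```

With all offsets zero, the decoded box reduces to the axis-aligned default box, which is a useful sanity check for any such decoder.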
As two different tools for earth observation, optical and synthetic aperture radar (SAR) images can provide complementary information about the same land types for better land cover classification. However, because of the different imaging mechanisms of optical and SAR images, how to efficiently exploit this complementary information is an interesting and challenging problem. In this article, we propose a novel multimodal bilinear fusion network (MBFNet) that fuses optical and SAR features for land cover classification. The MBFNet consists of three components: the feature extractor, the second-order attention-based channel selection module (SACSM), and the bilinear fusion module. First, to keep the network parameters from being biased toward the dominant modality, a pseudo-siamese convolutional neural network (CNN) is used as the feature extractor to extract deep semantic feature maps from the optical and SAR images, respectively. Then, the SACSM is embedded into each stream, and fine channel-attention maps with second-order statistics are obtained by bilinearly integrating the global average-pooling and global max-pooling information. The SACSM not only automatically highlights the important channels of the feature maps to improve the representation power of the network, but also uses its channel selection mechanism to reconfigure compact feature maps with better discrimination. Finally, bilinear pooling is used as the feature-level fusion method: it establishes a second-order association between the two compact feature maps of the optical and SAR streams to obtain low-dimensional bilinear fusion features for land cover classification. Experimental results on three coregistered optical and SAR datasets demonstrate that our method achieves more effective land cover classification performance than state-of-the-art methods.
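The feature-level fusion step can be illustrated with generic bilinear pooling over two modality feature vectors: an outer product flattened into one vector, followed by the customary signed square root and L2 normalisation. This is a minimal sketch of the general technique, not MBFNet's exact fusion head.

```python
import math

def bilinear_fusion(opt_feat, sar_feat):
    """Generic bilinear pooling of an optical and a SAR feature vector:
    outer product -> flatten -> signed sqrt -> L2 normalise. Captures
    second-order (pairwise multiplicative) interactions between the
    two modalities' channels."""
    # Outer product, flattened: every optical channel times every SAR channel.
    fused = [o * s for o in opt_feat for s in sar_feat]
    # Signed square root, a common stabilisation for bilinear features.
    fused = [math.copysign(math.sqrt(abs(v)), v) for v in fused]
    # L2 normalisation.
    norm = math.sqrt(sum(v * v for v in fused)) or 1.0
    return [v / norm for v in fused]
```

Note that the raw bilinear vector has length len(opt_feat) × len(sar_feat), which is why compact feature maps (as produced by the channel selection step) matter before fusion.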
Index Terms: Attention mechanism, bilinear pooling model, convolutional neural network (CNN), feature fusion, land cover classification, multimodal learning.