Learning discriminative image feature embeddings is of great importance to visual recognition. To achieve better feature embeddings, most current methods focus on designing different network structures or loss functions, and the estimated feature embeddings usually depend only on the input images. In this paper, we propose Memory-based Neighbourhood Embedding (MNE) to enhance a general CNN feature by considering its neighbourhood. The method aims to solve two critical problems: how to acquire more relevant neighbours during network training, and how to aggregate the neighbourhood information into a more discriminative embedding. We first augment the network with an episodic memory module, which provides more relevant neighbours for both training and testing. The neighbours are then organized in a tree graph with the target instance as the root node. The neighbourhood information is gradually aggregated to the root node in a bottom-up manner, and the aggregation weights are supervised by the class relationships between the nodes. We apply MNE to image search and few-shot learning tasks. Extensive ablation studies demonstrate the effectiveness of each component, and our method significantly outperforms state-of-the-art approaches.
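The bottom-up aggregation described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the aggregation weights are computed, so the sketch assumes a softmax over dot-product similarities between a node and its candidates, and the names `aggregate`, `softmax`, and the dictionary-based tree layout are hypothetical. In MNE the weights are learned and supervised by class relationships, which is not modelled here.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(node):
    """Aggregate neighbourhood features into `node`, bottom-up.

    `node` is a dict with a "feat" list and an optional "children" list
    (hypothetical layout). Leaves return their own feature unchanged.
    """
    children = node.get("children", [])
    if not children:
        return list(node["feat"])

    # Aggregate each subtree first, so information flows from leaves to root.
    child_feats = [aggregate(child) for child in children]

    # The node itself is kept as a candidate alongside its children.
    candidates = [list(node["feat"])] + child_feats

    # Assumption: weights come from similarity to the node's own feature.
    weights = softmax([dot(node["feat"], f) for f in candidates])

    dim = len(node["feat"])
    return [sum(w * f[i] for w, f in zip(weights, candidates)) for i in range(dim)]


# Target instance as root, with one similar and one dissimilar neighbour.
root = {
    "feat": [1.0, 0.0],
    "children": [{"feat": [1.0, 0.0]}, {"feat": [0.0, 1.0]}],
}
embedding = aggregate(root)
```

In this toy tree the dissimilar neighbour receives a smaller weight, so the aggregated embedding stays close to the root's original feature.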
Video-based vehicle detection has received considerable attention over the last ten years, and many deep-learning-based detection methods can be applied to it. However, these methods are designed for still images, and applying them directly to video vehicle detection generally yields poor performance. In this work, we propose a new single-stage video-based vehicle detector that integrates a 3DConvNet with focal loss, called 3D-DETNet. Drawing on 3D convolution and focal loss, our method can capture motion information and is better suited to detecting vehicles in video than other single-stage methods designed for static images. Multiple video frames are first fed to 3D-DETNet to generate multiple spatial feature maps; the 3DConvNet sub-model then takes these spatial feature maps as input and captures temporal information, which is fed to a final fully convolutional model that predicts the locations of vehicles in the video frames. We evaluate our method on the UA-DETRAC vehicle detection dataset, where 3D-DETNet achieves the best performance and a higher detection speed (26 fps) than the competing methods.
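For readers unfamiliar with focal loss, a minimal sketch of the standard binary form may help. It follows the usual definition, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The abstract does not state which hyperparameters 3D-DETNet uses, so the defaults `gamma=2.0` and `alpha=0.25` below are assumptions taken from the original focal loss formulation, and the function name `focal_loss` is illustrative.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, in [0, 1].
    y: ground-truth label, 1 (positive) or 0 (negative).
    gamma, alpha: assumed defaults, not values reported for 3D-DETNet.
    """
    # p_t is the probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha

    # Clamp to avoid log(0) for fully wrong, fully confident predictions.
    p_t = min(max(p_t, 1e-12), 1.0)

    # The (1 - p_t)^gamma factor down-weights easy, well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: small loss
hard = focal_loss(0.1, 1)  # confident and wrong: large loss
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is why focal loss is often described as a generalization of it for class-imbalanced single-stage detectors.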