Object detection in remote sensing images captured by satellites or aircraft has important economic and military significance and is full of challenges. This task requires not only accurate and efficient algorithms but also high-performance, low-power hardware architectures. However, existing deep-learning-based object detection algorithms require further optimization in small-object detection, computational complexity, and parameter size. Meanwhile, general-purpose processors cannot achieve good power efficiency, and previous deep learning processor designs still leave parallelism unexploited. To address these issues, we propose an efficient context-based feature fusion single shot multibox detector (CBFF-SSD) framework, which uses lightweight MobileNet as the backbone network to reduce parameters and computational complexity, and adds feature fusion units and detection feature maps to enhance the recognition of small objects and improve detection accuracy. Based on analysis and optimization of the computation in each layer of the algorithm, we propose an efficient deep learning processor architecture with multiple neural processing units (NPUs) composed of 2-D processing elements (PEs), which can compute multiple output feature maps simultaneously. The parallel architecture, hierarchical on-chip storage organization, and local registers enable parallel processing as well as sharing and reuse of data, making the processor's computation more efficient. Extensive experiments and comprehensive evaluations on the public NWPU VHR-10 dataset, together with comparisons against state-of-the-art approaches, demonstrate the effectiveness and superiority of the proposed framework. Moreover, to evaluate the performance of the proposed hardware architecture, we implement it on a Xilinx XC7Z100 field programmable gate array (FPGA) and test it on the proposed CBFF-SSD and VGG16 models. Experimental results show that our processor is more power-efficient than general-purpose central processing units (CPUs) and graphics processing units (GPUs), and has better performance density than other state-of-the-art FPGA-based designs.
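As a rough illustration of the feature-fusion idea described above, the sketch below upsamples a deeper, coarser feature map and concatenates it with a shallower one along the channel axis. The shapes, the nearest-neighbor upsampling, and the concatenation rule are our assumptions for illustration, not the exact CBFF-SSD fusion unit.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fuse(shallow, deep):
    """Hypothetical feature-fusion unit: upsample the deeper map and
    concatenate it with the shallower one along the channel axis."""
    return np.concatenate([shallow, upsample2x(deep)], axis=0)

shallow = np.random.rand(64, 38, 38)   # fine-resolution map (keeps small objects)
deep    = np.random.rand(128, 19, 19)  # coarse but semantically richer map
fused = fuse(shallow, deep)
print(fused.shape)  # (192, 38, 38)
```

The fused map keeps the shallow layer's spatial resolution while injecting the deeper layer's context, which is the general mechanism such fusion units use to help detect small objects.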
The expansion and improvement of synthetic aperture radar (SAR) technology have greatly enhanced its practicality. SAR imaging requires real-time processing of large input images under limited power consumption. Designing an application-specific heterogeneous array processor is an effective way to meet the power constraints and real-time processing requirements of an application system. In this paper, taking a commonly used algorithm for SAR imaging, the chirp scaling algorithm (CSA), as an example, the characteristics of each calculation stage in the SAR imaging process are analyzed, and the data flow model of SAR imaging is extracted. A heterogeneous array architecture for SAR imaging that effectively supports fast Fourier transform/inverse fast Fourier transform (FFT/IFFT) and phase compensation operations is proposed. First, a heterogeneous array architecture consisting of fixed-point PE units for the FFT/IFFT operations and floating-point FPE units for the phase compensation operations is proposed, increasing energy efficiency by 50% compared with an architecture using only floating-point units. Second, data cross-placement and simultaneous access strategies are proposed to support intra-block parallel processing of SAR block imaging, achieving up to 115.2 GOPS throughput. Third, a resource management strategy for the heterogeneous computing array is designed, which supports pipelined processing of the FFT/IFFT and phase compensation operations, improving PE utilization by a factor of 1.82 and increasing energy efficiency by a factor of 1.5. Implemented in 65-nm technology, the processor achieves an energy efficiency of up to 254 GOPS/W in experiments. The imaging fidelity and accuracy of the proposed processor were verified by evaluating the image quality of an actual scene.
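The CSA-style data flow described above, FFT/IFFT stages interleaved with element-wise phase compensation, can be sketched as follows. This is a simplified data-flow illustration of the two operation classes the architecture accelerates, not a full chirp scaling implementation; the phase arrays are placeholders.

```python
import numpy as np

def phase_compensate(data, phase):
    """Element-wise multiplication by a compensation phase factor
    (the operation mapped to the floating-point FPE units)."""
    return data * np.exp(1j * phase)

def csa_stage(block, phase_rg, phase_az):
    """Simplified chirp-scaling-style processing of one image block:
    range FFT -> phase compensation -> azimuth FFT -> phase
    compensation -> azimuth IFFT -> range IFFT.
    FFT/IFFT steps correspond to the fixed-point PE units."""
    s = np.fft.fft(block, axis=1)       # range FFT
    s = phase_compensate(s, phase_rg)   # range-domain compensation
    s = np.fft.fft(s, axis=0)           # azimuth FFT
    s = phase_compensate(s, phase_az)   # azimuth-domain compensation
    s = np.fft.ifft(s, axis=0)          # azimuth IFFT
    return np.fft.ifft(s, axis=1)       # range IFFT
```

Because the FFT butterflies and the element-wise phase multiplies are independent operation types, they can be mapped to separate fixed-point and floating-point units and pipelined, which is the parallelism the resource management strategy exploits.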
Properties similar to the memory and learning functions of biological systems have been observed and reported in experimental studies of memristors fabricated from different materials. These properties include the forgetting effect, the transition from short-term memory (STM) to long-term memory (LTM), learning-experience behavior, etc. A mathematical model of this kind of memristor is very important for its theoretical analysis and application design. In our analysis of an existing memristor model with these properties, we find that some behaviors of the model are inconsistent with the reported experimental observations. We therefore propose a phenomenological model for this kind of memristor. The model design is based on the forgetting effect and the STM-to-LTM transition, since these are two typical properties of such memristors. Further analysis shows that the model can also be used directly, or modified, to describe other experimentally observed behaviors. Simulations show that the proposed model gives a better description of the reported memory and learning behaviors of this kind of memristor than the existing model.
The transition from short-term memory (STM) to long-term memory (LTM) has been observed and reported in experimental studies of memristors fabricated from different materials; in this paper, this kind of memristor is called an STM→LTM memristor. Some of these experimental studies also report the learning-experience behavior observed in the "learning-forgetting-relearning" experiment: when the memristor is restimulated by pulses after forgetting the STM, its memory quickly returns to the highest state reached before the forgetting period, and the memory recovery during the relearning period is clearly faster than the memory formation in the first learning process. In this paper, the behavior of the existing STM→LTM memristor model in the "learning-forgetting-relearning" experiment is further discussed. If <i>w</i><sub>max</sub>, the upper bound of the memory level, is a constant with a value of 1, the model exhibits no learning-experience behavior but instead shows a faster-relearning behavior in the "learning-forgetting-relearning" experiment: the relearning process is faster because the memory forgetting during the pulse-to-pulse intervals of the relearning process is slower than that in the first learning process. In the STM→LTM memristor model with learning-experience behavior, <i>w</i><sub>max</sub> is redesigned as a state variable in [0,1] whose value is influenced by the applied voltage. Memory formation in the first learning process is relatively slow because <i>w</i><sub>max</sub> limits the memory formation speed while a pulse is applied. After the forgetting process, the limitation of <i>w</i><sub>max</sub> on pulse-induced memory formation is less pronounced, so the memory of the device increases at a faster speed during the memory recovery of the relearning process. In this case, the forgetting speed still becomes slower after each applied pulse.
If the pulse-induced increase of <i>w</i><sub>max</sub> is so fast that <i>w</i><sub>max</sub> quickly reaches its upper bound after only a few pulses in the first learning process, the learning-experience behavior becomes similar to the faster-relearning behavior observed when <i>w</i><sub>max</sub> = 1. In most experimental research papers on the STM→LTM memristor, the change of the memristance is explained by the formation and annihilation of a conductive channel between the two electrodes of the memristor. During a certain period of time, the only ions (or vacancies) available to form the conductive channel are those around the channel, which indicates that there should be an upper bound on the size of the conductive channel within this time period. The region in which ions (or vacancies) can be used to form the conductive channel is called the surrounding area of the conductive channel. In the model, <i>w</i><sub>max</sub> can be understood as the size of the conductive channel's surrounding area, and it describes the upper bound of the width of the conductive channel.
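A minimal simulation of the kind of model described above, with <i>w</i><sub>max</sub> as a voltage-driven state variable that caps the memory level, might look like the following. All parameter values and update rules here are illustrative assumptions, not the paper's actual equations.

```python
import numpy as np

def simulate(pulses, dt=0.01, tau=1.0, alpha=0.5, beta=0.2):
    """Toy STM->LTM memristor sketch (hypothetical update rules).

    pulses : sequence of applied voltages; v > 0 stimulates, v == 0 rests.
    w      : memory level in [0, 1].
    w_max  : dynamic upper bound of w ("surrounding area" of the channel),
             grown by the applied voltage.
    """
    w, w_max = 0.0, 0.3
    history = []
    for v in pulses:
        if v > 0:
            # The surrounding area grows with stimulation ...
            w_max = min(1.0, w_max + beta * v * dt)
            # ... and memory formation is capped by it.
            w = min(w_max, w + alpha * v * dt)
        else:
            # Forgetting: exponential-style decay during rest intervals.
            w -= (w / tau) * dt
        history.append(w)
    return np.array(history)

# Learning (50 pulses), forgetting (50 rest steps), relearning (50 pulses).
h = simulate([1.0] * 50 + [0.0] * 50 + [1.0] * 50)
```

In this sketch, relearning starts from a <i>w</i><sub>max</sub> already enlarged by the first learning phase, so the cap constrains memory formation less than it did initially, which is the qualitative mechanism behind the learning-experience behavior.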
With the development of deep learning technologies and edge computing, their combination can make artificial intelligence ubiquitous. Due to the constrained computation resources of edge devices, research on on-device deep learning focuses not only on model accuracy but also on model efficiency, for example, inference latency. There have been many attempts to optimize existing deep learning models so that they can be deployed on edge devices and meet specific application requirements while maintaining high accuracy. Such work requires not only professional knowledge but also many experiments, which limits the customization of neural networks for varied devices and application scenarios. To reduce human intervention in designing and optimizing neural network structures, multi-objective neural architecture search methods have been proposed that automatically search for neural networks featuring high accuracy while satisfying certain hardware performance requirements. However, current methods commonly treat accuracy and inference latency only as performance indicators during the search and sample numerous network structures to obtain the required neural network. Without using the search objectives to regulate the search direction, a large number of useless networks are generated during the search, which greatly reduces search efficiency. Therefore, in this paper, an efficient resource-aware search method is proposed. First, a network inference consumption profiling model is established for any specific device; it directly provides the resource consumption of each operation in a network structure and the inference latency of the entire sampled network. Next, on the basis of Bayesian search, a resource-aware Pareto Bayesian search is proposed, in which accuracy and inference latency are set as constraints to regulate the search direction. With a clearer search direction, the overall search efficiency is improved. Furthermore, a cell-based structure and lightweight operations are applied to optimize the search space and further enhance search efficiency. Experimental results demonstrate that with our method, the inference latency of the searched network structure is reduced by 94.71% without sacrificing accuracy, while the search efficiency is increased by 18.18%.
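The resource-aware Pareto search idea can be sketched as follows: a per-operation profiling model gives each candidate's latency, candidates violating the latency constraint are rejected before evaluation, and only the accuracy/latency Pareto front is kept. The operation set, cost table, and accuracy estimator below are hypothetical stand-ins, not the paper's actual profiling model or predictor.

```python
import random

# Hypothetical per-operation latency costs from a device profiling model.
LATENCY_COST = {"conv3x3": 3.0, "conv1x1": 1.0, "dwconv": 0.8, "skip": 0.1}
# Stand-in accuracy contributions (a trained predictor in practice).
ACC_GAIN = {"conv3x3": 2.0, "conv1x1": 1.0, "dwconv": 1.2, "skip": 0.2}

def latency(arch):
    """Profiling model: total latency is the sum of per-operation costs."""
    return sum(LATENCY_COST[op] for op in arch)

def accuracy(arch):
    """Toy accuracy estimator for a sampled architecture."""
    return sum(ACC_GAIN[op] for op in arch) / (10 * len(arch))

def pareto_search(n_samples=200, n_ops=8, max_latency=12.0, seed=0):
    """Keep only latency-feasible, non-dominated (accuracy, latency) points."""
    rng = random.Random(seed)
    front = []  # list of (acc, lat, arch)
    for _ in range(n_samples):
        arch = [rng.choice(list(LATENCY_COST)) for _ in range(n_ops)]
        lat = latency(arch)
        if lat > max_latency:      # constraint regulates the search direction
            continue               # rejected without accuracy evaluation
        acc = accuracy(arch)
        # Skip if some kept point dominates this one; otherwise add it
        # and drop every kept point it dominates.
        if not any(a >= acc and l <= lat for a, l, _ in front):
            front = [(a, l, x) for a, l, x in front
                     if not (acc >= a and lat <= l)]
            front.append((acc, lat, arch))
    return front
```

Rejecting infeasible candidates before accuracy evaluation is what saves the search budget: only structures that can meet the latency constraint consume evaluation effort, so fewer useless networks are explored.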