We address the issue of visual saliency from three perspectives. First, we consider saliency detection as a frequency domain analysis problem. Second, we achieve this by employing the concept of nonsaliency. Third, we simultaneously consider the detection of salient regions of different size. The paper proposes a new bottom-up paradigm for detecting visual saliency, characterized by a scale-space analysis of the amplitude spectrum of natural images. We show that the convolution of the image amplitude spectrum with a low-pass Gaussian kernel of an appropriate scale is equivalent to an image saliency detector. The saliency map is obtained by reconstructing the 2D signal using the original phase and the amplitude spectrum, filtered at a scale selected by minimizing saliency map entropy. A Hypercomplex Fourier Transform performs the analysis in the frequency domain. Using available databases, we demonstrate experimentally that the proposed model can predict human fixation data. We also introduce a new image database and use it to show that the saliency detector can highlight both small and large salient regions, as well as inhibit repeated distractors in cluttered images. In addition, we show that it is able to predict salient regions on which people focus their attention.
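The pipeline described in this abstract (smooth the amplitude spectrum, keep the original phase, pick the scale with the lowest saliency-map entropy) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses an ordinary 2D FFT on a grayscale image in place of the Hypercomplex Fourier Transform, and the function names, the candidate `sigmas`, and the 64-bin histogram entropy are all hypothetical choices.

```python
import numpy as np

def _circular_gaussian_smooth(a, sigma):
    """Circularly convolve a 2D array with a normalized Gaussian kernel."""
    h, w = a.shape
    dy = np.fft.fftfreq(h) * h  # signed integer offsets: 0, 1, ..., -1
    dx = np.fft.fftfreq(w) * w
    g = np.exp(-(dy[:, None] ** 2 + dx[None, :] ** 2) / (2.0 * sigma ** 2))
    g /= g.sum()
    # Circular convolution via the convolution theorem.
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(g)))

def _entropy(m, bins=64):
    """Shannon entropy (bits) of the value histogram of a map."""
    hist, _ = np.histogram(m, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def spectral_saliency(image, sigmas=(0.5, 1, 2, 4, 8)):
    """Return (saliency_map, chosen_sigma) for a 2D grayscale image.

    Assumption: a plain FFT stands in for the hypercomplex transform.
    """
    spectrum = np.fft.fft2(np.asarray(image, dtype=float))
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    best = None
    for sigma in sigmas:
        # Low-pass filter the amplitude spectrum at this scale.
        smoothed = _circular_gaussian_smooth(amplitude, sigma)
        # Reconstruct with the filtered amplitude and the original phase.
        recon = np.fft.ifft2(smoothed * np.exp(1j * phase))
        sal = np.abs(recon) ** 2
        sal = sal / (sal.max() + 1e-12)  # normalize to [0, 1]
        e = _entropy(sal)
        # Keep the scale whose saliency map has minimum entropy.
        if best is None or e < best[0]:
            best = (e, sigma, sal)
    return best[2], best[1]
```

Calling `spectral_saliency(img)` on a 2D array returns the selected map and the scale that produced it. A practical version would likely also blur the final map and handle color channels, which this sketch omits.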
The recursive least-squares (RLS) algorithm is one of the most well-known algorithms used in adaptive filtering, system identification and adaptive control. Its popularity is mainly due to its fast convergence speed, which is considered to be optimal in practice. In this paper, RLS methods are used to solve reinforcement learning problems, where two new reinforcement learning algorithms using linear value function approximators are proposed and analyzed. The two algorithms are called RLS-TD(λ) and Fast-AHC (Fast Adaptive Heuristic Critic), respectively. RLS-TD(λ) can be viewed as the extension of RLS-TD(0) from λ=0 to general 0≤λ≤1, so it is a multi-step temporal-difference (TD) learning algorithm using RLS methods. The convergence with probability one and the limit of convergence of RLS-TD(λ) are proved for ergodic Markov chains. Compared to the existing LS-TD(λ) algorithm, RLS-TD(λ) has advantages in computation and is more suitable for online learning. The effectiveness of RLS-TD(λ) is analyzed and verified by learning prediction experiments of Markov chains with a wide range of parameter settings. The Fast-AHC algorithm is derived by applying the proposed RLS-TD(λ) algorithm in the critic network of the adaptive heuristic critic method. Unlike the conventional AHC algorithm, Fast-AHC makes use of RLS methods to improve the learning-prediction efficiency in the critic. Learning control experiments of the cart-pole balancing and the acrobot swing-up problems are conducted to compare the data efficiency of Fast-AHC with conventional AHC. From the experimental results, it is shown that the data efficiency of learning control can also be improved by using RLS methods in the learning-prediction process of the critic. The performance of Fast-AHC is also compared with that of the AHC method using LS-TD(λ).
Furthermore, it is demonstrated in the experiments that different initial values of the variance matrix in RLS-TD( λ ) are required to get better performance not only in learning prediction but also in learning control. The experimental results are analyzed based on the existing theoretical work on the transient phase of forgetting factor RLS methods.
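A recursive least-squares TD(λ) update with linear features can be sketched as below. This is a hedged illustration, not a transcription of the paper's equations: it is the standard Sherman-Morrison form of the recursion, and the function name and the parameters `delta` (scale of the initial variance matrix) and `mu` (forgetting factor) are hypothetical names chosen for this example.

```python
import numpy as np

def rls_td_lambda(features, rewards, gamma, lam, delta=1e3, mu=1.0):
    """Estimate linear value-function weights with a recursive LS-TD(lambda).

    features: array of shape (T + 1, n), rows phi_0 ... phi_T
    rewards:  sequence of length T, r_t received on the step t -> t + 1
    delta:    initial variance matrix is delta * I (assumed initialization)
    mu:       forgetting factor in (0, 1]; mu = 1 means no forgetting
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[1]
    w = np.zeros(n)            # weight vector
    P = delta * np.eye(n)      # variance (inverse correlation) matrix
    z = features[0].copy()     # eligibility trace
    for t, r in enumerate(rewards):
        d = features[t] - gamma * features[t + 1]  # temporal feature difference
        Pz = P @ z
        k = Pz / (mu + d @ Pz)                     # gain vector
        w = w + k * (r - d @ w)                    # correct by the TD residual
        P = (P - np.outer(k, d @ P)) / mu          # rank-one update of P
        z = gamma * lam * z + features[t + 1]      # decay and accumulate trace
    return w
```

Each step costs O(n²) for n features, with no matrix inversion, which is the usual reason a recursive form suits online learning. The choice of `delta` affects the transient phase, consistent with the abstract's remark about initial values of the variance matrix.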
Online Multi-Object Tracking (MOT) from videos is a challenging computer vision task which has been extensively studied for decades. Most of the existing MOT algorithms are based on the Tracking-by-Detection (TBD) paradigm combined with popular machine learning approaches which largely reduce the human effort to tune algorithm parameters. However, the commonly used supervised learning approaches require labeled data (e.g., bounding boxes), which is expensive to obtain for videos. Also, the TBD framework is usually suboptimal since it is not end-to-end, i.e., it treats detection and tracking as separate steps rather than learning them jointly. To achieve both label-free and end-to-end learning of MOT, we propose a Tracking-by-Animation framework, where a differentiable neural model first tracks objects from input frames and then animates these objects into reconstructed frames. Learning is then driven by the reconstruction error through backpropagation. We further propose a Reprioritized Attentive Tracking to improve the robustness of data association. Experiments conducted on both synthetic and real video datasets show the potential of the proposed model. Our project page is publicly available at: https://github.com/zhen-he/tracking-by-animation
Feature-based matching methods have been widely used in remote sensing image matching given their capability to achieve excellent performance despite image geometric and radiometric distortions. However, most of the feature-based methods are unreliable for complex background variations, because the gradient or other image grayscale information used to construct the feature descriptor is sensitive to image background variations. Recently, deep learning-based methods have been proven suitable for high-level feature representation and comparison in image matching. Inspired by the progress made in deep learning, a new technical framework for remote sensing image matching based on the Siamese convolutional neural network is presented in this paper. First, a Siamese-type network architecture is designed to simultaneously learn the features and the corresponding similarity metric from labeled training examples of matching and non-matching true-color patch pairs. In the proposed network, two streams of convolutional and pooling layers sharing identical weights are arranged without the manually designed features. The number of convolutional layers is determined based on the factors that affect image matching. The sigmoid function is employed to compute the matching and non-matching probabilities in the output layer. Second, a gridding sub-pixel Harris algorithm is used to obtain the accurate localization of candidate matches. Third, a Gaussian pyramid coupling quadtree is adopted to gradually narrow down the searching space of the candidate matches, and multiscale patches are compared synchronously. Subsequently, a similarity measure based on the output of the sigmoid is adopted to find the initial matches. Finally, the random sample consensus algorithm and the whole-to-local quadratic polynomial constraints are used to remove false matches.
In the experiments, different types of satellite datasets, such as ZY3, GF1, IKONOS, and Google Earth images, with complex background variations are used to evaluate the performance of the proposed method. The experimental results demonstrate that the proposed method, which can significantly improve the matching performance of multi-temporal remote sensing images with complex background variations, is better than the state-of-the-art matching methods. In our experiments, the proposed method obtained a large number of evenly distributed matches (at least 10 times more than other methods) and achieved a high accuracy (less than 1 pixel in terms of root mean square error).
We propose a new saliency detection model by combining global information from frequency domain analysis and local information from spatial domain analysis. In the frequency domain analysis, instead of modeling salient regions, we model the nonsalient regions using global information; these so-called repeating patterns that are not distinctive in the scene are suppressed by using spectrum smoothing. In spatial domain analysis, we enhance those regions that are more informative by using a center-surround mechanism similar to that found in the visual cortex. Finally, the outputs from these two channels are combined to produce the saliency map. We demonstrate that the proposed model has the ability to highlight both small and large salient regions in cluttered scenes and to inhibit repeating objects. Experimental results also show that the proposed model outperforms existing algorithms in predicting the object regions to which humans pay more attention.
Road detection is an essential component of field robot navigation systems. Vision sensors play an important role in road detection for their great potential in environmental perception. In this paper, we propose a hierarchical vision sensor-based method for robust road detection in challenging road scenes. More specifically, for a given road image captured by an on-board vision sensor, we introduce a multiple population genetic algorithm (MPGA)-based approach for efficient road vanishing point detection. Superpixel-level seeds are then selected in an unsupervised way using a clustering strategy. Then, according to the GrowCut framework, the seeds proliferate and iteratively try to occupy their neighbors. After convergence, the initial road segment is obtained. Finally, in order to achieve a globally-consistent road segment, the initial road segment is refined using the conditional random field (CRF) framework, which integrates high-level information into road detection. We perform several experiments to evaluate the common performance, scale sensitivity and noise sensitivity of the proposed method. The experimental results demonstrate that the proposed method exhibits high robustness compared to the state of the art.
In this study, an approach using ground control point-free unmanned aerial vehicle (UAV)-based photogrammetry is proposed to estimate the volume of stockpiles carried on barges in a dynamic environment. Compared with similar studies regarding UAVs, an indirect absolute orientation based on the geometry of the vessel is used to establish a custom-built framework that can provide a unified reference instead of prerequisite ground control points (GCPs). To ensure sufficient overlap and reduce manual intervention, the stereo images are extracted from a UAV video for aerial triangulation. The region of interest is defined to exclude the area of water in all UAV images using a simple linear iterative clustering algorithm, which segments the UAV images into superpixels and helps to improve the accuracy of image matching. Structure-from-motion is used to recover three-dimensional geometry from the overlapping images without assistance of exterior parameters obtained from the airborne global positioning system and inertial measurement unit. Then, the semi-global matching algorithm is used to generate stockpile-covered and stockpile-free surface models. These models are oriented into a custom-built framework established by the known distance, such as the length and width of the vessel, and they do not require GCPs for coordinate transformation. Lastly, the volume of a stockpile is estimated by multiplying the height difference between the stockpile-covered and stockpile-free surface models by the size of the grid that is defined using the resolution of these models. Results show that a relatively small deviation of approximately ±2% between the volume estimated by UAV photogrammetry and the volume calculated by traditional manual measurement was obtained. 
Therefore, the proposed approach can be considered the better solution for the volume measurement of stockpiles carried on barges in a dynamic environment because UAV-based photogrammetry not only attains superior density and spatial object accuracy but also remarkably reduces data collection time.
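The final volume step (height difference between the two surface models, multiplied by the grid cell size) reduces to a short computation. The sketch below assumes both models are already co-registered rasters on the same grid with square cells; the function name, the argument names, and the treatment of no-data cells are hypothetical and not taken from the study.

```python
import numpy as np

def stockpile_volume(dsm_loaded, dsm_empty, cell_size):
    """Volume between two co-registered surface models on the same grid.

    dsm_loaded: heights of the stockpile-covered surface (2D array)
    dsm_empty:  heights of the stockpile-free surface (2D array)
    cell_size:  edge length of one square grid cell, in the height unit
    """
    diff = np.asarray(dsm_loaded, dtype=float) - np.asarray(dsm_empty, dtype=float)
    # Assumption: cells without a valid height (NaN or inf) contribute nothing.
    diff = np.where(np.isfinite(diff), diff, 0.0)
    # Sum of per-cell height differences times the area of one cell.
    return float(diff.sum() * cell_size ** 2)
```

For example, a grid of 12 cells with a uniform 1.5 m height difference and 0.5 m cells gives 12 × 1.5 × 0.25 = 4.5 m³.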