A common trend in object recognition is to detect and lever
We propose a definition of saliency by considering what the visual system is trying to optimize when directing attention. The resulting model is a Bayesian framework from which bottom-up saliency emerges naturally as the self-information of visual features, and overall saliency (incorporating top-down information with bottom-up saliency) emerges as the pointwise mutual information between the features and the target when searching for a target. An implementation of our framework demonstrates that our model's bottom-up saliency maps perform as well as or better than existing algorithms in predicting people's fixations in free viewing. Unlike existing saliency measures, which depend on the statistics of the particular image being viewed, our measure of saliency is derived from natural image statistics, obtained in advance from a collection of natural images. For this reason, we call our model SUN (Saliency Using Natural statistics). A measure of saliency based on natural image statistics, rather than based on a single test image, provides a straightforward explanation for many search asymmetries observed in humans; the statistics of a single test image lead to predictions that are not consistent with these asymmetries. In our model, saliency is computed locally, which is consistent with the neuroanatomy of the early visual system and results in an efficient algorithm with few free parameters.
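As a rough illustration of the two quantities named in this abstract, the sketch below computes self-information, -log p(f), and pointwise mutual information, log [p(f, c) / (p(f) p(c))], from small count tables. The feature labels, the counts, and the function names are all assumptions made for illustration. The model described above estimates feature statistics from a collection of natural images, and that estimation step is not reproduced here.

```python
import math

def self_information(feature_counts, feature):
    """Bottom-up saliency of a feature value: -log p(feature)."""
    total = sum(feature_counts.values())
    return -math.log(feature_counts[feature] / total)

def pointwise_mutual_information(joint_counts, feature, target):
    """Overall saliency: log p(feature, target) / (p(feature) * p(target))."""
    total = sum(joint_counts.values())
    p_joint = joint_counts[(feature, target)] / total
    p_feature = sum(n for (f, _), n in joint_counts.items() if f == feature) / total
    p_target = sum(n for (_, c), n in joint_counts.items() if c == target) / total
    return math.log(p_joint / (p_feature * p_target))

# Hypothetical counts, standing in for statistics gathered in advance
feature_counts = {"vertical": 90, "tilted": 10}
joint_counts = {
    ("vertical", "target"): 5, ("vertical", "background"): 85,
    ("tilted", "target"): 5, ("tilted", "background"): 5,
}

rare = self_information(feature_counts, "tilted")
common = self_information(feature_counts, "vertical")
pmi_tilted = pointwise_mutual_information(joint_counts, "tilted", "target")
```

With these made-up counts, the rarer feature value receives the higher self-information, and the positive pointwise mutual information indicates that the feature co-occurs with the target more often than chance would predict.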
This article examines the human face as a transmitter of expression signals and the brain as a decoder of these expression signals. If the face has evolved to optimize transmission of such signals, the basic facial expressions should have minimal overlap in their information. If the brain has evolved to optimize categorization of expressions, it should be efficient with the information available from the transmitter for the task. In this article, we characterize the information underlying the recognition of the six basic facial expression signals and evaluate how efficiently each expression is decoded by the underlying brain structures.
Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-the-art result of 80.1% mIOU on the test set at the time of submission. We have also achieved state-of-the-art results overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https
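One way to picture the "gridding issue" is to list which input positions can reach a single output unit after several stacked dilated convolutions. The sketch below does this in one dimension, assuming that gridding means holes in the receptive field. The helper names and the two sets of dilation rates are illustrative choices, not values taken from the abstract.

```python
def covered_offsets(dilation_rates, kernel_size=3):
    """Input offsets (1-D) that can influence one output unit after
    stacking dilated convolutions with the given rates."""
    half = kernel_size // 2
    offsets = {0}
    for rate in dilation_rates:
        taps = [k * rate for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

def coverage_ratio(dilation_rates, kernel_size=3):
    """Fraction of positions inside the receptive field that are sampled."""
    offsets = covered_offsets(dilation_rates, kernel_size)
    span = offsets[-1] - offsets[0] + 1
    return len(offsets) / span

same_rate = covered_offsets([2, 2, 2])   # a single repeated rate
mixed_rate = covered_offsets([1, 2, 3])  # a mix of rates
```

Both stacks span the same 13 positions, but the repeated rate only ever reaches even offsets, whereas the mixed rates reach every position in the span.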
The nonlinear autoregressive exogenous (NARX) model, which predicts the current value of a time series based upon its previous values as well as the current and past values of multiple driving (exogenous) series, has been studied for decades. Although various NARX models have been developed, few of them can appropriately capture long-term temporal dependencies and select the relevant driving series to make predictions. In this paper, we propose a dual-stage attention-based recurrent neural network (DA-RNN) to address these two issues. In the first stage, we introduce an input attention mechanism to adaptively extract relevant driving series (i.e., input features) at each time step by referring to the previous encoder hidden state. In the second stage, we use a temporal attention mechanism to select relevant encoder hidden states across all time steps. With this dual-stage attention scheme, our model can not only make predictions effectively, but can also be easily interpreted. Thorough empirical studies based upon the SML 2010 dataset and the NASDAQ 100 Stock dataset demonstrate that the DA-RNN can outperform state-of-the-art methods for time series prediction.
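The second stage described above, selecting relevant encoder hidden states across time steps, can be sketched as a softmax-weighted sum. The abstract does not specify the scoring function, so this minimal sketch assumes a plain dot product between each encoder state and a decoder state. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(encoder_states, decoder_state):
    """Return (weights, context): one weight per time step and the
    weighted sum of encoder hidden states.

    Assumption: relevance is a dot product; a learned scoring
    function would replace it in a trained model.
    """
    scores = [sum(h_i * d_i for h_i, d_i in zip(h, decoder_state))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states))
               for j in range(dim)]
    return weights, context

# Hypothetical hidden states for three time steps (dimension 2)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, 2.0]
weights, context = temporal_attention(encoder_states, decoder_state)
```

The weights sum to one, so they can be read directly as the relative importance of each time step, which is one way such a model lends itself to interpretation.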
It is well known that there exist preferred landing positions for eye fixations in visual word recognition. However, the existence of preferred landing positions in face recognition is less well established. It is also unknown how many fixations are required to recognize a face. To investigate these questions, we recorded eye movements during face recognition. During an otherwise standard face-recognition task, subjects were allowed a variable number of fixations before the stimulus was masked. We found that optimal recognition performance is achieved with two fixations; performance does not improve with additional fixations. The distribution of the first fixation is just to the left of the center of the nose, and that of the second fixation is around the center of the nose. Thus, these appear to be the preferred landing positions for face recognition. Furthermore, the fixations made during face learning differ in location from those made during face recognition and are also more variable in duration; this suggests that different strategies are used for face learning and face recognition.
We use the logographic characteristic of Chinese orthography to examine whether face-specific effects, such as holistic processing and the left side bias effect, can also be observed in expertise-level Chinese character processing by comparing novices' and experts' perception of Chinese characters. We show that non-Chinese readers (novices) perceive characters more holistically than Chinese readers (experts). Chinese readers have a better awareness of the components of characters, which are not clearly separable to novices. This suggests that holistic processing is not a general visual expertise marker; it depends on the features of the stimuli and the tasks typically performed on them. In contrast, similar to face perception, Chinese readers have a left side bias effect in the perception of mirror-symmetric characters, whereas novices do not; this effect is also reflected in their eye fixation behavior. This suggests that the left side bias effect may be a visual expertise marker.