Sequential visual tasks usually require attending to the object of current interest conditioned on previous observations. Different from the popular soft attention mechanism, we propose a new attention framework by introducing a novel conditional global feature, which serves as a weak feature descriptor of the currently attended object. Specifically, for a standard CNN (Convolutional Neural Network) pipeline, convolutional layers with different receptive fields are used to produce attention maps by measuring how well the convolutional features align to the conditional global feature. The conditional global feature can be generated by different recurrent structures depending on the visual task, such as a simple recurrent neural network for multiple object recognition, or a moderately complex language model for image captioning. Experiments show that our proposed conditional attention model achieves the best performance on the SVHN (Street View House Numbers) dataset with and without extra bounding boxes; for image captioning, our attention model also obtains better scores than the popular soft attention model.
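To make the mechanism concrete, the following is a minimal sketch of one conditional attention step in PyTorch. It is not the paper's actual implementation: it assumes a dot-product compatibility between local convolutional features and the conditional global feature, and the class name `ConditionalAttention`, the linear projection, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalAttention(nn.Module):
    """Hypothetical sketch of one conditional attention step: score each
    local convolutional feature against a conditional global feature g_t
    produced by a recurrent model (names and sizes are assumptions)."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        # Project the recurrent hidden state into the conv feature space
        # so the two can be compared by a dot product.
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, conv_feats, hidden):
        # conv_feats: (B, C, H, W) local features from one conv layer
        # hidden:     (B, hidden_dim) conditional global feature g_t
        B, C, H, W = conv_feats.shape
        g = self.proj(hidden)                     # (B, C)
        local = conv_feats.view(B, C, H * W)      # (B, C, HW)
        # Dot-product compatibility between every location and g_t.
        scores = torch.bmm(g.unsqueeze(1), local).squeeze(1)   # (B, HW)
        attn = F.softmax(scores, dim=1)           # attention over locations
        # Attention-weighted descriptor of the currently attended object.
        desc = torch.bmm(local, attn.unsqueeze(2)).squeeze(2)  # (B, C)
        return attn.view(B, H, W), desc
```

With several convolutional layers of different receptive fields, one such module per layer would yield attention maps at multiple scales, in the spirit of the pipeline described above.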
Recent successes in machine translation [1], speech recognition [2], and image captioning [3] have demonstrated the important role of the attention mechanism. In computer vision, much like the human visual system, attention does not need to process the whole image, but only its salient regions. For example, [4], [5], [6], [7] embedded attention mechanisms into image captioning, enabling the model to learn to automatically generate a caption describing the content of an image. Subsequently, attention approaches were introduced into the emerging visual question answering (VQA) task, greatly improving overall performance [8], [9], [7]. Recently, [10] proposed a novel end-to-end trainable attention module for convolutional neural network architectures. The core idea of their work lies in estimating the attention maps by measuring how the local convolutional features align to the global feature, which is different from the popular soft attention mechanism.
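To make the contrast concrete: in [10] the global feature comes from the network's own top layer and is fixed per image, whereas a conditional global feature evolves with a recurrent state. Building on the hypothetical `ConditionalAttention` sketch above (the GRU cell, sizes, and step count are likewise assumptions, not the paper's configuration):

```python
# Hypothetical usage of the sketch above: g_t is a recurrent state that is
# updated after every glimpse, unlike the static global feature of [10].
rnn = nn.GRUCell(input_size=256, hidden_size=512)
attend = ConditionalAttention(feat_dim=256, hidden_dim=512)

h = torch.zeros(4, 512)                    # initial conditional state
conv_feats = torch.randn(4, 256, 14, 14)   # features from one conv layer
for t in range(3):                         # e.g. one step per digit or word
    attn_map, desc = attend(conv_feats, h)
    h = rnn(desc, h)                       # condition the next attention step
```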