Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, which provide the natural basis for attention to operate on. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state of the art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5, and 36.9, respectively. Demonstrating the broad applicability of the method, we apply the same approach to VQA and obtain first place in the 2017 VQA Challenge.
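The core of this design is the top-down step: a learned attention distribution over the k bottom-up region features, conditioned on a task-specific query such as the caption decoder's hidden state. A minimal PyTorch sketch follows; the layer sizes, names, and the additive (tanh) scoring form are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of top-down attention over bottom-up region features,
# assuming k precomputed Faster R-CNN region vectors per image.
# Dimensions and layer names below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, region_dim=2048, query_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, query):
        # regions: (batch, k, region_dim) bottom-up features
        # query:   (batch, query_dim) top-down context, e.g. an LSTM state
        joint = torch.tanh(self.proj_region(regions)
                           + self.proj_query(query).unsqueeze(1))
        weights = F.softmax(self.score(joint), dim=1)   # (batch, k, 1)
        return (weights * regions).sum(dim=1)           # attended feature
```

The attended feature vector can then feed the language decoder (captioning) or the answer classifier (VQA).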
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, while a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves new state-of-the-art performance. Second, to improve the generalizability of the learned policy, we introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which dramatically reduces the success-rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
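To make the reward structure concrete, the sketch below shows one hedged reading of how the matching critic's intrinsic reward can be mixed with the extrinsic navigation reward during RL; the critic interface, its `score` method, and the weight `delta` are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of RCM-style reward shaping: the extrinsic environment
# reward is combined with an intrinsic reward from the matching critic,
# which scores how well the executed trajectory matches the instruction.
# `matching_critic.score` and `delta` are assumed names/values.

def total_reward(extrinsic_reward, instruction, trajectory,
                 matching_critic, delta=0.5):
    # Intrinsic reward: the critic's estimate of how well the trajectory
    # reconstructs the instruction (global cross-modal matching signal).
    intrinsic_reward = matching_critic.score(instruction, trajectory)
    return extrinsic_reward + delta * intrinsic_reward
```

The intrinsic term addresses the ill-posed feedback problem: even when the final success signal is sparse or ambiguous, the agent is rewarded for trajectories that remain globally faithful to the instruction.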
Digital coding representations of meta-atoms make it possible to realize intelligent designs of metasurfaces by means of machine learning algorithms. Here, a machine-learning method for designing anisotropic digital coding metasurfaces is proposed, in which meta-atoms may take arbitrary absolute phase values at different positions and under different polarizations. A deep neural network is proposed to predict this vast and complex system, trained on only 70,000 coding patterns. Another 10,000 randomly chosen coding patterns are used to validate the network, showing that 90.05% of the predicted phase responses are within 2° error over the 360° phase range. Using the learned network, the correct coding pattern among roughly 18 billion billions of choices for a required phase can be found within a second, completing the automatic design of anisotropic meta-atoms. Three functional 1-bit anisotropic coding metasurfaces are intelligently realized by the learned network: dual-beam scattering with left-handed circular polarization (LHCP) for one beam and right-handed circular polarization (RHCP) for the other, dual-beam scattering with circular polarization for one beam and linear polarization (LP) for the other, and triple-beam scattering with LHCP and RHCP for two beams and LP for the third.
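As a rough illustration of the forward model described above, the sketch below maps a 1-bit coding pattern to a predicted phase response. The 8x8 pattern size (2^64 ≈ 1.8 × 10^19 possible patterns, consistent with the "18 billion billions" of choices) and the network depth are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of a forward phase predictor for 1-bit coding patterns.
# Assumption: each anisotropic meta-atom configuration is an 8x8 binary
# coding matrix; the layer sizes are illustrative only.
import torch
import torch.nn as nn

class PhasePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, 1),  # predicted phase response in degrees
        )

    def forward(self, coding_pattern):
        # coding_pattern: (batch, 1, 8, 8) tensor of 0/1 coding states
        return self.net(coding_pattern)
```

Once such a forward predictor is trained on the 70,000 labeled patterns, candidate coding patterns can be screened rapidly against a required phase, which is what makes finding a design within a second plausible without running full-wave simulations.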
Unlike image or video data, which can be easily labeled by humans, sensor data annotation is a time-consuming process. However, traditional methods for human activity recognition require a large amount of such strictly labeled data for training classifiers. In this paper, we present an attention-based convolutional neural network for human activity recognition from weakly labeled data. The proposed attention model can focus on the labeled activity within a long sequence of sensor data while filtering out a large amount of background noise signals. In experiments on the weakly labeled dataset, we show that our attention model outperforms classical deep learning methods in accuracy. Moreover, we determine the specific location of the labeled activity in a long sequence of weakly labeled data by converting the compatibility score generated by the attention model into a compatibility density. Our method greatly facilitates the process of sensor data annotation and makes data collection easier.
Index Terms: Human activity recognition, attention-based convolutional neural network, compatibility density, weakly supervised learning, wearable sensor data.
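The localization step lends itself to a short sketch: per-timestep compatibility scores from the attention model are normalized and locally smoothed into a density whose peaks indicate where the labeled activity lies. The softmax normalization and the smoothing window below are assumptions, not the paper's exact conversion.

```python
# Hedged sketch: converting per-timestep compatibility scores into a
# compatibility density for localizing the labeled activity. The softmax
# normalization and window size are illustrative assumptions.
import numpy as np

def compatibility_density(scores, window=25):
    """scores: (T,) attention compatibility scores over a sensor sequence."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # normalize to a distribution
    kernel = np.ones(window) / window
    return np.convolve(weights, kernel, mode="same")  # locally smoothed density

# Usage: the labeled activity is located where the density peaks.
# t_star = int(np.argmax(compatibility_density(scores)))
```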