Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, which provide the natural basis for attention to operate on. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state of the art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5, and 36.9, respectively. Demonstrating the broad applicability of the method, we apply the same approach to VQA and obtain first place in the 2017 VQA Challenge.
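The core of this design is the top-down step: a learned attention distribution over the k bottom-up region features, conditioned on a task-specific query such as the caption decoder's hidden state. A minimal PyTorch sketch follows; the layer sizes, names, and the additive (tanh) scoring form are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of top-down attention over bottom-up region features,
# assuming k precomputed Faster R-CNN region vectors per image.
# Dimensions and layer names below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, region_dim=2048, query_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, query):
        # regions: (batch, k, region_dim) bottom-up features
        # query:   (batch, query_dim) top-down context, e.g. an LSTM state
        joint = torch.tanh(self.proj_region(regions)
                           + self.proj_query(query).unsqueeze(1))
        weights = F.softmax(self.score(joint), dim=1)   # (batch, k, 1)
        return (weights * regions).sum(dim=1)           # attended feature
```

The attended feature vector can then feed the language decoder (captioning) or the answer classifier (VQA).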
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, while a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves new state-of-the-art performance. Second, to improve the generalizability of the learned policy, we introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which dramatically reduces the success-rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
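To make the reward structure concrete, the sketch below shows one hedged reading of how the matching critic's intrinsic reward can be mixed with the extrinsic navigation reward during RL; the critic interface, its `score` method, and the weight `delta` are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of RCM-style reward shaping: the extrinsic environment
# reward is combined with an intrinsic reward from the matching critic,
# which scores how well the executed trajectory matches the instruction.
# `matching_critic.score` and `delta` are assumed names/values.

def total_reward(extrinsic_reward, instruction, trajectory,
                 matching_critic, delta=0.5):
    # Intrinsic reward: the critic's estimate of how well the trajectory
    # reconstructs the instruction (global cross-modal matching signal).
    intrinsic_reward = matching_critic.score(instruction, trajectory)
    return extrinsic_reward + delta * intrinsic_reward
```

The intrinsic term addresses the ill-posed feedback problem: even when the final success signal is sparse or ambiguous, the agent is rewarded for trajectories that remain globally faithful to the instruction.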
Digital coding representations of meta-atoms make it possible to realize intelligent designs of metasurfaces by means of machine learning algorithms. Here, a machine-learning method for designing anisotropic digital coding metasurfaces is proposed, in which meta-atoms may take arbitrary absolute phase values at different positions and under different polarizations. A deep neural network is proposed to predict this vast and complex system, trained on only 70,000 coding patterns. Another 10,000 randomly chosen coding patterns are used to validate the network, showing that 90.05% of the predicted phase responses are within 2° error over the 360° phase range. Using the learned network, the correct coding pattern among roughly 18 billion billions of choices for a required phase can be found within a second, completing the automatic design of anisotropic meta-atoms. Three functional 1-bit anisotropic coding metasurfaces are intelligently realized by the learned network: dual-beam scattering with left-handed circular polarization (LHCP) for one beam and right-handed circular polarization (RHCP) for the other, dual-beam scattering with circular polarization for one beam and linear polarization (LP) for the other, and triple-beam scattering with LHCP and RHCP for two beams and LP for the third.
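As a rough illustration of the forward model described above, the sketch below maps a 1-bit coding pattern to a predicted phase response. The 8x8 pattern size (2^64 ≈ 1.8 × 10^19 possible patterns, consistent with the "18 billion billions" of choices) and the network depth are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of a forward phase predictor for 1-bit coding patterns.
# Assumption: each anisotropic meta-atom configuration is an 8x8 binary
# coding matrix; the layer sizes are illustrative only.
import torch
import torch.nn as nn

class PhasePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, 1),  # predicted phase response in degrees
        )

    def forward(self, coding_pattern):
        # coding_pattern: (batch, 1, 8, 8) tensor of 0/1 coding states
        return self.net(coding_pattern)
```

Once such a forward predictor is trained on the 70,000 labeled patterns, candidate coding patterns can be screened rapidly against a required phase, which is what makes finding a design within a second plausible without running full-wave simulations.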
Unlike image or video data, which can be easily labeled by humans, sensor data annotation is a time-consuming process. However, traditional methods for human activity recognition require a large amount of such strictly labeled data for training classifiers. In this paper, we present an attention-based convolutional neural network for human activity recognition from weakly labeled data. The proposed attention model can focus on the labeled activity within a long sequence of sensor data while filtering out a large amount of background noise signals. In experiments on the weakly labeled dataset, we show that our attention model outperforms classical deep learning methods in accuracy. Moreover, we determine the specific location of the labeled activity in a long sequence of weakly labeled data by converting the compatibility score generated by the attention model into a compatibility density. Our method greatly facilitates the process of sensor data annotation and makes data collection easier.
Index Terms: Human activity recognition, attention-based convolutional neural network, compatibility density, weakly supervised learning, wearable sensor data.
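The localization step lends itself to a short sketch: per-timestep compatibility scores from the attention model are normalized and locally smoothed into a density whose peaks indicate where the labeled activity lies. The softmax normalization and the smoothing window below are assumptions, not the paper's exact conversion.

```python
# Hedged sketch: converting per-timestep compatibility scores into a
# compatibility density for localizing the labeled activity. The softmax
# normalization and window size are illustrative assumptions.
import numpy as np

def compatibility_density(scores, window=25):
    """scores: (T,) attention compatibility scores over a sensor sequence."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # normalize to a distribution
    kernel = np.ones(window) / window
    return np.convolve(weights, kernel, mode="same")  # locally smoothed density

# Usage: the labeled activity is located where the density peaks.
# t_star = int(np.argmax(compatibility_density(scores)))
```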