Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR², improving the previous best result by 22% absolute (54% to 76%). Lastly, we present detailed ablation studies showing that both our novel model components and pre-training strategies contribute significantly to our strong results.
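As a rough illustration of how these five objectives might combine in code, the following is a minimal sketch, not the authors' implementation: the head modules, dimensions, and tensor names are assumptions, and the losses are applied at every position for brevity, whereas the paper computes the masked-token and masked-object losses only at masked positions.

```python
# Hypothetical sketch of combining LXMERT-style pre-training losses.
# Assumed encoder outputs: lang_out (B, T, d) from the language encoder,
# vis_out (B, O, d) from the object-relationship encoder, and
# cls_out (B, d) from the cross-modality encoder's [CLS] position.
import torch
import torch.nn as nn

vocab_size, n_obj_labels, n_answers, d, feat_dim = 30522, 1600, 3129, 768, 2048

lang_head = nn.Linear(d, vocab_size)         # masked language modeling
obj_label_head = nn.Linear(d, n_obj_labels)  # masked object label classification
obj_feat_head = nn.Linear(d, feat_dim)       # masked object feature regression
match_head = nn.Linear(d, 2)                 # cross-modality (image-sentence) matching
qa_head = nn.Linear(d, n_answers)            # image question answering

def pretraining_loss(lang_out, vis_out, cls_out,
                     mlm_labels, obj_labels, obj_feats,
                     match_labels, qa_labels):
    # ignore_index=-1 skips unmasked tokens/objects and QA-less examples.
    ce = nn.CrossEntropyLoss(ignore_index=-1)
    loss = ce(lang_head(lang_out).flatten(0, 1), mlm_labels.flatten())
    loss = loss + ce(obj_label_head(vis_out).flatten(0, 1), obj_labels.flatten())
    # Feature regression (applied to all objects here for brevity).
    loss = loss + nn.SmoothL1Loss()(obj_feat_head(vis_out), obj_feats)
    loss = loss + ce(match_head(cls_out), match_labels)
    loss = loss + ce(qa_head(cls_out), qa_labels)
    return loss
```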
Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end learning framework, allowing the modules to be aware of one another during learning while also benefiting from the discriminative reinforcer's feedback. We demonstrate that this unified framework and training achieves state-of-the-art results for both comprehension and generation on three referring expression datasets. Project and demo page: https://vision.cs.unc.edu/refer.
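As a sketch of how the three modules might be tied together in a single training objective, consider the following; the module interfaces are hypothetical stand-ins, not the paper's exact CNN/LSTM architecture.

```python
# Illustrative sketch of a joint speaker-listener-reinforcer objective
# (hypothetical interfaces; the paper's modules and details differ).
import torch
import torch.nn.functional as F

def joint_loss(speaker, listener, reinforcer, image, region, distractor, expr):
    # Speaker: maximize the likelihood of the human referring expression.
    l_speak = -speaker.log_prob(expr, image, region).mean()
    # Listener: joint embedding of expressions and regions; a margin
    # ranking loss pulls the target region closer to the expression
    # than a same-image distractor region.
    e, r_pos, r_neg = listener(expr, image, region, distractor)
    l_listen = F.margin_ranking_loss(
        F.cosine_similarity(e, r_pos, dim=-1),
        F.cosine_similarity(e, r_neg, dim=-1),
        torch.ones(e.size(0)), margin=0.1)
    # Reinforcer: REINFORCE on a sampled expression, with a learned
    # reward scoring how discriminative the sampled expression is.
    sample, logp = speaker.sample(image, region)
    reward = reinforcer(sample, image, region)    # (B,) reward per sample
    l_reinforce = -(reward.detach() * logp).mean()
    return l_speak + l_listen + l_reinforce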
A grand goal in AI is to build a robot that can accurately navigate based on natural language instructions, which requires the agent to perceive the scene, understand and ground language, and act in the real-world environment. One key challenge here is to learn to navigate in new environments that are unseen during training. Most existing approaches perform dramatically worse in unseen environments than in seen ones. In this paper, we present a generalizable navigational agent. Our agent is trained in two stages. The first stage is training via mixed imitation and reinforcement learning, combining the benefits of both off-policy and on-policy optimization. The second stage is fine-tuning via newly-introduced 'unseen' triplets (environment, path, instruction). To generate these unseen triplets, we propose a simple but effective 'environmental dropout' method to mimic unseen environments, which overcomes the problem of limited seen environment variability. Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions. Empirically, we show that our agent generalizes substantially better when fine-tuned with these triplets, outperforming state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task and achieving the top rank on the leaderboard.
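To make the environment-level nature of this dropout concrete, here is a minimal sketch; the tensor layout and dropout rate are assumptions, not the paper's exact implementation.

```python
# Minimal sketch of 'environmental dropout', assuming the visual features
# of one environment are stacked as (n_viewpoints, n_views, d). Unlike
# standard dropout, a single feature mask is shared by every view in the
# environment, so a dropped visual pattern vanishes consistently,
# mimicking a new, unseen environment rather than adding per-view noise.
import torch

def environmental_dropout(env_feats: torch.Tensor, p: float = 0.4) -> torch.Tensor:
    d = env_feats.size(-1)
    mask = (torch.rand(d) > p).float() / (1.0 - p)  # inverted-dropout scaling
    return env_feats * mask                         # broadcast over all views
```

Back-translation then generates new instruction-path pairs inside these dropped-out environments, yielding the 'unseen' triplets used for fine-tuning.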
For non-Cartesian data acquisition in MRI, k-space trajectory infidelity due to eddy current effects and other hardware imperfections blurs and distorts the reconstructed images. Even with the shielded gradients and eddy current compensation techniques of current scanners, the deviation between the actual and requested k-space trajectories remains a major source of image artifacts in non-Cartesian MRI. It is often impractical to measure the k-space trajectory for each imaging slice. It has been reported that better image quality is achieved in radial scanning by correcting anisotropic delays on the different physical gradient axes. In this article the delay model is applied to spiral k-space trajectory estimation to reduce image artifacts. A novel estimation method combining the anisotropic delay model with a simple convolution eddy current model then further reduces the artifact level in spiral image reconstruction. The root mean square error and peak error in both phantom and in vivo images reconstructed using the estimated trajectories are substantially reduced.

Key words: MRI; spiral imaging; eddy currents; k-space trajectory

In MRI the theoretical k-space trajectory is proportional to the integral of the gradient current through each gradient coil. However, the actual k-space trajectory is always distorted by undesired effects in spatial encoding such as eddy currents and anisotropic gradient amplifier delays. To reduce the effects of eddy currents, manufacturers include active shielding and pre-emphasis filters in current scanners to eliminate most of the error. However, the residual error can still cause severe image artifacts, especially in non-Cartesian scanning such as radial and spiral imaging (1-3). If the k-space trajectory is not distorted severely and the k-space center is sampled, the actual k-space trajectory can be used in the reconstruction to remove most of the artifacts. Magnetic field monitoring (MFM) during MRI data acquisition using field probes has been proposed (11). This method is very promising since it can remove undesired phase terms in each individual scan. One limitation is that the probes have to be aligned with each imaging slice.

Aside from uncompensated eddy current effects leading to distortions of the gradient waveform shape, another significant problem is small timing delay errors arising in the hardware. Peters et al. (12) measured the delays on the different physical gradient axes in a calibration scan and used that information to correct the miscentering of k-space at each projection angle. Davies and Jezzard (13) proposed calibration and correction methods for gradient propagation delays for 2D RF pulses to improve positional accuracy. Speier and Trautwein (14) parameterized the gradient delays for radial imaging applications. There are also frequency demodulation delays, especially in off-axis imaging slices, as reported by Jung et al. (15).

In this work we propose a k-space trajectory estimation method based on an anisotropic gradient delay model (16) and a simple eddy current model. Our goal is to...
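As a rough illustration of the models named above, the following NumPy sketch combines a per-axis anisotropic delay with a simple exponential-kernel eddy current term; the parameter names, units, and kernel form are assumptions for illustration, not the article's calibrated implementation.

```python
# Sketch of trajectory estimation under assumed names and units: each
# physical gradient axis (x, y, z) is shifted by its own delay, a simple
# linear eddy current term (dg/dt convolved with a decaying exponential)
# is subtracted, and the result is integrated to give the k-space trajectory.
import numpy as np

def estimate_k_traj(g, dt, delays, ec_amp=0.0, ec_tau=1e-4, gamma=4257.6):
    """g: requested gradients, shape (n, 3), in G/cm (one column per
    physical axis); dt: sample spacing in s; delays: per-axis delays in s;
    gamma: gyromagnetic ratio over 2*pi in Hz/G. Returns k in cycles/cm."""
    n = g.shape[0]
    t = np.arange(n) * dt
    g_act = np.empty_like(g, dtype=float)
    kern = ec_amp * np.exp(-np.arange(0.0, 5 * ec_tau, dt) / ec_tau)
    for ax in range(3):
        # Anisotropic delay model: the realized waveform lags the request.
        g_act[:, ax] = np.interp(t - delays[ax], t, g[:, ax], left=0.0)
        if ec_amp:
            # Simple convolution eddy current model: the residual field
            # opposes gradient switching, proportional to dg/dt * exp(-t/tau).
            ec = np.convolve(np.gradient(g_act[:, ax], dt), kern)[:n] * dt
            g_act[:, ax] -= ec
    # The k-space trajectory is the integral of the gradient waveform.
    return gamma * np.cumsum(g_act, axis=0) * dt
```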
Humans learn language by listening, speaking, writing, reading, and also via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision, while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG.

[Figure: a BERT transformer trained with the standard masked language model objective vs. the voken classification task, which additionally predicts a voken (token-related image) for each masked token.]
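As an illustration of the retrieval step, a minimal sketch follows; the function and tensor names are hypothetical, and the actual vokenizer is a trained contextual token-image matching model rather than raw nearest-neighbor lookup.

```python
# Minimal sketch of the voken-retrieval idea: each token, in context,
# is mapped to the nearest image in a fixed candidate set -- its "voken".
import torch
import torch.nn.functional as F

def vokenize(token_embs: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """token_embs: contextual token embeddings, shape (seq_len, d);
    image_embs: embeddings of the candidate image set, shape (n_imgs, d).
    Returns the index of the best-matching image (voken id) per token."""
    t = F.normalize(token_embs, dim=-1)
    v = F.normalize(image_embs, dim=-1)
    scores = t @ v.T              # cosine relevance of every token-image pair
    return scores.argmax(dim=-1)  # voken id for each token
```

The retrieved voken ids can then supervise a voken classification head alongside the masked language modeling objective during pre-training.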