Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification), cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pre-trained parameters, our model achieves state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our pre-trained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR², improving the previous best result by 22% absolute (54% to 76%). Lastly, we present detailed ablation studies showing that both our novel model components and pre-training strategies contribute significantly to our strong results.
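As a rough illustration of how these five objectives might combine in code, the following is a minimal sketch, not the authors' implementation: the head modules, dimensions, and tensor names are assumptions, and the losses are applied at every position for brevity, whereas the paper computes the masked-token and masked-object losses only at masked positions.

```python
# Hypothetical sketch of combining LXMERT-style pre-training losses.
# Assumed encoder outputs: lang_out (B, T, d) from the language encoder,
# vis_out (B, O, d) from the object-relationship encoder, and
# cls_out (B, d) from the cross-modality encoder's [CLS] position.
import torch
import torch.nn as nn

vocab_size, n_obj_labels, n_answers, d, feat_dim = 30522, 1600, 3129, 768, 2048

lang_head = nn.Linear(d, vocab_size)         # masked language modeling
obj_label_head = nn.Linear(d, n_obj_labels)  # masked object label classification
obj_feat_head = nn.Linear(d, feat_dim)       # masked object feature regression
match_head = nn.Linear(d, 2)                 # cross-modality (image-sentence) matching
qa_head = nn.Linear(d, n_answers)            # image question answering

def pretraining_loss(lang_out, vis_out, cls_out,
                     mlm_labels, obj_labels, obj_feats,
                     match_labels, qa_labels):
    # ignore_index=-1 skips unmasked tokens/objects and QA-less examples.
    ce = nn.CrossEntropyLoss(ignore_index=-1)
    loss = ce(lang_head(lang_out).flatten(0, 1), mlm_labels.flatten())
    loss = loss + ce(obj_label_head(vis_out).flatten(0, 1), obj_labels.flatten())
    # Feature regression (applied to all objects here for brevity).
    loss = loss + nn.SmoothL1Loss()(obj_feat_head(vis_out), obj_feats)
    loss = loss + ce(match_head(cls_out), match_labels)
    loss = loss + ce(qa_head(cls_out), qa_labels)
    return loss
```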
Referring expressions are natural language constructions used to identify particular objects within a scene. In this paper, we propose a unified framework for the tasks of referring expression comprehension and generation. Our model is composed of three modules: speaker, listener, and reinforcer. The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions. The listener-speaker modules are trained jointly in an end-to-end learning framework, allowing the modules to be aware of one another during learning while also benefiting from the discriminative reinforcer's feedback. We demonstrate that this unified framework and training achieves state-of-the-art results for both comprehension and generation on three referring expression datasets. Project and demo page: https://vision.cs.unc.edu/refer.
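As a sketch of how the three modules might be tied together in a single training objective, consider the following; the module interfaces are hypothetical stand-ins, not the paper's exact CNN/LSTM architecture.

```python
# Illustrative sketch of a joint speaker-listener-reinforcer objective
# (hypothetical interfaces; the paper's modules and details differ).
import torch
import torch.nn.functional as F

def joint_loss(speaker, listener, reinforcer, image, region, distractor, expr):
    # Speaker: maximize the likelihood of the human referring expression.
    l_speak = -speaker.log_prob(expr, image, region).mean()
    # Listener: joint embedding of expressions and regions; a margin
    # ranking loss pulls the target region closer to the expression
    # than a same-image distractor region.
    e, r_pos, r_neg = listener(expr, image, region, distractor)
    l_listen = F.margin_ranking_loss(
        F.cosine_similarity(e, r_pos, dim=-1),
        F.cosine_similarity(e, r_neg, dim=-1),
        torch.ones(e.size(0)), margin=0.1)
    # Reinforcer: REINFORCE on a sampled expression, with a learned
    # reward scoring how discriminative the sampled expression is.
    sample, logp = speaker.sample(image, region)
    reward = reinforcer(sample, image, region)    # (B,) reward per sample
    l_reinforce = -(reward.detach() * logp).mean()
    return l_speak + l_listen + l_reinforce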
A grand goal in AI is to build a robot that can accurately navigate based on natural language instructions, which requires the agent to perceive the scene, understand and ground language, and act in the real-world environment. One key challenge here is to learn to navigate in new environments that are unseen during training. Most existing approaches perform dramatically worse in unseen environments than in seen ones. In this paper, we present a generalizable navigational agent. Our agent is trained in two stages. The first stage is training via mixed imitation and reinforcement learning, combining the benefits of both off-policy and on-policy optimization. The second stage is fine-tuning via newly-introduced 'unseen' triplets (environment, path, instruction). To generate these unseen triplets, we propose a simple but effective 'environmental dropout' method to mimic unseen environments, which overcomes the problem of limited seen environment variability. Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions. Empirically, we show that our agent generalizes substantially better when fine-tuned with these triplets, outperforming state-of-the-art approaches by a large margin on the private unseen test set of the Room-to-Room task and achieving the top rank on the leaderboard.
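To make the environment-level nature of this dropout concrete, here is a minimal sketch; the tensor layout and dropout rate are assumptions, not the paper's exact implementation.

```python
# Minimal sketch of 'environmental dropout', assuming the visual features
# of one environment are stacked as (n_viewpoints, n_views, d). Unlike
# standard dropout, a single feature mask is shared by every view in the
# environment, so a dropped visual pattern vanishes consistently,
# mimicking a new, unseen environment rather than adding per-view noise.
import torch

def environmental_dropout(env_feats: torch.Tensor, p: float = 0.4) -> torch.Tensor:
    d = env_feats.size(-1)
    mask = (torch.rand(d) > p).float() / (1.0 - p)  # inverted-dropout scaling
    return env_feats * mask                         # broadcast over all views
```

Back-translation then generates new instruction-path pairs inside these dropped-out environments, yielding the 'unseen' triplets used for fine-tuning.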
For non-Cartesian data acquisition in MRI, k-space trajectory infidelity due to eddy current effects and other hardware imperfections blurs and distorts the reconstructed images. Even with the shielded gradients and eddy current compensation techniques of current scanners, the deviation between the actual and requested k-space trajectories remains a major source of image artifacts in non-Cartesian MRI. It is often impractical to measure the k-space trajectory for each imaging slice. It has been reported that better image quality is achieved in radial scanning by correcting anisotropic delays on the different physical gradient axes. In this article the delay model is applied to spiral k-space trajectory estimation to reduce image artifacts. A novel estimation method combining the anisotropic delay model with a simple convolution eddy current model then further reduces the artifact level in spiral image reconstruction. The root mean square error and peak error in both phantom and in vivo images reconstructed using the estimated trajectories are substantially reduced.

Key words: MRI; spiral imaging; eddy currents; k-space trajectory

In MRI the theoretical k-space trajectory is proportional to the integral of the gradient current through each gradient coil. However, the actual k-space trajectory is always distorted by undesired effects in spatial encoding such as eddy currents and anisotropic gradient amplifier delays. To reduce the effects of eddy currents, manufacturers include active shielding and pre-emphasis filters in current scanners to eliminate most of the error. However, the residual error can still cause severe image artifacts, especially in non-Cartesian scanning such as radial and spiral imaging (1-3). If the k-space trajectory is not distorted severely and the k-space center is sampled, the actual k-space trajectory can be used in the reconstruction to remove most of the artifacts. Magnetic field monitoring (MFM) during MRI data acquisition using field probes has been proposed (11). This method is very promising since it can remove undesired phase terms in each individual scan. One limitation is that the probes have to be aligned with each imaging slice.

Aside from uncompensated eddy current effects leading to distortions of the gradient waveform shape, another significant problem is small timing delay errors arising in the hardware. Peters et al. (12) measured the delays on the different physical gradient axes in a calibration scan and used that information to correct the miscentering of k-space at each projection angle. Davies and Jezzard (13) proposed calibration and correction methods for gradient propagation delays for 2D RF pulses to improve positional accuracy. Speier and Trautwein (14) parameterized the gradient delays for radial imaging applications. There are also frequency demodulation delays, especially in off-axis imaging slices, as reported by Jung et al. (15).

In this work we propose a k-space trajectory estimation method based on an anisotropic gradient delay model (16) and a simple eddy current model. Our goal is to...
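As a rough illustration of the models named above, the following NumPy sketch combines a per-axis anisotropic delay with a simple exponential-kernel eddy current term; the parameter names, units, and kernel form are assumptions for illustration, not the article's calibrated implementation.

```python
# Sketch of trajectory estimation under assumed names and units: each
# physical gradient axis (x, y, z) is shifted by its own delay, a simple
# linear eddy current term (dg/dt convolved with a decaying exponential)
# is subtracted, and the result is integrated to give the k-space trajectory.
import numpy as np

def estimate_k_traj(g, dt, delays, ec_amp=0.0, ec_tau=1e-4, gamma=4257.6):
    """g: requested gradients, shape (n, 3), in G/cm (one column per
    physical axis); dt: sample spacing in s; delays: per-axis delays in s;
    gamma: gyromagnetic ratio over 2*pi in Hz/G. Returns k in cycles/cm."""
    n = g.shape[0]
    t = np.arange(n) * dt
    g_act = np.empty_like(g, dtype=float)
    kern = ec_amp * np.exp(-np.arange(0.0, 5 * ec_tau, dt) / ec_tau)
    for ax in range(3):
        # Anisotropic delay model: the realized waveform lags the request.
        g_act[:, ax] = np.interp(t - delays[ax], t, g[:, ax], left=0.0)
        if ec_amp:
            # Simple convolution eddy current model: the residual field
            # opposes gradient switching, proportional to dg/dt * exp(-t/tau).
            ec = np.convolve(np.gradient(g_act[:, ax], dt), kern)[:n] * dt
            g_act[:, ax] -= ec
    # The k-space trajectory is the integral of the gradient waveform.
    return gamma * np.cumsum(g_act, axis=0) * dt
```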
Humans learn language by listening, speaking, writing, reading, and also via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision, while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG.

[Figure: a BERT transformer trained with the standard masked language model objective vs. the voken classification task, which additionally predicts a voken (token-related image) for each masked token.]
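As an illustration of the retrieval step, a minimal sketch follows; the function and tensor names are hypothetical, and the actual vokenizer is a trained contextual token-image matching model rather than raw nearest-neighbor lookup.

```python
# Minimal sketch of the voken-retrieval idea: each token, in context,
# is mapped to the nearest image in a fixed candidate set -- its "voken".
import torch
import torch.nn.functional as F

def vokenize(token_embs: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """token_embs: contextual token embeddings, shape (seq_len, d);
    image_embs: embeddings of the candidate image set, shape (n_imgs, d).
    Returns the index of the best-matching image (voken id) per token."""
    t = F.normalize(token_embs, dim=-1)
    v = F.normalize(image_embs, dim=-1)
    scores = t @ v.T              # cosine relevance of every token-image pair
    return scores.argmax(dim=-1)  # voken id for each token
```

The retrieved voken ids can then supervise a voken classification head alongside the masked language modeling objective during pre-training.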