Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic provides an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator performs cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which substantially reduces the success-rate gap between seen and unseen environments (from 30.7% to 11.7%).
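The RCM reward structure described above can be sketched as an extrinsic environment reward mixed with the matching critic's intrinsic reward. This is a minimal illustration, not the paper's exact formulation: the function name and the mixing weight `delta` are assumptions.

```python
def shaped_reward(env_reward, match_prob, delta=0.5):
    """Mix the extrinsic reward from the environment with an intrinsic
    reward from a matching critic: the critic's probability that the
    executed trajectory reconstructs the original instruction.

    `delta` is a hypothetical mixing coefficient, not a value from the paper.
    """
    return env_reward + delta * match_prob
```

A higher `match_prob` means the trajectory is globally better aligned with the instruction, so the agent is rewarded even before reaching the goal.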
Many deep learning architectures have been proposed to model the compositionality in text sequences, requiring a substantial number of parameters and expensive computations. However, there has not been a rigorous evaluation of the added value of such sophisticated compositional functions. In this paper, we conduct a point-by-point comparative study between Simple Word-Embedding-based Models (SWEMs), consisting of parameter-free pooling operations, and word-embedding-based RNN/CNN models. Surprisingly, SWEMs exhibit comparable or even superior performance in the majority of cases considered. Based upon this understanding, we propose two additional pooling strategies over learned word embeddings: (i) a max-pooling operation for improved interpretability; and (ii) a hierarchical pooling operation, which preserves spatial (n-gram) information within text sequences. We present experiments on 17 datasets encompassing three tasks: (i) (long) document classification; (ii) text sequence matching; and (iii) short text tasks, including classification and tagging. The source code and datasets can be obtained from https://github.com/dinghanshen/SWEM.
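The parameter-free pooling operations at the heart of SWEMs are simple to sketch. The following is a minimal illustration under assumed names (`swem_pool`, `mode`, `window` are not from the paper): average pooling, max pooling, and hierarchical pooling that averages within each local n-gram window and then max-pools over windows.

```python
import numpy as np

def swem_pool(embeddings, mode="aver", window=3):
    """Parameter-free pooling over a sequence of word embeddings.

    embeddings: array of shape (seq_len, dim)
    mode: 'aver' (average), 'max', or 'hier' (hierarchical: average-pool
          each local window of `window` words, then max-pool over windows,
          preserving some n-gram order information).
    """
    if mode == "aver":
        return embeddings.mean(axis=0)
    if mode == "max":
        return embeddings.max(axis=0)
    if mode == "hier":
        n = embeddings.shape[0]
        windows = [embeddings[i:i + window].mean(axis=0)
                   for i in range(max(1, n - window + 1))]
        return np.stack(windows).max(axis=0)
    raise ValueError(f"unknown mode: {mode}")
```

None of these operations introduce trainable parameters, which is the point of the comparison: all modeling capacity lives in the word embeddings themselves.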
Word embeddings are effective intermediate representations for capturing semantic regularities between words when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space as the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings and enjoys a built-in ability to leverage alternative sources of information in addition to input text sequences. Extensive results on several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.
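The core compatibility-attention idea can be sketched as follows. This is a simplified stand-in, not the paper's exact architecture: cosine compatibility between each word and each label, each word scored by its best label match, and a softmax over words producing the attention weights.

```python
import numpy as np

def label_attention(words, labels):
    """words: (seq_len, d) word embeddings; labels: (num_labels, d) label
    embeddings, jointly embedded in the same d-dimensional space.
    Returns the attended text representation and the attention weights."""
    wn = words / np.linalg.norm(words, axis=1, keepdims=True)
    ln = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    G = ln @ wn.T                        # (num_labels, seq_len) cosine compatibility
    m = G.max(axis=0)                    # each word's best label compatibility
    beta = np.exp(m) / np.exp(m).sum()   # softmax attention over words
    z = beta @ words                     # label-attended text representation
    return z, beta
```

Because `beta` lives over words, the weights are directly interpretable: words compatible with some label dominate the sequence representation.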
Semantic hashing has become a powerful paradigm for fast similarity search in many information retrieval systems. While fairly successful, previous techniques generally require two-stage training and handle the binary constraints in an ad hoc manner. In this paper, we present an end-to-end Neural Architecture for Semantic Hashing (NASH), where the binary hashing codes are treated as Bernoulli latent variables. A neural variational inference framework is proposed for training, where gradients are directly backpropagated through the discrete latent variable to optimize the hash function. We also draw connections between the proposed method and rate-distortion theory, which provides a theoretical foundation for the effectiveness of the proposed framework. Experimental results on three public datasets demonstrate that our method significantly outperforms several state-of-the-art models in both unsupervised and supervised scenarios.
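The Bernoulli latent variable at the center of NASH can be sketched as follows. This shows only the sampling step, with assumed names; in training, the discrete sample is used on the forward pass while gradients are passed through the continuous probabilities (a straight-through-style estimator), which this inference-only sketch does not implement.

```python
import numpy as np

def sample_hash_code(logits, rng):
    """Sample binary hash bits from Bernoulli(sigmoid(logits)).

    logits: (num_bits,) real-valued scores from an encoder.
    Returns the hard bits and the underlying probabilities; during
    training, gradients would flow through `probs`, not `bits`."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    bits = (rng.random(probs.shape) < probs).astype(np.float64)
    return bits, probs
```

Treating the code as a stochastic latent variable (rather than thresholding a continuous embedding after the fact) is what lets the hash function be trained end to end.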
GPT-3 (Brown et al., 2020) has attracted considerable attention due to its superior performance across a wide range of NLP tasks, especially its powerful and versatile in-context few-shot learning ability. Despite its success, we found that the empirical results of GPT-3 depend heavily on the choice of in-context examples. In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples (relative to random sampling) that better leverage GPT-3's few-shot capabilities. Inspired by the recent success of leveraging a retrieval module to augment large-scale neural network models, we propose to retrieve examples that are semantically similar to a test sample to formulate its corresponding prompt. Intuitively, the in-context examples selected with such a strategy may serve as more informative inputs to unleash GPT-3's extensive knowledge. We evaluate the proposed approach on several natural language understanding and generation benchmarks, where the retrieval-based prompt selection approach consistently outperforms the random baseline. Moreover, we observe that sentence encoders fine-tuned on task-related datasets yield even more helpful retrieval results. Notably, significant gains are observed on tasks such as table-to-text generation (41.9% on the ToTTo dataset) and open-domain question answering (45.5% on the NQ dataset). We hope our investigation can help explain the behaviors of GPT-3 and large-scale pre-trained LMs in general, and enhance their few-shot capabilities. * Work was done during an internship at Microsoft Dynamics 365 AI.
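The retrieval step described above reduces to a k-nearest-neighbor search in sentence-embedding space. A minimal sketch, assuming embeddings have already been computed by some sentence encoder (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def retrieve_in_context_examples(test_emb, train_embs, k=4):
    """Return indices of the k training examples whose sentence embeddings
    are most cosine-similar to the test sample's embedding.

    test_emb: (d,) embedding of the test sample.
    train_embs: (n, d) embeddings of the candidate in-context examples."""
    t = test_emb / np.linalg.norm(test_emb)
    T = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = T @ t                    # cosine similarity to every candidate
    return np.argsort(-sims)[:k]    # indices of the k closest examples
```

The returned examples would then be concatenated (with the test input) into the few-shot prompt; in practice an approximate-nearest-neighbor index would replace the brute-force scan for large candidate pools.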
We propose a topic-guided variational autoencoder (TGVAE) model for text generation. Distinct from existing variational autoencoder (VAE) based approaches, which assume a simple Gaussian prior for the latent code, our model specifies the prior as a Gaussian mixture model (GMM) parametrized by a neural topic module. Each mixture component corresponds to a latent topic, which provides guidance for generating sentences under that topic. The neural topic module and the VAE-based neural sequence module in our model are learned jointly. In particular, a sequence of invertible Householder transformations is applied to endow the approximate posterior of the latent code with high flexibility during model inference. Experimental results show that our TGVAE outperforms alternative approaches on both unconditional and conditional text generation, and can generate semantically meaningful sentences with various topics.
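The Householder transformations mentioned above can be sketched directly. Each reflection H = I − 2vvᵀ/‖v‖² is invertible with |det H| = 1, so chaining them adds flexibility to the approximate posterior without complicating the density computation. A minimal sketch (the function name is an assumption):

```python
import numpy as np

def householder_flow(z, vs):
    """Apply a sequence of Householder reflections to a latent sample z.

    z: (d,) latent sample; vs: iterable of (d,) reflection vectors.
    Each step computes z - 2 (u . z) u with u = v / ||v||; reflections
    are norm-preserving and self-inverse."""
    for v in vs:
        u = v / np.linalg.norm(v)
        z = z - 2.0 * u * (u @ z)
    return z
```

Because each reflection is volume-preserving, the log-density of the transformed sample needs no Jacobian correction, which is what makes this flow cheap inside variational inference.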
Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, and personalized dialogue systems. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes disentangling textual representations more challenging (e.g., manipulations in the data space cannot be performed easily). Inspired by information theory, we propose a novel method that effectively yields disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure the dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representations in terms of content and style preservation.
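To make the "minimize a mutual information upper bound" idea concrete, here is one sample-based MI upper bound in the spirit of CLUB-style estimators, with a Gaussian variational approximation q(style | content) = N(mu(content), I). This is an illustrative stand-in, not the exact bound derived in the paper.

```python
import numpy as np

def mi_upper_bound(mu, style):
    """Sample-based mutual information upper bound between content and style.

    mu: (N, d) style means predicted from each sample's content.
    style: (N, d) style embeddings of the same N samples.
    Bound = E[log q(style_i|content_i)] - E_i E_j[log q(style_j|content_i)],
    evaluated up to the shared Gaussian normalizing constant."""
    pos = -0.5 * ((style - mu) ** 2).sum(axis=1)              # matched pairs
    neg = -0.5 * ((style[None, :, :] - mu[:, None, :]) ** 2).sum(axis=2)
    return (pos - neg.mean(axis=1)).mean()                    # mean over shuffled pairs
```

When the predictor `mu` carries no information about style (e.g., it is constant), the bound collapses to zero; when content predicts style well, the bound is positive, and minimizing it pushes the two embedding spaces toward independence.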