2015
DOI: 10.48550/arxiv.1506.04089
Preprint

Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences

Abstract: We propose a neural sequence-to-sequence model for direction following, a task that is essential to realizing effective autonomous agents. Our alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) translates natural language instructions to action sequences based upon a representation of the observable world state. We introduce a multi-level aligner that empowers our model to focus on sentence "regions" salient to the current world state by using multiple abstra…
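As a rough illustration of the architecture described in the abstract, the sketch below encodes an instruction with an LSTM, attends over the word annotations, and emits one action per step conditioned on the observable world state. It is a minimal, hypothetical PyTorch sketch: the class name, layer sizes, and the world-state encoding are assumptions for illustration, not the authors' released implementation, and the aligner here is single-level for brevity (the paper's multi-level variant is sketched further below).

```python
import torch
import torch.nn as nn


class InstructionFollower(nn.Module):
    """Hypothetical encoder-decoder: encode the instruction ("listen"),
    attend over word annotations ("attend"), and emit one action per step
    conditioned on the observable world state ("walk")."""

    def __init__(self, vocab_size, world_dim, num_actions, hidden=128):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTMCell(world_dim + 2 * hidden, hidden)
        self.align = nn.Linear(hidden + 2 * hidden, 1)   # single-level aligner for brevity
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, instruction, world_states):
        # instruction: (B, T_words) word ids; world_states: (B, T_steps, world_dim)
        annotations, _ = self.encoder(self.embed(instruction))   # (B, T_words, 2H)
        h = annotations.new_zeros(instruction.size(0), self.hidden)
        c = torch.zeros_like(h)
        actions = []
        for t in range(world_states.size(1)):
            # Score each word annotation against the current decoder state.
            query = h.unsqueeze(1).expand(-1, annotations.size(1), -1)
            weights = torch.softmax(self.align(torch.cat([query, annotations], -1)), dim=1)
            context = (weights * annotations).sum(dim=1)          # (B, 2H)
            h, c = self.decoder(torch.cat([world_states[:, t], context], -1), (h, c))
            actions.append(self.action_head(h))
        return torch.stack(actions, dim=1)   # (B, T_steps, num_actions) action logits
```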

Cited by 54 publications (10 citation statements). References 27 publications.
“…In order to better address the selective generation task, we propose a coarse-to-fine aligner that prevents the model from being distracted by non-salient records. Our model aligns based on multiple abstractions of the input: both the original input record as well as the hidden annotations m_j = (r_j; h_j), an approach that has previously been shown to yield better results than aligning based only on the hidden state (Mei et al., 2015).…”
Section: The Model (mentioning; confidence: 99%)
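The multi-level alignment quoted above can be sketched as follows. This is a minimal illustration under assumed names and shapes (r, h, W, and multi_level_context are all hypothetical); the only point it demonstrates is that alignment scores and the context vector are computed over the concatenation m_j = (r_j; h_j) rather than over the hidden annotation h_j alone.

```python
import torch
import torch.nn.functional as F

def multi_level_context(r, h, query, W):
    """r: (T, D_r) raw input representations, h: (T, D_h) encoder annotations,
    query: (D_q,) decoder state, W: (D_q, D_r + D_h) scoring matrix."""
    m = torch.cat([r, h], dim=-1)      # m_j = (r_j; h_j)
    scores = m @ W.t() @ query         # one score per position j
    alpha = F.softmax(scores, dim=0)   # alignment weights over positions
    return alpha @ m                   # context built from m, not from h alone

# Example with random tensors:
T, D_r, D_h, D_q = 6, 16, 32, 32
ctx = multi_level_context(torch.randn(T, D_r), torch.randn(T, D_h),
                          torch.randn(D_q), torch.randn(D_q, D_r + D_h))
```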
“…This work focuses on exploiting multi-temporal, multi-spectral and spatial information together for improving land cover mapping through the use of RNNs. Recently, RNNs have been demonstrated to achieve significant results on sequential data and have been applied in different fields such as natural language processing [30], [34], [20], computer vision [32], [16], [39], multi-modal learning [22], [11], [15] and robotics [28]. RNNs have been applied to tasks such as language modeling, speech recognition, machine translation, question answering, object recognition, visual tracking, video analysis, image generation, image captioning, video captioning, self-driving cars, fraud detection, predictive modeling, and sentiment classification, among others.…”
Section: Introduction (mentioning; confidence: 99%)
“…Our basic model for RUN is a sequence-to-sequence model similar to the work of Mei et al. (2015) on SAIL, and inspired by Xu et al. (2015). It is based on Conditioned Generation with Attention (CGA).…”
Section: Models for RUN (mentioning; confidence: 99%)
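For context, a single decode step of conditioned generation with attention, the pattern the snippet above attributes to the RUN model, might look roughly like the sketch below: the decoder state is updated from the previous output embedding and an attention-weighted context, then mapped to a distribution over the next symbol. The class name CGAStep, the GRU cell, the bilinear scorer, and all dimensions are assumptions for illustration rather than the cited system's actual design.

```python
import torch
import torch.nn as nn

class CGAStep(nn.Module):
    def __init__(self, emb_dim, enc_dim, hidden, vocab):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim + enc_dim, hidden)
        self.score = nn.Bilinear(hidden, enc_dim, 1)   # hypothetical attention scorer
        self.out = nn.Linear(hidden, vocab)

    def forward(self, prev_emb, state, annotations):
        # prev_emb: (emb_dim,), state: (hidden,), annotations: (T, enc_dim)
        q = state.expand(annotations.size(0), -1)
        alpha = torch.softmax(self.score(q, annotations).squeeze(-1), dim=0)
        context = alpha @ annotations                              # (enc_dim,)
        state = self.cell(torch.cat([prev_emb, context]).unsqueeze(0),
                          state.unsqueeze(0)).squeeze(0)
        return torch.log_softmax(self.out(state), dim=-1), state   # next-symbol log-probs
```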