Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory

Vasudevan, Arun Balajee; Dai, Dengxin; Gool, Luc Van

doi:10.1007/s11263-020-01374-3

Cited by 30 publications

(19 citation statements)

References 65 publications

(98 reference statements)


“…Talk to Nav. Vasudevan et al [148] developed an interactive visual navigation environment based on Google Street View named Talk2Nav dataset with 10,714 routes, and built an effective model to create large-scale navigational instructions over long-range city environments. 16 http://streetlearn.cc Street View.…”

Section: Street View Navigation (mentioning)

confidence: 99%

Vision-Language Navigation: A Survey and Taxonomy

Wu, Cui, Li

2021

Preprint


An agent that can understand natural-language instructions and carry out corresponding actions in the visual world is one of the long-term challenges of Artificial Intelligence (AI). Because human instructions are so varied, the agent must be able to combine natural language with vision and action in unstructured, previously unseen environments. If the instruction given by a human is a navigation task, this challenge is called Visual-and-Language Navigation (VLN). It is a booming multidisciplinary field of increasing importance and extraordinary practicality. Instead of focusing on the details of specific methods, this paper provides a comprehensive survey of VLN tasks and classifies them carefully according to the different characteristics of the language instructions in these tasks. According to when the instructions are given, the tasks can be divided into single-turn and multi-turn. For single-turn tasks, we further divide them into goal-orientation and route-orientation based on whether the instructions contain a route. For multi-turn tasks, we divide them into imperative tasks and interactive tasks based on whether the agent responds to the instructions. This taxonomy enables researchers to better grasp the key point of a specific task and identify directions for future research.


“…There are a large number of variants of the K-means algorithm, including initialization optimization K-means++, distance calculation optimization Elkan K-means algorithm, and optimization Mini Batch K-means algorithm in the case of big data. The deterministic algorithm converts the landmark visual saliency problem into an optimization problem and converts the local pattern matching in the landmark visual saliency of the video sequence into a cost function minimization problem, the most representative deterministic algorithm is the K-means clustering algorithm, and the advantages of the K-means clustering algorithm are fast convergence, being used in the landmark visual saliency of the high frame rate, and being very suitable for the landmark visual saliency analysis of real-time scenes when a considerable number of landmark visual saliency algorithms are based on the improvement of K-means clustering algorithm [8]. However, the K-means clustering method also has its drawbacks; for example, it is difficult to cope with the scale change and shape change of the target in the landmark visual saliency model acquisition process, easy to be influenced by the similar background and the interference of light change, and easy to occur in the building surface clustering in the building surface process.…”

Section: Related Work (mentioning)

confidence: 99%

“…The artificial ant in the ant colony algorithm uses the overall information of the ant colony, and the global update of the residual pheromone is performed only after the completion of an optimization search. The pheromone update formula on each path in the ant colony algorithm is (8), where D(j) min is the intraclass distance when the objective function obtains the minimum value. v(wkj) is the pheromone increment; M is the total amount of pheromone released by ants; and D(j) (0 < D(j) < 1) is the pheromone volatility coefficient.…”

Section: Advances In Civil Engineering (mentioning)

confidence: 99%

Construction of a Visual Saliency Model for Neighborhood Building Landmarks Based on K‐Means Clustering

Qiao

2021

Advances in Civil Engineering


In this paper, firstly, based on the quantitative relationship between K-means clustering and visual saliency of neighborhood building landmarks, the weights occupied by each index of composite visual factors are obtained by using multiple statistical regression methods, and, finally, we try to construct a saliency model of multiple visual index composites and analyze and test the model. As regards decomposition and quantification of visual saliency influencing factors, to describe and quantify these visual significance factors of the landmarks, the significant factors are decomposed into several quantifiable secondary indicators. Considering that the visual saliency of the landmarks in the neighborhood is reflected by the variance of the influencing factors and that the scope of the landmarks is localized, the local outlier detection algorithm is used to solve the variance of the secondary indicators. Since the visual significance of neighborhood building landmarks is influenced by a combination of influencing factors, the overall difference degree of secondary indicators is calculated by K-means clustering. To facilitate the factor calculation, a factor-controlled virtual environment was built to carry out the experimental study of landmark perception and calculate the different degrees of each index of the building. The data of visual indicators of the neighborhood buildings for this experiment were also collected, and the significance values of the neighborhood buildings were calculated. The influence weights of the indicators were obtained by using multiple linear regression analysis, the visual significance model of the landmarks of the neighborhood buildings in the factor-controlled environment was constructed, and the model was analyzed and tested.


“…In recent years, researchers have investigated systems where passengers can give commands to self-driving cars. For instance, (Vasudevan, Dai, and Van Gool 2021;Chen et al 2019) consider navigational commands such as "Take the first left and at the red building turn right. Afterwards, drive to the white building".…”

Section: Introduction (mentioning)

confidence: 99%

Predicting Physical World Destinations for Commands Given to Self-Driving Cars

Grujicic, Deruyttere, Moens, et al.

2022

AAAI


In recent years, we have seen significant steps taken in the development of self-driving cars. Multiple companies are starting to roll out impressive systems that work in a variety of settings. These systems can sometimes give the impression that full self-driving is just around the corner and that we would soon build cars without even a steering wheel. The increase in the level of autonomy and control given to an AI provides an opportunity for new modes of human-vehicle interaction. However, surveys have shown that giving more control to an AI in self-driving cars is accompanied by a degree of uneasiness by passengers. In an attempt to alleviate this issue, recent works have taken a natural language-oriented approach by allowing the passenger to give commands that refer to specific objects in the visual scene. Nevertheless, this is only half the task as the car should also understand the physical destination of the command, which is what we focus on in this paper. We propose an extension in which we annotate the 3D destination that the car needs to reach after executing the given command and evaluate multiple different baselines on predicting this destination location. Additionally, we introduce a model that outperforms the prior works adapted for this particular setting.

