Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Mogadala, Aditya; Kalimuthu, Marimuthu; Klakow, Dietrich

doi:10.1613/jair.1.11688

Cited by 70 publications

(42 citation statements)

References 246 publications

“…Alternatively, the class identifier might be used as additional metadata information, concatenated to the images' internal features representation in the CNN, and fed to a further shallow neural network for improved classification (see [9]). Another idea would be to consider further trends in the integration of vision and language research [8].…”

Section: Discussion (mentioning)

confidence: 99%

A Competitive Deep Neural Network Approach for the ImageCLEFmed Caption 2020 Task

Kalimuthu, Nunnari, Sonntag. 2020. Preprint. Self cite.

The aim of the ImageCLEFmed Caption task is to develop a system that automatically labels radiology images with relevant medical concepts. We describe our Deep Neural Network (DNN) based approach for tackling this problem. On the challenge test set of 3,534 radiology images, our system achieves an F1 score of 0.375 and ranks 12th among all systems that were successfully submitted to the challenge. We rely only on the provided data sources and do not use any external medical knowledge or ontologies, or pretrained models from other medical image repositories or application domains.

“…Experiment 8: more layers In order to increase the overall performance, we tried to increase the number of FC layers to 3x4k (8). However, taking as reference the performance of configuration (3), we could not observe a significant improvement by introducing an additional 4k FC layer to the classification stage.…”

Section: Methods (mentioning)

confidence: 99%

A Competitive Deep Neural Network Approach for the ImageCLEFmed Caption 2020 Task

Kalimuthu, Nunnari, Sonntag. 2020. Preprint. Self cite.

“…Considering that, we will only study instruction following for robotic manipulation in this work. These review papers [17], [18] well describe existing studies about Vision-and-Language Navigation. This section will first review existing symbolic and connectionist methods for human instruction following.…”

Section: A Human Instruction Following (mentioning)

confidence: 99%

SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following

Xu, Chen, Lin, et al. 2022. Preprint.

This paper investigates robot manipulation based on human instruction with ambiguous requests. The intent is to compensate for imperfect natural language via visual observations. Early symbolic methods, based on manually defined symbols, built modular frameworks consisting of semantic parsing and task planning to produce sequences of actions from natural language requests. Modern connectionist methods employ deep neural networks to automatically learn visual and linguistic features and map them to a sequence of low-level actions in an end-to-end fashion. These two approaches are blended to create a hybrid, modular framework: it formulates instruction following as symbolic goal learning via deep neural networks, followed by task planning via symbolic planners. Connectionist and symbolic modules are bridged with the Planning Domain Definition Language. The vision-and-language learning network predicts its goal representation, which is sent to a planner to produce a task-completing action sequence. To improve the flexibility of natural language, we further incorporate implicit human intents with explicit human instructions. To learn generic features for vision and language, we propose to separately pretrain vision and language encoders on scene graph parsing and semantic textual similarity tasks. Benchmarking evaluates the impact of different components of, or options for, the vision-and-language learning model and shows the effectiveness of the pretraining strategies. Manipulation experiments conducted in the simulator AI2THOR show the robustness of the framework to novel scenarios.

“…Such a task was recently formalized adopting data-driven methods [4] with the release of the R2R dataset. In this setup, the VLN task [14] is addressed using Long Short Term Memory (LSTM) networks structured in an encoder-decoder framework. An instruction is encoded first and then decoded as a sequence of actions using the current environment states.…”

Section: Related Work (mentioning)

confidence: 99%

CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation

Magassouba, Sugiura, Kawai. 2021. Preprint.

Navigation guided by natural language instructions is particularly suitable for Domestic Service Robots that interact naturally with users. This task involves predicting a sequence of actions that leads to a specified destination given a natural language navigation instruction. The task thus requires the understanding of instructions, such as "Walk out of the bathroom and wait on the stairs that are on the right". Visual and Language Navigation remains challenging, notably because it requires exploring the environment and accurately following a path specified by the instructions in order to model the relationship between language and vision. To address this, we propose the CrossMap Transformer network, which encodes the linguistic and visual features to sequentially generate a path. The CrossMap Transformer is tied to a Transformer-based speaker that generates navigation instructions. The two networks share common latent features for mutual enhancement through a double back-translation model: generated paths are translated into instructions, while generated instructions are translated into paths. The experimental results show the benefits of our approach in terms of instruction understanding and instruction generation.
