2021
DOI: 10.1613/jair.1.11688

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Abstract: Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this s…

Cited by 70 publications (42 citation statements)
References 246 publications

“…Alternatively, the class identifier might be used as additional metadata information, concatenated to the images' internal features representation in the CNN, and fed to a further shallow neural network for improved classification (see [9]). Another idea would be to consider further trends in the integration of vision and language research [8].…”
Section: Discussion (mentioning)
confidence: 99%
“…Experiment 8: more layers In order to increase the overall performance, we tried to increase the number of FC layers to 3x4k (8). However, taking as reference the performance of configuration (3), we could not observe a significant improvement by introducing an additional 4k FC layer to the classification stage.…”
Section: Methods (mentioning)
confidence: 99%
“…Considering that, we will only study instruction following for robotic manipulation in this work. These review papers [17], [18] well describe existing studies about Vision-and-Language Navigation. This section will first review existing symbolic and connectionist methods for human instruction following.…”
Section: A Human Instruction Following (mentioning)
confidence: 99%
“…Such a task was recently formalized adopting data-driven methods [4] with the release of the R2R dataset. In this setup, the VLN task [14] is addressed using Long Short Term Memory (LSTM) networks structured in an encoder-decoder framework. An instruction is encoded first and then decoded as a sequence of actions using the current environment states.…”
Section: Related Work (mentioning)
confidence: 99%