2023
DOI: 10.3390/drones7020114

Semantic Scene Understanding with Large Language Models on Unmanned Aerial Vehicles

Abstract: Unmanned Aerial Vehicles (UAVs) are able to provide instantaneous visual cues and a high-level data throughput that could be further leveraged to address complex tasks, such as semantically rich scene understanding. In this work, we built on the use of Large Language Models (LLMs) and Visual Language Models (VLMs), together with a state-of-the-art detection pipeline, to provide thorough zero-shot UAV scene literary text descriptions. The generated texts achieve a Gunning Fog median grade level in the range of …
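
The Gunning Fog grade level cited in the abstract is a standard readability score; a higher value roughly corresponds to more years of formal education needed to read the text comfortably. As a minimal illustrative sketch (not the paper's evaluation code), assuming a simple vowel-group heuristic for syllable counting, the score can be computed as follows:

```python
import re

def gunning_fog(text: str) -> float:
    """Approximate Gunning Fog grade level:
    0.4 * (average sentence length + 100 * fraction of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0

    def syllables(word: str) -> int:
        # Crude syllable estimate: count groups of vowels (heuristic assumption).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    # Words with three or more (estimated) syllables count as "complex".
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100.0 * len(complex_words) / len(words))

print(round(gunning_fog(
    "The UAV streams aerial video. A language model then describes "
    "the observed scene in plain text."), 1))
```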

Cited by 26 publications (14 citation statements) | References 14 publications

“…Regarding camouflaged objects, we not only utilize binary ground truth data but also introduce additional training information, such as ranking information and relevant masks [45]. Large Language Models (LLMs) and Visual Language Models (VLMs) [46,47] can provide rich semantic information to models, including object names, context, and more. By integrating this information, we can gain a deeper understanding of the model's decision-making process and interpret the model's understanding and reasoning about scenes, thereby endowing our model with stronger semantic scene understanding capabilities.…”
Section: Discussion
confidence: 99%
“…VLMs excel at generating textual descriptions for images, offering detailed insights into objects, people, and events within images. They can also perform image-related tasks such as classification, object detection, and generating grammatically correct image captions [40].…”
Section: Discussion
confidence: 99%
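
As a concrete illustration of the captioning capability described in this statement, here is a minimal sketch using an off-the-shelf VLM (BLIP, via the Hugging Face transformers API). The specific checkpoint and the input file name are assumptions for illustration only, not the setup used by the cited works.

```python
# Minimal image-captioning sketch with an off-the-shelf VLM (BLIP).
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("frame.jpg").convert("RGB")          # e.g., one UAV video frame (assumed file)
inputs = processor(images=image, return_tensors="pt")   # preprocess image to pixel values
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```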
“…Generative Pre-trained Transformer (GPT) models, built upon a foundational architecture, play a pivotal role in advancing NLP capabilities [33]. These models have significantly transformed the NLP landscape, demonstrating an exceptional ability to capture and comprehend intricate linguistic structures, context, and semantic nuances [34].…”
Section: The Evolution of Generative Pre-trained Transformer (GPT) Mo…
confidence: 99%