MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Li, Mingxing; Zhang, Hao; Xu, Cheng; Yan, Chenyang; Liu, Hongzhe; Li, Xuewei

doi:10.3390/electronics11192999

Cited by 2 publications

(1 citation statement)

References 25 publications

(27 reference statements)

“…Therefore, we propose converting traffic scene keyframes into natural language captions and using richer semantic information can replace detecting individual entities. This approach shows promise for assisting visually impaired individuals [12,23], driving safety [1], and describing traffic accidents [18].…”

Section: Introduction (mentioning)

confidence: 99%

TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip

Zhang, Xu, et al. 2024. ITC

Image captioning in traffic scenes presents several challenges, including imprecise caption generation, lack of personalization, and an unwieldy number of model parameters. We propose a new image captioning model for traffic scenes to address these issues. The model incorporates an adapter-based fine-tuned feature extraction part to enhance personalization and a caption generation module using global weighted attention pooling to reduce model parameters and improve accuracy. The proposed model consists of four main stages. In the first stage, the Image-Encoder extracts the global features of the input image and divides it into nine sub-regions, encoding each sub-region separately. In the second stage, the Text-Encoder encodes the text dataset to obtain text features. It then calculates the similarity between the image sub-region features and encoded text features, selecting the text features with the highest similarity. Subsequently, the pre-trained Faster RCNN model extracts local image features. The model then splices together the text features, global image features, and local image features to fuse the multimodal information. In the final stage, the extracted features are fed into the Captioning model, which effectively fuses the different features using a novel global weighted attention pooling layer. The Captioning model then generates natural language image captions. The proposed model is evaluated on the MS-COCO dataset, Flickr 30K dataset, and BUUISE-Image dataset, using mainstream evaluation metrics. Experiments demonstrate significant improvements across all evaluation metrics on the public datasets and strong performance on the BUUISE-Image traffic scene dataset.
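The first two stages described in the abstract above (splitting the image into nine sub-regions, then picking the text feature most similar to each sub-region) can be illustrated with a minimal pure-Python sketch. It is not the TSIC-CLIP implementation: the 3x3 grid layout, the use of cosine similarity, and the function names `split_into_grid`, `cosine_similarity`, and `select_best_text` are assumptions made for illustration, and the toy feature vectors stand in for real encoder outputs.

```python
import math


def split_into_grid(image, rows=3, cols=3):
    """Split a 2D image (list of pixel rows) into rows*cols sub-regions.

    Assumption: the nine sub-regions form a 3x3 grid, returned in row-major order.
    """
    height, width = len(image), len(image[0])
    regions = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            regions.append([row[left:right] for row in image[top:bottom]])
    return regions


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_text(region_features, text_features):
    """For each region feature, return the index of the most similar text feature."""
    return [
        max(
            range(len(text_features)),
            key=lambda j: cosine_similarity(region, text_features[j]),
        )
        for region in region_features
    ]


# Toy usage: a 6x6 "image" and hand-written 2-D feature vectors.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
regions = split_into_grid(image)
region_features = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
text_features = [[0.9, 0.1], [0.1, 0.9]]
best = select_best_text(region_features, text_features)
```

In a real pipeline the feature vectors would come from the image and text encoders, and the selected text features would then be concatenated with the global and local image features before captioning.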

Intelligent Mining Road Object Detection Based on Multiscale Feature Fusion in Multi-UAV Networks

Zhao et al. 2023. Drones

In complex mining environments, driverless mining trucks are required to cooperate with multiple intelligent systems. They must perform obstacle avoidance based on factors such as the site road width, obstacle type, vehicle body movement state, and ground concavity-convexity. Targeting the open-pit mining area, this paper proposes an intelligent mining road object detection (IMOD) model developed using a 5G-multi-UAV and a deep learning approach. The IMOD model employs data sensors to monitor surface data in real time within a multisystem collaborative 5G network. The model transmits data to various intelligent systems and edge devices in real time, and the driverless mining truck constructs the driving area on the fly. The IMOD model utilizes a convolutional neural network to identify obstacles in front of driverless mining trucks in real time, optimizing multisystem collaborative control and driverless mining truck scheduling based on obstacle data. Multiple systems cooperate to maneuver around obstacles, including avoiding static obstacles, such as standing and lying dummies, empty oil drums, and vehicles; continuously avoiding multiple obstacles; and avoiding dynamic obstacles such as walking people and moving vehicles. For this study, we independently collected and constructed an obstacle image dataset specific to the mining area, and experimental tests and analyses reveal that the IMOD model maintains a smooth route and stable vehicle movement attitude, ensuring the safety of driverless mining trucks as well as of personnel and equipment in the mining area. The ablation and robustness experiments demonstrate that the IMOD model outperforms the unmodified YOLOv5 model, with an average improvement of approximately 9.4% across multiple performance measures. Additionally, compared with other algorithms, this model shows significant performance improvements.
