Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge

Dognin, Pierre; Melnyk, Igor; Mroueh, Youssef; Padhi, Inkit; Rigotti, Mattia; Ross, Jarret; Schiff, Yair; Young, Richard A.; Belgodere, Brian

doi:10.1613/jair.1.13113

Cited by 13 publications (7 citation statements)

References 41 publications (51 reference statements)


“…Here, all unseen models are considered as one class. In our evaluation, we treat DALL·E 2 as one unseen model (as mentioned before). We first divide the datasets into training, validation, and testing parts.…”

Section: Discussion (mentioning)

confidence: 99%

“…Here, we explore whether the quality of the BLIP-generated prompts affects the detection performance. To measure the quality of the generated prompts by BLIP, we leverage a new term called prompt descriptiveness [9,10,23,35]. Prompt descriptiveness can be quantitatively measured by computing the cosine similarity between a prompt's embedding and its image's embedding generated by CLIP.…”

Section: Ablation Study (mentioning)

confidence: 99%
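The cosine-similarity measure described in the statement above can be sketched as follows. This is a minimal illustration, not the citing paper's implementation: the function names `cosine_similarity` and `prompt_descriptiveness` are hypothetical, and the vectors in the usage note are placeholders standing in for CLIP text and image embeddings, which would have to come from an actual CLIP model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def prompt_descriptiveness(prompt_embedding, image_embedding):
    """Hypothetical helper: score a prompt against its image.

    Both arguments are assumed to be embeddings from the same CLIP model,
    so that they live in a shared vector space.
    """
    return cosine_similarity(prompt_embedding, image_embedding)
```

With placeholder vectors, `prompt_descriptiveness([1.0, 0.0], [0.0, 1.0])` is 0 for orthogonal embeddings, and the score approaches 1 as the two embeddings point in the same direction. Under this measure, a higher score means the prompt describes its image more closely.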


DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

Sha, Li, et al. 2023

Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security


Text-to-image generation models that generate images based on prompt descriptions have attracted an increasing amount of attention during the past few months. Despite their encouraging performance, these models raise concerns about the misuse of their generated fake images. To tackle this problem, we pioneer a systematic study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we first build a machine learning classifier to detect the fake images generated by various text-to-image generation models. We then attribute these fake images to their source models, such that model owners can be held responsible for their models' misuse. We further investigate how prompts that generate fake images affect detection and attribution. We conduct extensive experiments on four popular text-to-image generation models, including DALL·E 2, Stable Diffusion, GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical results show that (1) fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models; (2) fake images can be effectively attributed to their source models, as different models leave unique fingerprints in their generated images; (3) prompts with the "person" topic or a length between 25 and 75 enable models to generate fake images with higher authenticity. All findings contribute to the community's insight into the threats caused by text-to-image generation models. We appeal to the community's consideration of the counterpart solutions, like ours, against the rapidly-evolving fake image generation.


“…This technological advancement serves to connect visual and textual data, so enabling deeper understanding of image content and creating opportunities for diverse applications [1]. The field of image captioning is gaining considerable interest owing to its capacity to boost image accessibility, assist individuals with visual impairments, automate content creation, and enhance image retrieval systems [2]. Especially in video summarization, image caption generation is a potent tool with uses that go beyond individual images.…”

Section: Introduction (mentioning)

confidence: 99%

Image Caption Generation using Deep Learning For Video Summarization Applications

Inayathulla, 2024

IJACSA


In the area of video summarization applications, automatic image caption synthesis using deep learning is a promising approach. This methodology utilizes the capabilities of neural networks to autonomously produce detailed textual descriptions for significant frames or instances in a video. Through the examination of visual elements, deep learning models possess the capability to discern and classify objects, scenarios, and actions, hence enabling the generation of coherent and useful captions. This paper presents a novel methodology for generating image captions in the context of video summarization applications. The DenseNet201 architecture is used to extract image features, enabling the effective extraction of comprehensive visual information from keyframes in the videos. In text processing, GloVe embeddings, pre-trained word vectors that capture semantic associations between words, are employed to efficiently represent textual information. The utilization of these embeddings establishes a fundamental basis for comprehending the contextual variations and semantic significance of words contained within the captions. LSTM models are subsequently utilized to process the GloVe embeddings, facilitating the development of captions that maintain coherence, context, and readability. The integration of GloVe embeddings with LSTM models in this study facilitates the effective fusion of visual and textual data, leading to the generation of captions that are both informative and contextually relevant for video summarization. The proposed model significantly enhances the performance by combining the strengths of convolutional neural networks for image analysis and recurrent neural networks for natural language generation. The experimental results demonstrate the effectiveness of the proposed approach in generating informative captions for video summarization, offering a valuable tool for content understanding, retrieval, and recommendation.


“…With the continuous development of computer vision technology, sports video analysis technology has been widely used in the event analysis of sports competitions. It can provide athletes and coaches with corresponding data as a reference through video analysis and make a relatively systematic evaluation of individual athletes' and groups' performance in sports competitions [1]. In recent years, the number of sports videos has increased geometrically, and at the same time, there is a large amount of interference information in the huge amount of sports videos [2].…”

Section: Introduction (mentioning)

confidence: 99%

Application of Human Posture Recognition Based on the Convolutional Neural Network in Physical Training Guidance

Wang, 2022

Computational Intelligence and Neuroscience


The application of sports game video analysis in athlete training and competition analysis feedback has attracted extensive attention, but traditional human body posture estimation methods show a large error between the estimated and actual posture of an athlete in complex environments and when the athlete's body parts are occluded. Therefore, this study proposes a convolutional neural network for athlete pose estimation in sports game video. Based on the improved model, multiscale model, and large perception model, a superimposed hourglass network is constructed, and the vanishing gradient problem of the convolutional neural network is solved using intermediate supervision. The experimental results show that the athlete pose estimation model based on the convolutional neural network can improve the accuracy of athlete pose estimation and reduce the negative impact of occlusion on athlete pose estimation to a certain extent. In addition, compared with other methods for estimating athletes' standing posture, the model has competitive advantages and high accuracy under widely used standard conditions.
