Proceedings of the 16th International Natural Language Generation Conference 2023
DOI: 10.18653/v1/2023.inlg-main.21

HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales

Michele Cafagna, Kees van Deemter, Albert Gatt

Abstract: Current captioning datasets focus on object-centric captions, describing the visible objects in the image, e.g. "people eating food in a park". Although these datasets are useful for evaluating the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict (…)

Cited by 1 publication · References 29 publications