Manuele Barraco scite author profile

Manuele Barraco

4 Publications

27 Citation Statements Received

267 Citation Statements Given

Affiliations

University of Modena and Reggio Emilia, Ferrari (Italy)

Publications

The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis

Barraco, Cornia, Cascianelli, et al. 2022

Generating textual descriptions from visual inputs is a fundamental step towards machine intelligence, as it entails modeling the connections between the visual and textual modalities. For years, image captioning models have relied on pre-trained visual encoders and object detectors, trained on relatively small sets of data. Recently, it has been observed that large-scale multi-modal approaches like CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, provide a strong zero-shot capability on various vision tasks. In this paper, we study the advantage brought by CLIP in image captioning, employing it as a visual encoder. Through extensive experiments, we show how CLIP can significantly outperform widely-used visual encoders and quantify its role under different architectures, variants, and evaluation protocols, ranging from classical captioning performance to zero-shot transfer.

CaMEL: Mean Teacher Learning for Image Captioning

Barraco, Stefanini, Cornia, et al. 2022

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Moratelli, Barraco, Morelli, et al. 2023

Sensors

Research related to fashion and e-commerce domains is gaining attention in computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently-proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed with the integration of external textual memory that could be accessed through k-nearest neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method constantly outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
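The retrieval-and-gating idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the memory layout, the cosine similarity, and the scalar sigmoid gate (a simplified stand-in for the fully attentive gate the abstract mentions) are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_retrieve(query, memory, k):
    """Return the k memory entries whose keys are most similar to the query.

    `memory` is a list of (key_vector, text) pairs (hypothetical layout).
    """
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]


def gated_mix(hidden, retrieved, gate_logit):
    """Blend model features with retrieved features through a sigmoid gate.

    A gate near 0 keeps the model's own features; a gate near 1 favors
    the retrieved ones. Assumption: one scalar gate for the whole vector.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [(1.0 - gate) * h + gate * r for h, r in zip(hidden, retrieved)]


memory = [
    ([1.0, 0.0], "red dress"),
    ([0.0, 1.0], "blue jeans"),
    ([0.9, 0.1], "crimson gown"),
]
neighbors = knn_retrieve([1.0, 0.0], memory, k=2)
```

In this toy setup the two nearest entries to the query are the ones whose keys point in nearly the same direction, and the gate decides how much of the retrieved signal reaches the output.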

CaMEL: Mean Teacher Learning for Image Captioning

Barraco, Stefanini, Cornia, et al. 2022

Preprint

Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay between the two language models follows a mean teacher learning paradigm with knowledge distillation. Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors. When comparing with existing proposals, we demonstrate that our model provides state-of-the-art caption quality with a significantly reduced number of parameters. According to the CIDEr metric, we obtain a new state of the art on COCO when training without using external data. The source code and trained models are publicly available at: https://github.com/aimagelab/camel.
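For readers unfamiliar with the mean teacher paradigm named in this abstract, the following is a generic sketch rather than CaMEL's actual training code (which lives in the linked repository). It assumes the common formulation in which the teacher's weights are an exponential moving average of the student's weights and the student is trained to match the teacher's soft predictions; the function names, the decay value, and the cross-entropy form of the distillation loss are illustrative assumptions.

```python
import math


def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's soft targets."""
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    # Clamp to avoid log(0) if a probability underflows.
    return -sum(
        t * math.log(max(s, 1e-12))
        for s, t in zip(student_probs, teacher_probs)
    )
```

In a training loop under these assumptions, the student would be updated by gradient descent on its captioning loss plus the distillation term, and `ema_update` would be called after each step to refresh the teacher.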
