2021
DOI: 10.1609/aaai.v35i3.16328

Dual-level Collaborative Transformer for Image Captioning

Abstract: Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processe…
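
The abstract describes fusing detector-based region features (object-level semantics) with CNN grid features (context and fine-grained detail). As a rough illustration of that general idea, and not the paper's actual DLCT implementation, the sketch below applies bidirectional cross-attention between the two feature levels in PyTorch; the module names, dimensions, and the final concatenation step are all assumptions.

# Illustrative sketch only: bidirectional cross-attention between region
# features (from an object detector) and grid features (from a CNN map).
# Names, dimensions, and the concatenation are hypothetical choices,
# not the paper's DLCT code.
import torch
import torch.nn as nn

class DualLevelFusion(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.region_to_grid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.grid_to_region = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_r = nn.LayerNorm(d_model)
        self.norm_g = nn.LayerNorm(d_model)

    def forward(self, regions, grids):
        # regions: (B, N_r, d) region features; grids: (B, N_g, d) grid features
        r, _ = self.region_to_grid(regions, grids, grids)    # regions gain context
        g, _ = self.grid_to_region(grids, regions, regions)  # grids gain object detail
        regions = self.norm_r(regions + r)
        grids = self.norm_g(grids + g)
        return torch.cat([regions, grids], dim=1)            # joint visual memory

fusion = DualLevelFusion()
memory = fusion(torch.randn(2, 36, 512), torch.randn(2, 49, 512))  # (2, 85, 512)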

Cited by 191 publications (79 citation statements)
References 23 publications

“…Earlier Stage (two-stage model):

Method       Backbone     CIDEr
RSTNet [52]  ResNeXt-101  133.3
RSTNet [52]  ResNeXt-152  135.6
DLCT [25]    ResNeXt-101  133.8
…”
Section: Methods, Backbone, CIDEr
confidence: 99%
“…To represent the visual input, CNN-based solutions have been proposed for extracting global features [18,36] or grids of features [26,48], later improved through object detectors [2,27] that yield region-based feature representations, and through self-attention. As for the language model, earlier works implemented it as a recurrent neural network [15,18,20,36], while more recent approaches employ Transformer-based fully-attentive models [5,8,28,56]. The success of this latter strategy has also encouraged multi-modal early-fusion strategies [14,22,54], which proved the effectiveness of building a semantic representation of the image by also exploiting the text at the early stages of the captioning pipeline.…”
Section: Related Work
confidence: 99%
“…To this end, image representation plays a key role, making this aspect of great interest to the community working on image captioning and, in general, on tasks connecting vision and language. For years, image captioning approaches have relied on visual representations based on detected visual entities [2,27], among which relations have been modeled via graphs [49,51] or attention mechanisms [6,8,28,31].…”
Section: Introduction
confidence: 99%
“…A CNN-based decoder has also been explored by Aneja et al. (2018), showing on-par performance while being easier to train (e.g., better training efficiency and less susceptibility to vanishing gradients) than the prominent LSTM design. More recently, Transformer-based decoders (Herdade et al, 2019; Li et al, 2019b; Cornia et al, 2020; Luo et al, 2021b) have become the most popular design choice. • Multimodal fusion.…”
Section: Similar Trends in Captioning and Retrieval Models
confidence: 99%
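
The quoted passage notes that Transformer-based decoders have become the dominant language-model design in captioning. As a hedged sketch of that generic design, not any cited paper's implementation, the PyTorch fragment below decodes a caption autoregressively over visual features; the vocabulary size, <bos> token id, and greedy loop are illustrative assumptions.

# Generic Transformer caption decoder, sketched for illustration: causal
# self-attention over the caption prefix plus cross-attention to visual
# features. Vocabulary size, <bos> id, and the greedy loop are assumptions.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, n_heads=8,
                 n_layers=3, max_len=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        # tokens: (B, T) caption prefix; memory: (B, S, d) visual features
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        causal = torch.triu(torch.full((T, T), float('-inf'),
                                       device=tokens.device), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)  # masked self-attn + cross-attn
        return self.out(h)                            # (B, T, vocab_size) logits

decoder = CaptionDecoder()
memory = torch.randn(2, 85, 512)             # stand-in fused visual features
tokens = torch.ones(2, 1, dtype=torch.long)  # hypothetical <bos> id = 1
for _ in range(19):                          # greedy decoding
    nxt = decoder(tokens, memory)[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, nxt], dim=1)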