2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021
DOI: 10.1109/cvpr46437.2021.01521

RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

Cited by 164 publications (126 citation statements) | References 23 publications
“…Adding the mesh-like connectivity to the decoder further improves the results to 140.6 CIDEr points. This represents an increase of 5.0 CIDEr points with respect to the current state of the art when training on the COCO dataset exclusively [44]. Further, in Fig.…”
Section: Comparison With the State Of The Artmentioning
confidence: 79%
“…Language model. Although RNN-based language models have been the standard strategy for caption generation, convolutional language models [40] and fully-attentive language models [14], [41], [42], [43], [44] based on the Transformer paradigm [45] have also been explored for image captioning, motivated by the success of these approaches on Natural Language Processing tasks such as machine translation and language understanding [45], [46], [47]. Moreover, the introduction of Transformer-based language models has led to the development of effective variants and modifications of the self-attention operator [7], [11], [12], [13], [48], [49], [8] and has enabled vision-and-language early fusion [19], [22], [50] based on BERT-like architectures [46].…”
Section: Related Workmentioning
confidence: 99%
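The statement above refers to variants of the self-attention operator in Transformer-based captioning models. As a rough illustration only (not any cited paper's implementation), the standard scaled dot-product self-attention that those variants build on can be sketched as follows; all shapes and names here are assumptions for the sketch:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence.

    x: (seq_len, d_model) input features; w_q/w_k/w_v: (d_model, d_k)
    projection matrices. This is the plain operator from the original
    Transformer; the cited captioning works propose modifications of it.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # weighted sum of values

# Illustrative usage with random features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 8)
```

Each output row is a convex combination of the value vectors, with mixing weights computed from query-key similarity.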
“…Cornia et al. [3] proposed the Meshed-Memory Transformer model, which includes a multi-layer encoder for region features and a multi-layer decoder that generates the output sentence; a mesh-like structure connects the encoding and decoding layers to exploit both low-level and high-level contributions. The exploration of the self-attention mechanism for image captioning remains an active direction, and many studies improve performance on this problem along these lines [4,5,20]. On the other hand, some studies observed that embedding visual content alone is not enough, and attempted to incorporate semantic features such as named entities or attributes of the relationships.…”
Section: ) Previous Approachesmentioning
confidence: 99%
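The mesh-like connectivity mentioned above lets each decoder layer attend to the output of every encoder layer, rather than only the last one. A minimal sketch of that idea, under assumed shapes and with a simplified scalar gating in place of the paper's exact parameterization:

```python
import numpy as np

def meshed_cross_attention(dec_state, enc_layers, gates):
    """Mesh-like cross-attention (in the spirit of the Meshed-Memory
    Transformer): the decoder state cross-attends to the output of
    *every* encoder layer, and the per-layer results are mixed with
    gate weights. The scalar gates here are an illustrative
    simplification, not the paper's learned gating.

    dec_state: (t, d) decoder states; enc_layers: list of (n, d)
    encoder-layer outputs; gates: one scalar weight per encoder layer.
    """
    fused = np.zeros_like(dec_state)
    for enc_out, g in zip(enc_layers, gates):
        scores = dec_state @ enc_out.T / np.sqrt(enc_out.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # attention over regions
        fused += g * (w @ enc_out)                   # gated layer contribution
    return fused

# Illustrative usage: 3 encoder layers, random features.
rng = np.random.default_rng(1)
dec = rng.normal(size=(4, 8))
encs = [rng.normal(size=(6, 8)) for _ in range(3)]
fused = meshed_cross_attention(dec, encs, [0.5, 0.3, 0.2])
print(fused.shape)  # (4, 8)
```

Summing gated contributions from all encoder layers is what lets both low-level and high-level encodings reach the decoder.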
“…This is an extremely challenging task because traditional captioning models are not adapted to the Text-based Image Captioning problem, where the generated caption must be conditioned on scene text. However, previous models utilized only visual entities [2,3,4] or the global semantics of the image [5], completely ignoring scene text.…”
Section: Introductionmentioning
confidence: 99%