Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning

Dong, Xinzhi; Long, Chengjiang; Xu, Wenju; Xiao, Chunxia

doi:10.1145/3474085.3475439

Cited by 57 publications

(18 citation statements)

References 51 publications

“…Dong et al. [63] proposed dual graph convolutional networks (Dual‐GCN) with transformer and curriculum learning to explore the contextual relevance between contextual images for image captioning, see Figure 9. Two independent GCNs encode the entire image and the objects from the image, and then the captions are generated by a Transformer linguistic decoder.…”

Section: The Recent Deep Learning Methods (mentioning)

confidence: 99%

A thorough review of models, evaluation metrics, and datasets on image captioning

Luo, Cheng, Chao, et al. 2021

IET Image Processing

Image captioning means automatically generating descriptive sentences from a query image. It has recently received widespread attention from the computer vision and natural language processing communities as an emerging visual task. Currently, both components have evolved considerably by exploiting object regions, attributes, attention mechanisms, entity recognition with novelties, and training strategies. However, despite the impressive results, the research has not yet come to a conclusive answer. This survey aims to provide a comprehensive overview of image captioning methods, from technical architectures to benchmark datasets, evaluation metrics, and comparison of state-of-the-art methods. In particular, image captioning methods are divided into different categories based on the technique adopted. Representative methods in each class are summarized, and their advantages and limitations are discussed. Moreover, many related state-of-the-art studies were quantitatively compared to determine the recent trends and future directions in image captioning. The ultimate goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions in the area of image captioning from which the Computer Vision and Natural Language Processing communities may benefit.

“…Transformer [11,44] has also been adapted to tackle the problem of human motion prediction [1,4]. Similar to GCN, the self-attention mechanism of Transformer can compute pairwise relations of joints.…”

Section: Related Work (mentioning)

confidence: 99%

Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction

Ma, Nie, Long, et al. 2022

Preprint

Self Cite

This paper presents a high-quality human motion prediction method that accurately predicts future human poses given observed ones. Our method is based on the observation that a good "initial guess" of the future poses is very helpful in improving the forecasting accuracy. This motivates us to propose a novel two-stage prediction framework, including an init-prediction network that computes a good guess and a formal-prediction network that predicts the target future poses based on the guess. More importantly, we extend this idea further and design a multi-stage prediction framework where each stage predicts an initial guess for the next stage, which brings more performance gain. To fulfill the prediction task at each stage, we propose a network comprising Spatial Dense Graph Convolutional Networks (S-DGCN) and Temporal Dense Graph Convolutional Networks (T-DGCN). Alternately executing the two networks helps extract spatiotemporal features over the global receptive field of the whole pose sequence. Together, these design choices make our method outperform previous approaches by large margins: 6%-7% on Human3.6M, 5%-10% on CMU-MoCap, and 13%-16% on 3DPW. Code is available at https://github.com/705062791/PGBIG.

“…Graph Convolution Network (GCN). Due to the higher representation power of graph structure, GCN has demonstrated superior performance in several tasks, including image caption [8], text to image and human pose estimation [4]. In 3D computer vision, Wald et al [40] proposed the first learning method that generated a semantic scene graph from a 3D point cloud.…”

Section: Related Work (mentioning)

confidence: 99%

DGECN: A Depth-Guided Edge Convolutional Network for End-to-End 6D Pose Estimation

Cao, Luo, Fu, et al. 2022

Preprint

Self Cite

Monocular 6D pose estimation is a fundamental task in computer vision. Existing works often adopt a two-stage pipeline by establishing correspondences and utilizing a RANSAC algorithm to calculate the 6 degrees-of-freedom (6DoF) pose. Recent works try to integrate differentiable RANSAC algorithms to achieve end-to-end 6D pose estimation. However, most of them hardly consider the geometric features in 3D space, and ignore the topology cues when performing differentiable RANSAC algorithms. To this end, we propose a Depth-Guided Edge Convolutional Network (DGECN) for the 6D pose estimation task. We have made efforts from the following three aspects: 1) We take advantage of estimated depth information to guide both the correspondence-extraction process and the cascaded differentiable RANSAC algorithm with geometric information. 2) We leverage the uncertainty of the estimated depth map to improve the accuracy and robustness of the output 6D pose. 3) We propose a differentiable Perspective-n-Point (PnP) algorithm via edge convolution to explore the topology relations between 2D-3D correspondences. Experiments demonstrate that our proposed network outperforms current works on both effectiveness and efficiency.
