Measuring Diversity of Image Captions (arXiv:1903.12020v3 [cs.CV], 15 May 2019)

Recently, state-of-the-art models for image captioning have overtaken human performance on the most popular metrics, such as BLEU, METEOR, ROUGE, and CIDEr. Does this mean we have solved the task of image captioning? These metrics only measure the similarity of the generated caption to the human annotations, which reflects its accuracy. However, an image contains many concepts and multiple levels of detail, so there is a variety of captions that express different concepts and details, each of which might interest different people. Evaluating accuracy alone is therefore not sufficient for measuring the performance of captioning models; the diversity of the generated captions should also be considered. In this paper, we propose a new metric for measuring the diversity of image captions, which is derived from latent semantic analysis (LSA) and kernelized to use CIDEr similarity. We conduct extensive experiments to re-evaluate recent captioning models in terms of both diversity and accuracy. We find that there is still a large gap between model and human performance on both dimensions, and that models optimized for accuracy (CIDEr) have low diversity. We also show that balancing the cross-entropy loss and the CIDEr reward in reinforcement learning during training can effectively control the tradeoff between the diversity and accuracy of the generated captions.

Currently, the widely used metrics, such as BLEU, CIDEr, and SPICE, score a single caption prediction. To evaluate a set of captions $C = \{c_1, c_2, \dots, c_m\}$, two dimensions are required: accuracy and diversity. For accuracy,
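As a concrete illustration of the kernelized LSA diversity described above, the sketch below scores a caption set from its matrix of pairwise CIDEr similarities. The function name, the square-root treatment of the kernel eigenvalues, and the log-m normalization are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def kernelized_diversity(K: np.ndarray) -> float:
    """Diversity of a caption set from its CIDEr kernel matrix.

    K is the m x m matrix of pairwise CIDEr similarities between the
    captions in C = {c_1, ..., c_m}; it is assumed symmetric and
    positive semi-definite, so it plays the role of the Gram matrix
    in latent semantic analysis.
    """
    m = K.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    singular = np.sqrt(eigvals)            # singular values of the LSA matrix
    r = singular.max() / singular.sum()    # spectral concentration, in [1/m, 1]
    # r = 1 when all captions are identical (rank-1 kernel), and
    # r = 1/m when they are mutually dissimilar; -log(r)/log(m)
    # maps these extremes to diversity 0 and 1 respectively.
    return float(-np.log(r) / np.log(m))
```

Under this reading, m identical captions score 0 while m mutually orthogonal captions score 1, so the score can sit alongside CIDEr accuracy to expose the accuracy/diversity tradeoff the paper studies.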
Attention modules connecting the encoder and decoder have been widely applied in object recognition, image captioning, visual question answering, and neural machine translation, and significantly improve performance. In this paper, we propose a bottom-up gated hierarchical attention (GHA) mechanism for image captioning. Our model employs a CNN as the decoder, which learns different concepts at different layers, and these different concepts correspond to different areas of an image. We therefore develop GHA, in which low-level concepts are merged into high-level concepts while, simultaneously, low-level attended features are passed to the top to make predictions. GHA significantly improves over a model that applies only one level of attention; e.g., the CIDEr score increases from 0.923 to 0.999, which is comparable to state-of-the-art models that employ attribute boosting and reinforcement learning (RL). We also conduct extensive experiments to analyze the CNN decoder and the proposed GHA, finding that deeper decoders do not obtain better performance and that, as the convolutional decoder becomes deeper, the model is more likely to collapse during training. Code is available at: https://github.com/qingzwang/GHA-ImageCaptioning.

Keywords: Hierarchical Attention · Image Captioning · Convolutional Decoder.

Recently, CNNs have become the most popular vision modules, e.g., VGG nets [33], GoogLeNets [35], and residual nets [14] (in this paper, we call them Image-CNNs). It is believed that introducing more information benefits performance, and hence some models employ object detection or transform image features into attributes to obtain more detailed or semantic information about an image [2,9,46,42,45,11]. However, applying object detection or attributes boosting
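To make the mechanism concrete, here is a minimal sketch of one bottom-up GHA step in PyTorch: attention is computed over image regions at a low level, and a learned sigmoid gate decides how much of the newly attended context is merged with the context passed up from the layer below. All module and tensor names, the dimensions, and the additive attention form are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedHierarchicalAttention(nn.Module):
    """One bottom-up GHA step (sketch, hypothetical names/shapes)."""

    def __init__(self, feat_dim: int, hid_dim: int):
        super().__init__()
        self.att = nn.Linear(feat_dim + hid_dim, 1)          # additive attention score
        self.gate = nn.Linear(feat_dim + hid_dim, feat_dim)  # merge gate

    def forward(self, regions, h_low, ctx_low):
        # regions: (B, R, feat_dim) image region features
        # h_low:   (B, hid_dim) hidden state of the lower decoder layer
        # ctx_low: (B, feat_dim) attended context passed up from below
        B, R, _ = regions.shape
        q = h_low.unsqueeze(1).expand(B, R, -1)
        scores = self.att(torch.cat([regions, q], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                  # (B, R) attention weights
        ctx = (alpha.unsqueeze(-1) * regions).sum(dim=1)   # (B, feat_dim) context
        # The gate controls how much low-level attended context
        # is merged upward versus replaced by this level's context.
        g = torch.sigmoid(self.gate(torch.cat([ctx, h_low], dim=-1)))
        return g * ctx + (1.0 - g) * ctx_low
```

Stacking one such block per decoder layer realizes the bottom-up flow described above: each level re-attends conditioned on its own hidden state, while the gate controls how much low-level attended information survives to the top-level prediction.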