“…Scene graph generation (SGG) [13], which is a visual task to detect objects and recognize semantic relationships between different objects in an image, can serve as a powerful structural representation of images and benefit other high-level Vision-and-Language tasks such as image generation [12,33,40], image retrieval [13,24,28,34], visual question answering [6,18,39] and image captioning [7,19,38]. Taking advantage of the remarkable feature representations of convolutional neural networks (CNNs) [16] and diverse contextual feature fusion strategies (e.g., message passing [20,37], lstm [41]), a variety of methods have made significant progress to improve the recall evaluation metric performance of SGG tasks.…”