“…Booted by the development of Deep Learning, letting the computer understand an image seems to be increasingly closer. With the research on object detection gradually becoming mature [37,36,32,24,22,23], increasingly more researchers put their attention on higher-level understanding of the scene [21,51,2,48,49,46,9,6,7,47]. As an intermediate level task connecting the image caption and object detection, visual relationship/phrase detection is gaining more attention in scene understanding [33,41,3].…”