“…The task of recognizing objects and their relationships has been investigated by numerous studies in various forms. This includes detection of human-object interactions [7,3], localization of proposals from natural language expressions [12], and the more general tasks of visual relationship detection [17,25,38,5,19,37,34,41] and scene graph generation [33,18,35,22].…”