“…In addition to the cropped instance features, previous methods leverage combined spatial features [3,12,10,14,9,16,46], union box features [34,39], or context features [10,40,30] to improve the accuracy of HOI detection. In order to concentrate on more interaction-relevant features, some methods utilize extra features, such as human pose [37,5,24,14], human parts [47,39,23] and language features [42,9,30,21]. However, the serial architectures of such two-stage methods impair the efficiency of HOI detection.…”