“…Hierarchical/Graphical Models in Computer Vision: Hierarchical/graphical models are powerful tools for building structured representations that reflect task-specific relations and constraints. From early distributional semantic models, part-based models [16,17], MRF/CRF [31], and And-Or grammar models [59] to deep structural networks [30,15], graph neural networks [20], and trainable CRFs [79], hierarchical/graphical models have been applied to a wide variety of core computer vision tasks, such as object recognition [55], human parsing [40,41,81], pose estimation [34,66,61,68,35], and visual dialog, to the extent that they are now ubiquitous in the field. Inspired by their general success, we leverage structural information to design our approach.…”