“…Similarly, [7] proposes a lighter version of [61] by encoding the 3D shapes into graphs using node embeddings [15]. However, these multi-modal methods are limited because 3D shapes are often unavailable at test time [43,57,58]. The other category is the image-based methods [13,37,59,61,63,66,68], in which only images are exploited for pose estimation. [13,37] regard the corners of the 3D bounding box as generic keypoints, and thus focus only on cubic objects with simple geometric shapes.…”