2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023
DOI: 10.1109/wacv56688.2023.00014
Composite Relationship Fields with Transformers for Scene Graph Generation

Abstract: Scene graph generation (SGG) methods extract relationships between objects. While most methods focus on improving top-down approaches, which build a scene graph from objects detected by an off-the-shelf object detector, there is limited work on bottom-up approaches, which jointly detect objects and their relationships in a single stage. In this work, we present a novel bottom-up SGG approach that represents relationships using Composite Relationship Fields (CoRF). CoRF turns relationship detec…

Cited by 2 publications (9 citation statements)
References 90 publications (114 reference statements)
“…Instead of pooling features of various shapes, extracting features at multiple pixels is much faster and consumes less memory. Several works [17], [18], [19] explore such point-based entity representation for SGG. Pixel2Graph [17] grounds edges at the midpoints between the bounding box centers of subjects and objects (referred to as subject and object centers for the rest of the paper).…”
Section: Feature Representations
confidence: 99%
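As a rough illustration of the grounding scheme described in this citation statement (a sketch, not code from Pixel2Graph or any cited work; box and function names are assumptions), an edge can be anchored at the midpoint between the subject's and object's bounding box centers:

```python
def box_center(box):
    # box = (x1, y1, x2, y2) in pixel coordinates
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def edge_midpoint(subject_box, object_box):
    # Ground the relationship at the midpoint between the
    # subject center and the object center.
    (sx, sy) = box_center(subject_box)
    (ox, oy) = box_center(object_box)
    return ((sx + ox) / 2.0, (sy + oy) / 2.0)

# Subject box centered at (5, 5), object box centered at (25, 5):
print(edge_midpoint((0, 0, 10, 10), (20, 0, 30, 10)))  # (15.0, 5.0)
```

The edge location is then a single pixel at which relationship features can be read out, rather than a box-shaped region to pool over.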
“…2) Point-based: single-pixel features extracted from bounding box centers. These methods [17], [18], [19] utilize anchor-free detectors [11], [13], [14] to ground entities and relationships in a regression fashion. 3) Query-based: fixed-size learnable embeddings.…”
Section: Introduction
confidence: 99%
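A minimal sketch of the point-based representation contrasted here (assumed names and shapes; not an implementation from any cited method): a single-pixel feature is read directly from the feature map at an entity's box center, with no RoI pooling over the box:

```python
import numpy as np

def point_feature(feature_map, center):
    # feature_map: (H, W, C) array; center: (x, y) in pixel coords.
    # Single-pixel lookup at the box center -- no pooling over a region.
    x, y = center
    return feature_map[int(round(y)), int(round(x))]

# Toy feature map with a distinctive vector at pixel (row=3, col=5):
fmap = np.zeros((8, 8, 4), dtype=np.float32)
fmap[3, 5] = 1.0
feat = point_feature(fmap, (5.0, 3.0))  # center (x=5, y=3)
print(feat)  # 4-dim feature vector at that pixel
```

This single lookup per entity is what makes the point-based family faster and lighter in memory than pooling features over variably shaped boxes.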