Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning

Khan, Muhammad Jaleed; Breslin, John G.; Curry, Edward

doi:10.1007/978-3-031-06981-9_6

Cited by 6 publications

(19 citation statements)

References 60 publications

“…Capturing all these relationships in a finite training dataset is nearly impossible. Therefore, the integration of common sense knowledge, including statistical priors [15,132,140], language priors [66,73,129], and KGs [28,30,49,50,52,130], becomes crucial. Common sense knowledge infusion helps bridge the gap between the limited training data and the vast semantic space, enabling a more accurate and comprehensive representation of relationships within a scene.…”

Section: Semantic Scene Representation (mentioning)

confidence: 99%

“…NeSy integration in SGG techniques can be loosely or tightly coupled. In loose coupling, [30,50,52,132] the neural and symbolic components operate independently, interacting as needed, and focus on distinct yet complementary tasks. Meanwhile, tight coupling [11,15,28,49,66,73,129,130,140] deeply integrates symbolic and neural components, either incorporating symbolic knowledge directly into the neural network architecture or encoding it into the network's distributed representation.…”

Section: Semantic Scene Representation (mentioning)

confidence: 99%

“…For instance, the relationship between a person and a bicycle in an image might depend on various factors, such as the position and orientation of the person and the bicycle, and deep learning architectures can learn to capture these complex dependencies. [15,28,30,49,50,52,66,73,130,132,140] use Faster RCNN [90] with a CNN-based backbone network for detecting objects in images prior to visual relationship detection. Khan et al [52] used the feature maps extracted from the underlying CNN in Faster RCNN by applying RoIAlign to the image regions to obtain local and global region features of each detected object, which forms the basis for further processing and relationship prediction.…”

Section: Deep Learning Architectures (mentioning)

confidence: 99%

“…[15,28,30,49,50,52,66,73,130,132,140] use Faster RCNN [90] with a CNN-based backbone network for detecting objects in images prior to visual relationship detection. Khan et al [52] used the feature maps extracted from the underlying CNN in Faster RCNN by applying RoIAlign to the image regions to obtain local and global region features of each detected object, which forms the basis for further processing and relationship prediction. DSGAT [140] incorporated Faster RCNN with a VGG16 backbone in its bounding box module to generate object proposals prior to visual relationship detection.…”

Section: Deep Learning Architectures (mentioning)

confidence: 99%

“…However, by integrating relational and background information via NeSy approaches, the network goes beyond mere identification. It begins to understand complex interactions, such as a pedestrian waiting to cross the road or a car stopping at a traffic light, by infusing common sense knowledge from knowledge graphs [50]. This NeSy integration imbues the network with an enhanced ability to reason about the scene, recognising not just the objects but also their interrelations and implied actions.…”

Section: Introduction (mentioning)

confidence: 99%

A survey of neurosymbolic visual reasoning with scene graphs and common sense knowledge

Khan, Ilievski, Breslin et al. 2024

NAI

Self Cite

Combining deep learning and common sense knowledge via neurosymbolic integration is essential for semantically rich scene representation and intuitive visual reasoning. This survey paper delves into data- and knowledge-driven scene representation and visual reasoning approaches based on deep learning, common sense knowledge and neurosymbolic integration. It explores how scene graph generation, a process that detects and analyses objects, visual relationships and attributes in scenes, serves as a symbolic scene representation. This representation forms the basis for higher-level visual reasoning tasks such as visual question answering, image captioning, image retrieval, image generation, and multimodal event processing. Infusing common sense knowledge, particularly through the use of heterogeneous knowledge graphs, improves the accuracy, expressiveness and reasoning ability of the representation and allows for intuitive downstream reasoning. Neurosymbolic integration in these approaches ranges from loose to tight coupling of neural and symbolic components. The paper reviews and categorises the state-of-the-art knowledge-based neurosymbolic approaches for scene representation based on the types of deep learning architecture, common sense knowledge source and neurosymbolic integration used. The paper also discusses the visual reasoning tasks, datasets, evaluation metrics, key challenges and future directions, providing a comprehensive review of this research area and motivating further research into knowledge-enhanced and data-driven neurosymbolic scene representation and visual reasoning.
