Concept-Enhanced Relation Network for Video Visual Relation Inference

Cao, Qianwen; Huang, Heyan; Ren, Mucheng; Yuan, Changsen

doi:10.1109/tcsvt.2022.3220426

Cited by 1 publication

(2 citation statements)

References 50 publications

“…Cao et al [49] proposed using comprehensive semantic representations that are useful for knowledge transfer across relationships to solve the VidVRD problem. Their approach, the Concept-Enhanced Relation Network (CKERN) produces conceptually richer semantic representations of the detected object pairs, and then predicts the relationship based on the integration of multi-modal features.…”

Section: A Video Relationship Detection (mentioning)

confidence: 99%

Video Relationship Detection Using Mixture of Experts

2023

Machine comprehension of visual information from images and videos by neural networks suffers from two limitations: (1) the computational and inference gap in vision and language to accurately determine which object a given agent acts on and then to represent it by language, and (2) the shortcoming in stability and generalization of the classifier trained by a single, monolithic neural network. To address these limitations, we propose MoE-VRD, a novel approach to visual relationship detection via a mixture of experts. MoE-VRD recognizes language triplets in the form of a <subject, predicate, object> tuple to extract the relationship between subject, predicate, and object from visual processing. Since detecting a relationship between a subject (acting) and the object(s) (being acted upon) requires that the action be recognized, we base our network on recent work in visual relationship detection. To address the limitations associated with single monolithic networks, our mixture of experts is based on multiple small models, whose outputs are aggregated. That is, each expert in MoE-VRD is a visual relationship learner capable of detecting and tagging objects. MoE-VRD employs an ensemble of networks while preserving the complexity and computational cost of the original underlying visual relationship model by applying a sparsely-gated mixture of experts, which allows for conditional computation and a significant gain in neural network capacity. We show that the conditional computation capabilities and massive ability to scale the mixture-of-experts lead to an approach to the visual relationship detection problem which outperforms the state-of-the-art.

“…• CKERN [49], which generates comprehensive semantic representations by incorporating retrieved concepts with local semantics.…”

Section: Multi-expert Performance (mentioning)

confidence: 99%