“…In the relatively young and challenging field of Music Information Retrieval (MIR), the wide adoption of practical music recommendation systems has induced a research shift toward contextual feature categories [16]-[18], which could be associated with audio-based techniques for enhancement [19], [20]. By contrast, in the more sophisticated field of Video Retrieval, where visual attributes, audio features, and narration text content are coupled, a standard protocol is disappointingly absent [21], [22], partly due to the lack of high-quality training datasets and of supporting information for queries [23]. Popular techniques such as deep learning and agent networks are often deployed to improve performance [24]-[27], and existing methods sometimes fuse distinct categories of information through feature learning [28].…”