“…Thus, it is important to have a precise spectrum representation with maximum information gain and no redundancy. For example, in the case of image and text, images can be represented in spatial or spectral form, whereas text is symbolic and depends on grammar rules and cultural norms [2].…”

Table excerpt (partial):
(continued row; row number and modalities not shown)
  Image retrieval in different categories: 2014, 2019 [38], [39]
  24 distinct-category image-text retrieval: 2012 [40]
  Disaster and emergency management: 2016 [25]
  Image-text retrieval in various categories: 2017, 2018 [41], [42]
10 (I, T, V)
  Image-text and video-text retrieval in multiple categories: 2015 [43]
  Video, image and text retrieval in video lectures: 2014 [44]
11 (T, V)
  Multiple concepts' video annotation: 2011 [45]
  Cooking activities' video annotation, videos' temporal activity localization evaluation, personal videos' annotation: 2019 [46]
  Cooking recipe retrieval: 2019 [28]
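To make the contrast concrete: an image can be stored either as pixel intensities (spatial form) or as frequency coefficients (spectral form). Text, by contrast, is a sequence of discrete symbols. The sketch below is only an illustration under stated assumptions and is not a method from the source. It uses a naive 2-D discrete Fourier transform as one standard choice of spectral representation. It also uses a hypothetical whitespace tokenizer with a growing vocabulary as a stand-in for symbolic text encoding. Both `dft2` and `tokenize` are illustrative helpers introduced here.

```python
import cmath


def dft2(image):
    """Naive 2-D discrete Fourier transform: maps a spatial-domain image
    (a list of rows of pixel intensities) to its spectral-domain coefficients."""
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        row = []
        for v in range(cols):
            total = 0j
            for x in range(rows):
                for y in range(cols):
                    # Basis function for frequency (u, v) evaluated at pixel (x, y).
                    angle = -2j * cmath.pi * (u * x / rows + v * y / cols)
                    total += image[x][y] * cmath.exp(angle)
            row.append(total)
        spectrum.append(row)
    return spectrum


def tokenize(text, vocab):
    """Hypothetical symbolic text encoding: map each whitespace-separated,
    lowercased word to an integer id, adding unseen words to `vocab`."""
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids


image = [[1, 2], [3, 4]]           # tiny 2x2 grayscale image (spatial form)
spectrum = dft2(image)              # same image in spectral form
print(round(spectrum[0][0].real))   # 10: the DC term equals the sum of all pixels

vocab = {}
print(tokenize("a dog chases a ball", vocab))  # [0, 1, 2, 0, 3]
```

The DFT is invertible, so the spectral form carries the same information as the pixels, only arranged by frequency. The token ids, however, are arbitrary symbols. Their meaning comes from the vocabulary and the language's rules, not from any numeric relation between ids.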