“…In this paper we follow a widely used approach to basic unimodal feature extraction and multimodal alignment, similar to that of several proposed methods, such as [8,2,5,15,3,16,10,17,9,13]. More specifically, after the features of each modality (visual, textual, and aural) are extracted, we perform word-level alignment, a procedure first used for this task in [17].…”
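
The excerpt does not say how word-level alignment is computed. As a minimal sketch, the code below assumes one common convention: each word has a start and end timestamp (for instance from a forced aligner), and the visual or aural frames whose timestamps fall inside that span are averaged into a single vector for the word. The function name `align_to_words`, its parameters, and the zero-vector fallback for words with no frames are hypothetical choices made for illustration, not details taken from the paper or from [17].

```python
from typing import List, Sequence, Tuple


def align_to_words(
    word_spans: Sequence[Tuple[float, float]],
    frame_times: Sequence[float],
    frames: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Average the frames that fall inside each word's [start, end) span.

    Assumption: alignment is done by mean-pooling frames per word.
    The source text does not specify the pooling operation.
    """
    if len(frame_times) != len(frames):
        raise ValueError("frame_times and frames must have the same length")

    dim = len(frames[0]) if frames else 0
    aligned: List[List[float]] = []
    for start, end in word_spans:
        # Collect every frame whose timestamp lies within this word's span.
        inside = [f for t, f in zip(frame_times, frames) if start <= t < end]
        if inside:
            # Column-wise mean over the collected frames.
            aligned.append([sum(col) / len(col) for col in zip(*inside)])
        else:
            # Assumption: a word with no overlapping frames gets a zero vector.
            aligned.append([0.0] * dim)
    return aligned


if __name__ == "__main__":
    # Three words; the last one has no frames inside its span.
    spans = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.2)]
    times = [0.1, 0.3, 0.6, 0.9]
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(align_to_words(spans, times, feats))
    # prints [[2.0, 3.0], [6.0, 7.0], [0.0, 0.0]]
```

After this step every modality has exactly one feature vector per word, so the visual, textual, and aural sequences share the same length and can be combined position by position.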