“…The huge amounts of video data that are becoming more and more available through online sources, enable and at the same time require continuously better performance in tasks like activity recognition, video saliency, scene analysis or video summarization, imposing the need to exploit not only spatial information, but also temporal [7,47,27]. Similar advances have also been achieved in audio processing areas, such as acoustic event detection [44], speech recognition [17,9], sound localiza-tion [56], by using deep learning techniques.…”