2018 IEEE International Symposium on Multimedia (ISM) 2018
DOI: 10.1109/ism.2018.00-21
Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA

Abstract: Deep learning has successfully shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities, such as audio and video, should be taken into account. Music video retrieval by a given musical audio is a natural way to search and interact with music contents. In this work, we study cross-modal music video retrieval in terms of emotion similarity. Part…

Cited by 36 publications (23 citation statements); references 21 publications.
“…Cross-modal recognition: Cross-modal recognition approaches using embedding have attracted much attention as a technique that can perform effective bidirectional recognition between different modalities (e.g., image, text and audio). Related to audio processing, some researchers explored cross-modal recognition between audio and image [42], [43] and between audio and text (lyrics) [44]. However, to the best of our knowledge, no existing work addresses cross-modal recognition between audio and emotion except our previous study [45], where MultiLayer Perceptrons (MLPs) based on CCA loss are used to compute music and emotion embeddings.…”
Section: Related Work
confidence: 99%
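The statement above describes training MLPs with a CCA-based loss so that paired music and emotion embeddings become correlated. As a rough illustration only (the cited papers use the full deep CCA objective with whitening, not this simplification), a minimal correlation-based loss between two embedding views can be sketched as:

```python
import numpy as np

def cca_correlation_loss(x_emb, y_emb, eps=1e-8):
    """Negative sum of per-dimension Pearson correlations between two
    embedding views. Minimising it pushes paired embeddings to correlate;
    a simplified stand-in for the full (deep) CCA objective."""
    xc = x_emb - x_emb.mean(axis=0)          # centre each view
    yc = y_emb - y_emb.mean(axis=0)
    num = (xc * yc).sum(axis=0)              # per-dimension covariance
    den = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0)) + eps
    return -np.sum(num / den)

# Perfectly correlated views give a loss close to -d (d = embedding dim).
x = np.random.RandomState(0).randn(32, 4)
loss = cca_correlation_loss(x, 2.0 * x + 1.0)   # affine copy: correlation 1 per dim
```

In a real system this loss would sit on top of the two modality-specific MLPs and be minimised jointly with their weights; here it is computed once on synthetic data purely to show the shape of the objective.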
“…Supervised approaches: In the case of supervised learning, the matching criterion that associates the audio and video modalities is deduced from additional sources of information. Typically, mood tags [26], [33] or projections into the valence-arousal plane [23] can be used to recommend music and videos that have similar emotional content. The use of mood information accelerates training and allows the systems to reach promising retrieval performance.…”
Section: A. Music-Video Embeddings
confidence: 99%
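The valence-arousal matching described above can be illustrated with a small sketch: a video is projected to a point in the valence-arousal plane, and the music tracks whose points lie closest are recommended. The coordinates and the nearest-neighbour rule below are illustrative assumptions, not the cited systems' actual projections.

```python
import numpy as np

# Hypothetical valence-arousal coordinates in [-1, 1]^2 for a tiny music library.
music_va = np.array([
    [0.8, 0.7],    # happy / energetic
    [-0.6, -0.5],  # sad / calm
    [-0.7, 0.6],   # angry / tense
])

def recommend_music(video_va, library, k=1):
    """Return indices of the k tracks whose valence-arousal points lie
    closest (Euclidean distance) to the video's projected point."""
    d = np.linalg.norm(library - np.asarray(video_va), axis=1)
    return np.argsort(d)[:k].tolist()

# A calm, melancholic video maps near the sad/calm track.
idx = recommend_music([-0.5, -0.4], music_va)
```

The same rule works in the reverse direction (video recommendation for an audio query) by swapping which modality supplies the query point.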
“…Examples of systems for music recommendation given video as input are [11], [12], [23], [26], [27]. In a symmetrical way, examples of systems for video recommendation given audio as input are [13], [33].…”
Section: B. Usages of Music-Video Embeddings
confidence: 99%
“…Zheng et al. implemented a cross-modal audio-video embedding algorithm through Supervised Deep Canonical Correlation Analysis (S-DCCA) [101]. In this model, audio and video are projected into a shared space to bridge the semantic gap between audio and video.…”
confidence: 99%
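Once audio and video are projected into a shared space, retrieval reduces to ranking by similarity in that space. A minimal sketch of the retrieval step, assuming the embeddings have already been produced by the learned projections (the vectors below are made up for illustration):

```python
import numpy as np

def l2_normalise(m):
    """Row-wise L2 normalisation so dot products become cosine similarities."""
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)

def retrieve_videos(audio_vec, video_embs, k=2):
    """Rank video embeddings by cosine similarity to an audio query,
    assuming both modalities already live in the shared embedding space."""
    q = audio_vec / (np.linalg.norm(audio_vec) + 1e-12)
    sims = l2_normalise(video_embs) @ q          # cosine similarity per video
    return np.argsort(-sims)[:k].tolist()        # best matches first

# Toy shared-space embeddings for three videos and one audio query.
videos = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
ranked = retrieve_videos(np.array([1.0, 0.1]), videos)
```

Cosine similarity is a common choice here because the correlation objective constrains directions rather than magnitudes, but the paper's exact ranking function may differ.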