Olga Slizovskaia scite author profile

The focus of this work is to study how to efficiently tailor Convolutional Neural Networks (CNNs) towards learning timbre representations from log-mel magnitude spectrograms. We first review the trends when designing CNN architectures. Through this literature overview we discuss which are the crucial points to consider for efficiently learning timbre representations using CNNs. From this discussion we propose a design strategy meant to capture the relevant time-frequency contexts for learning timbre, which permits using domain knowledge for designing architectures. In addition, one of our main goals is to design efficient CNN architectures -what reduces the risk of these models to over-fit, since CNNs' number of parameters is minimized. Several architectures based on the design principles we propose are successfully assessed for different research tasks related to timbre: singing voice phoneme classification, musical instrument recognition and music auto-tagging.

show abstract

End-to-end Sound Source Separation Conditioned on Instrument Labels

Slizovskaia

Kim

Haro

2019

View full text Add to dashboard Cite

Can we perform an end-to-end music source separation with a variable number of sources using a deep learning model? This paper presents an extension of the Wave-U-Net [1] model which allows end-to-end monaural source separation with a non-fixed number of sources. Furthermore, we propose multiplicative conditioning with instrument labels at the bottleneck of the Wave-U-Net and show its effect on the separation results. This approach can be further extended to other types of conditioning such as audio-visual source separation and score-informed source separation.

show abstract

Vocoder-Based Speech Synthesis from Silent Videos

Michelsanti¹,

Slizovskaia²,

Haro³

et al. 2020

View full text Add to dashboard Cite

Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.

show abstract

Vocoder-Based Speech Synthesis from Silent Videos

Michelsanti¹,

Slizovskaia²,

Haro³

et al. 2020

Preprint

View full text Add to dashboard Cite

Musical Instrument Recognition in User-generated Videos using a Multimodal Convolutional Neural Network Architecture

Slizovskaia

Haro

2017

View full text Add to dashboard Cite

is paper presents a method for recognizing musical instruments in user-generated videos. Musical instrument recognition from music signals is a well-known task in the music information retrieval (MIR) eld, where current approaches rely on the analysis of the good-quality audio material. is work addresses a realworld scenario with several research challenges, i.e. the analysis of user-generated videos that are varied in terms of recording conditions and quality and may contain multiple instruments sounding simultaneously and background noise. Our approach does not only focus on the analysis of audio information, but we exploit the multimodal information embedded in the audio and visual domains. In order to do so, we develop a Convolutional Neural Network (CNN) architecture which combines learned representations from both modalities at a late fusion stage. Our approach is trained and evaluated on two large-scale video datasets: YouTube-8M and FCVID. e proposed architectures demonstrate state-of-the-art results in audio and video object recognition, provide additional robustness to missing modalities, and remains computationally cheap to train. CCS CONCEPTS •Information systems → Multimedia and multimodal retrieval; Music retrieval; •Computing methodologies → Neural networks; Visual content-based indexing and retrieval;

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.