Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2964292
Event Localization in Music Auto-tagging

Cited by 33 publications (35 citation statements)
References 21 publications
“…We use the output of the Conv5 layer as the output, S, of a sound model. [37]. There are three scales of input feature maps, and each of them has their own stack of early convolutions (Conv1 and Conv2).…”
Section: Methods
confidence: 99%
“…Schlüter utilized saliency maps to iteratively train a model that can recognize singing voices at the frame level [36]. Liu et al. applied FCNs to a general music auto-tagging problem so that the model detects various music-related properties at the frame level, including genres, instruments, vocals, etc. [37]. In this work, we also utilize an FCN model to derive the frame-level instrument sound predictions.…”
Section: B Audio Classification
confidence: 99%
“…• Since the network is supposed to handle variable length speech signals, we opt for a fully-convolutional architecture [17]. Following [4,18], we use "1D convolutional" [19] layers rather than 2D convolutional layers, to add flexibility of using recurrent layers in conjunction with the convolutional layers.…”
Section: Network Architecture
confidence: 99%
“…In our model, the convolution layers use 1D convolutions, namely doing convolutions along the temporal axis [6], [7]. The output tensor of a 1D convolution layer takes the shape (channels, temporal points).…”
Section: A Separation Model
confidence: 99%
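The excerpts above all rely on the same mechanism: a 1D convolution applied along the temporal axis produces an output of shape (channels, temporal points) and, because no fixed-size dense layer is involved, accepts inputs of any length. The following is a minimal NumPy sketch of that idea, not the implementation from any of the cited papers; the function name `conv1d` and all shapes are illustrative assumptions.

```python
import numpy as np

def conv1d(x, w):
    """Valid 1D convolution along the temporal axis (illustrative sketch).

    x: input features of shape (in_channels, T)
    w: kernels of shape (out_channels, in_channels, k)
    returns: output of shape (out_channels, T - k + 1)
    """
    out_ch, in_ch, k = w.shape
    T = x.shape[1]
    y = np.zeros((out_ch, T - k + 1))
    for t in range(T - k + 1):
        # Each output frame is a dot product over all input channels
        # and a window of k temporal points.
        y[:, t] = np.tensordot(w, x[:, t:t + k], axes=([1, 2], [0, 1]))
    return y

w = np.random.randn(32, 16, 5)          # 32 filters over 16 input channels, width 5
y = conv1d(np.random.randn(16, 100), w) # -> shape (32, 96)

# The same kernels apply unchanged to a different-length input,
# which is why fully-convolutional models handle variable-length signals.
y2 = conv1d(np.random.randn(16, 57), w) # -> shape (32, 53)
```

Stacking such layers (with nonlinearities) and reading out a sigmoid per output frame yields the frame-level tag or instrument predictions described in the citation statements.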