End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-character Recognition Model

Stoller, Daniel; Durand, Simon; Ewert, Sebastian

doi:10.1109/icassp.2019.8683470

Cited by 54 publications

(90 citation statements)

References 16 publications

“…We compare the performance of a standard ASR trained on extracted singing vocals and polyphonic audio for the tasks of lyrics alignment (Table 3) and transcription (Table 4). The alignment performance is measured as the mean absolute word boundary error (AE) for each song, averaged over all songs of a dataset, in seconds [12,14], and lyrics transcription performance is measured as the word error rate (WER) which is a standard performance measure for ASR systems. We see an improvement in both alignment and transcription performance with ASR trained on polyphonic data than vocal extracted data, on all the test datasets.…”

Section: Singing Vocal Extraction Vs Polyphonic Audio (mentioning)

confidence: 99%

Automatic Lyrics Alignment and Transcription in Polyphonic Music: Does Background Music Help?

Gupta, Yılmaz (2020)

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Background music affects lyrics intelligibility of singing vocals in a music piece. Automatic lyrics alignment and transcription in polyphonic music are challenging tasks because the singing vocals are corrupted by the background music. In this work, we propose to learn music genre-specific characteristics to train polyphonic acoustic models. We first compare several automatic speech recognition pipelines for the application of lyrics transcription. We then present the lyrics alignment and transcription performance of music-informed acoustic models for the best-performing pipeline, and systematically study the impact of music genre and language model on the performance. With such a genre-based approach, we explicitly model the music without removing it during acoustic modeling. The proposed approach outperforms all competing systems in the lyrics alignment and transcription tasks on several well-known polyphonic test datasets.
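
The citation statement above names two evaluation measures: the mean absolute word boundary error (AE), computed per song and then averaged over a dataset, and the word error rate (WER). The following is a minimal sketch of both, not the cited authors' evaluation code. It assumes that each word has one reference and one estimated boundary time in seconds, and that WER is the word-level edit distance divided by the number of reference words. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def song_ae(ref_times: Sequence[float], est_times: Sequence[float]) -> float:
    """Mean absolute word boundary error (seconds) for one song.

    Assumption: ref_times[i] and est_times[i] are the boundary times of the same word.
    """
    if len(ref_times) != len(est_times) or len(ref_times) == 0:
        raise ValueError("need equally long, non-empty boundary lists")
    return sum(abs(r - e) for r, e in zip(ref_times, est_times)) / len(ref_times)


def dataset_ae(songs: Sequence[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average the per-song AE over all songs of a dataset."""
    per_song: List[float] = [song_ae(ref, est) for ref, est in songs]
    return sum(per_song) / len(per_song)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion of a reference word
                           cur[j - 1] + 1,           # insertion of a hypothesis word
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Averaging per song first, as the statement describes, gives every song equal weight regardless of how many words it contains.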

“…For the acoustic model F, we rely on a Convolutional Recurrent Network (CRNN) trained with the CTC algorithm, as in [16]. CTC-based acoustic models were successfully implemented for singing-related tasks, such as lyrics-to-audio alignment [13,15] and keyword spotting [16]. Following their works, we also employ a singing voice separation preprocessing step during training and inference, which improves performance over using polyphonic data in [13].…”

Section: Phoneme Recognition (mentioning)

confidence: 99%

“…CTC-based acoustic models were successfully implemented for singing-related tasks, such as lyrics-to-audio alignment [13,15] and keyword spotting [16]. Following their works, we also employ a singing voice separation preprocessing step during training and inference, which improves performance over using polyphonic data in [13]. For the recurrent layers, we choose bidirectional Long Short-Term Memory (LSTM) cells to take the full sequence into account when predicting characters at each time frame.…”

Section: Phoneme Recognition (mentioning)

confidence: 99%

“…Recent works have trained new acoustic models with singing data and show great results in lyrics transcription [13], lyrics-to-audio alignment [14,15] and explicit content detection [16], using more recent DNN techniques. In this work, we propose to apply these advances to a phonotactic SLID system: in particular, the usage of the CTC algorithm allows the acoustic model to be trained with DALI, a large multilingual singing dataset [17], while alleviating the need for frame-level aligned lyrics.…”

Section: Introduction (mentioning)

confidence: 99%

Singing Language Identification Using a Deep Phonotactic Approach

Renault, Vaglio, Hennequin (2021)

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Extensive work has tackled Language Identification (LID) in the speech domain; however, its application to the singing voice lags behind, and performance on Singing Language Identification (SLID) can be improved by leveraging recent progress made in other singing-related tasks. This work presents a modernized phonotactic system for SLID on polyphonic music: phoneme recognition is performed with a Connectionist Temporal Classification (CTC)-based acoustic model trained with multilingual data, before language classification with a recurrent model based on the phoneme estimates. The full pipeline is trained and evaluated with a large and publicly available dataset, with unprecedented performance. First results of SLID with out-of-set languages are also presented.
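
The statements above describe CTC-based acoustic models that predict a character or phoneme at each time frame. As a minimal illustration of the standard CTC convention for turning framewise labels into an output sequence (merge consecutive repeats, then drop blanks), the sketch below uses a hypothetical blank symbol `_` and a hypothetical function name. It is not taken from any of the cited systems and covers only greedy decoding, not CTC training or alignment.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_greedy_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Collapse per-frame best labels into an output sequence (CTC convention)."""
    # Step 1: merge runs of identical consecutive labels into a single label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks; a blank between two equal labels keeps them distinct.
    return [label for label in merged if label != blank]


# Eleven frames collapse to the five characters of "hello".
print("".join(ctc_greedy_collapse(list("hh_e_ll_lo_"))))
```

Because the blank separates the two runs of `l`, the doubled letter survives the merge step, which is why CTC needs a blank symbol at all.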

“…However, the main obstacle for exploiting side information remains: accurately labeled data is expensive to create and thus rare. For example, aligning a score on the note level or lyrics on the phoneme level would require manual annotations, as creating such fine alignment automatically remains an open problem [15,16]. On the other hand, weak side information such as non-aligned scores or lyrics is often easily available but not straightforward to employ.…”

Section: Introduction (mentioning)

confidence: 99%

Weakly Informed Audio Source Separation

Schulze-Forster, Doire, Richard, et al. (2019)

2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

Prior information about the target source can improve audio source separation quality but is usually not available with the necessary level of audio alignment. This has limited its usability in the past. We propose a separation model that can nevertheless exploit such weak information for the separation task while aligning it on the mixture as a byproduct using an attention mechanism. We demonstrate the capabilities of the model on a singing voice separation task exploiting artificial side information with different levels of expressiveness. Moreover, we highlight an issue with the common separation quality assessment procedure regarding parts where targets or predictions are silent and refine a previous contribution for a more complete evaluation.
