ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683470

End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-character Recognition Model

Abstract: Time-aligned lyrics can enrich the music listening experience by enabling karaoke, text-based song retrieval and intra-song navigation, and other applications. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to combine numerous sub-modules including vocal separation and detection in an effort to break down the problem. Furthermore, training required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified …

Cited by 54 publications (90 citation statements)
References: 16 publications
“…We compare the performance of a standard ASR trained on extracted singing vocals and polyphonic audio for the tasks of lyrics alignment (Table 3) and transcription (Table 4). The alignment performance is measured as the mean absolute word boundary error (AE) for each song, averaged over all songs of a dataset, in seconds [12,14], and lyrics transcription performance is measured as the word error rate (WER) which is a standard performance measure for ASR systems. We see an improvement in both alignment and transcription performance with ASR trained on polyphonic data than vocal extracted data, on all the test datasets.…”
Section: Singing Vocal Extraction Vs Polyphonic Audio
Citation type: mentioning
confidence: 99%
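
The two measures named in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' evaluation code: it assumes one timestamp per word (for example the word onset) with reference and estimate lists of equal length, and it computes WER as word-level edit distance divided by the number of reference words. All function names are hypothetical.

```python
from statistics import mean


def mean_absolute_boundary_error(ref_times, est_times):
    """Mean absolute word boundary error (AE) for one song, in seconds.

    Assumes one boundary timestamp per word, in the same word order.
    """
    if len(ref_times) != len(est_times) or not ref_times:
        raise ValueError("need two non-empty lists of equal length")
    return mean(abs(r - e) for r, e in zip(ref_times, est_times))


def dataset_alignment_error(songs):
    """Average the per-song AE over all songs of a dataset.

    `songs` is an iterable of (ref_times, est_times) pairs.
    """
    return mean(mean_absolute_boundary_error(r, e) for r, e in songs)


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Averaging per song first, then over songs, gives every song equal weight regardless of how many words it contains, which matches the wording "for each song, averaged over all songs of a dataset".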
“…For the acoustic model F, we rely on a Convolutionnal Recurrent Network (CRNN) trained with the CTC algorithm, as in [16]. CTC-based acoustic models were successfully implemented for singing-related tasks, such as lyrics-to-audio alignment [13,15] and keyword spotting [16]. Following their works, we also employ a singing voice separation preprocessing step during training and inference, which improves performance over using polyphonic data in [13].…”
Section: Phoneme Recognition
Citation type: mentioning
confidence: 99%
“…CTC-based acoustic models were successfully implemented for singing-related tasks, such as lyrics-to-audio alignment [13,15] and keyword spotting [16]. Following their works, we also employ a singing voice separation preprocessing step during training and inference, which improves performance over using polyphonic data in [13]. For the recurrent layers, we choose bidirectional Long Short-Term Memory (LSTM) cells to take the full sequence into account when predicting characters at each time frame.…”
Section: Phoneme Recognition
Citation type: mentioning
confidence: 99%
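
To make "predicting characters at each time frame" concrete, here is a minimal sketch of greedy (best-path) decoding of CTC outputs: take the highest-scoring symbol per frame, merge consecutive repeats, then drop the blank symbol. It is a generic illustration of how frame-wise CTC predictions collapse into a character sequence, not the alignment procedure of the cited papers. The blank symbol `"_"`, the toy alphabet, and the function name are assumptions.

```python
BLANK = "_"  # hypothetical blank symbol; real systems pick their own index


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Collapse per-frame scores into a character string.

    frame_scores: list of per-frame score lists, one score per alphabet entry.
    alphabet: list of symbols, including the blank.
    """
    # Step 1: best symbol for each time frame (argmax over the alphabet).
    best = [alphabet[max(range(len(scores)), key=scores.__getitem__)]
            for scores in frame_scores]
    # Step 2: merge consecutive repeats, then remove blanks.
    decoded = []
    previous = None
    for symbol in best:
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two "a" characters
    [0.1, 0.6, 0.3],    # a
    [0.1, 0.2, 0.7],    # b
    [0.0, 0.1, 0.9],    # b (repeat, merged)
]
result = ctc_greedy_decode(frames, alphabet)
```

The blank frame is what lets the output contain a doubled character: without it, the three "a" frames would merge into a single "a".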
“…However, the main obstacle for exploiting side information remains: accurately labeled data is expensive to create and thus rare. For example, aligning a score on the note level or lyrics on the phoneme level would require manual annotations, as creating such fine alignment automatically remains an open problem [15,16]. On the other hand, weak side information such as non-aligned scores or lyrics is often easily available but not straightforward to employ.…”
Section: Introduction
Citation type: mentioning
confidence: 99%