Automatic Lyrics Alignment and Transcription in Polyphonic Music: Does Background Music Help?

Gupta, Chitralekha; Yılmaz, Emre; Li, Haizhou

doi:10.1109/icassp40776.2020.9054567

Cited by 36 publications

(90 citation statements)

References 22 publications

“…Then a forward-pass decoding algorithm is applied on these posteriograms, obtaining phoneme alignments. Then using a language model (LM), phoneme posteriograms can be converted to word posteriograms to retrieve word-level alignments [1, 3, 4]. One recent successful system [2] showed a considerable performance boost compared to previous research using an end-to-end approach trained on a large corpus, where alphabetic characters are used as sub-word units of speech.…”

Section: Related Work (mentioning)

confidence: 99%
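The forward-pass decoding that the statement above mentions can be illustrated with a minimal sketch of monotonic Viterbi forced alignment over a phoneme posteriogram. This is not the implementation used by any of the cited systems, which build on full ASR toolkits with beam pruning and language models. The function name `forced_align`, the dict-per-frame posteriogram layout, and the toy values are all assumptions made for illustration.

```python
import math


def forced_align(log_post, phones):
    """Align a known phoneme sequence to frames (hypothetical helper).

    log_post: list of frames, each a dict mapping phoneme -> log posterior.
    phones:   expected phoneme sequence, e.g. derived from the lyrics.
    Returns (phoneme, start_frame, end_frame) tuples, end exclusive.
    """
    T, N = len(log_post), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phoneme")
    NEG = -math.inf
    score = [[NEG] * N for _ in range(T)]
    advanced = [[False] * N for _ in range(T)]  # True if we moved from j-1
    score[0][0] = log_post[0][phones[0]]
    for t in range(1, T):
        for j in range(min(t + 1, N)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else NEG
            best = stay
            if move > stay:
                best, advanced[t][j] = move, True
            if best > NEG:
                score[t][j] = best + log_post[t][phones[j]]
    # Backtrace from the last phoneme at the last frame.
    state_at = [0] * T
    j = N - 1
    for t in range(T - 1, -1, -1):
        state_at[t] = j
        if advanced[t][j]:
            j -= 1
    # Collapse per-frame states into contiguous segments.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or state_at[t] != state_at[start]:
            segments.append((phones[state_at[start]], start, t))
            start = t
    return segments


# Toy posteriogram: four frames, two phonemes (values are made up).
frames = [
    {"h": math.log(0.9), "i": math.log(0.1)},
    {"h": math.log(0.8), "i": math.log(0.2)},
    {"h": math.log(0.2), "i": math.log(0.8)},
    {"h": math.log(0.1), "i": math.log(0.9)},
]
alignment = forced_align(frames, ["h", "i"])
print(alignment)
```

The score table grows with the number of frames times the number of phonemes, which gives some intuition for why aligning an entire recording in a single pass can become memory-hungry.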

“…In addition, the authors have used a public dataset [9] that is much smaller than the training set used in [2]. Gupta et al. [3] reported state-of-the-art results using an acoustic model trained on polyphonic music using genre-specific phonemes. According to the authors, their system applies forced alignment with a large beam size as their system attempts to process the entire music recording at once.…”

Section: Related Work (mentioning)

confidence: 99%

“…VA [5] also uses an end-to-end model, but trained on the DALI (v1.0) dataset, which has over 200 hours of polyphonic music recordings and extracts the vocals using Spleeter. GC1 [3] uses the same training data, for constructing an acoustic model in the hybrid-ASR setting [25] and performs alignment on the original polyphonic mix as well. In addition to these models, we refer to our models which align words to the vocal tracks separated by Demucs and Spleeter as DE1 and DE2 respectively.…”

Section: Audio-to-lyrics Alignment (mentioning)

confidence: 99%

“…The task of aligning song lyrics with their corresponding music recordings is among the most challenging tasks in music information retrieval (MIR) research due to three major factors: the multimodality of the information to be processed (namely music and speech), the presence of the musical accompaniment in the acoustic scene and the length of the music recording to be aligned. For processing linguistically relevant information, previous studies have taken the approach of adapting automatic speech recognition (ASR) paradigms to singing voice signals [1,2,3,4]. Regarding the musical accompaniment, researchers have aligned lyrics on either source separated vocal tracks [4] or utilized acoustic models trained on polyphonic recordings [2,3,5].…”

Section: Introduction (mentioning)

confidence: 99%

Low Resource Audio-To-Lyrics Alignment from Polyphonic Music Recordings

Demirel, Ahlbäck, Dixon (2021)

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Lyrics alignment in long music recordings can be memory exhaustive when performed in a single pass. In this study, we present a novel method that performs audio-to-lyrics alignment with a low memory consumption footprint regardless of the duration of the music recording. The proposed system first spots the anchoring words within the audio signal. With respect to these anchors, the recording is then segmented and a second-pass alignment is performed to obtain the word timings. We show that our audio-to-lyrics alignment system performs competitively with the state-of-the-art, while requiring much less computational resources. In addition, we utilize our lyrics alignment system to segment the music recordings into sentence-level chunks. Notably on the segmented recordings, we report the lyrics transcription scores on a number of benchmark test sets. Finally, our experiments highlight the importance of the source separation step for good performance on the transcription and alignment tasks. For reproducibility, we publicly share our code with the research community.

Automatic Label Calibration for Singing Annotation Using Fully Convolutional Neural Network

Deng (2023)

IEEJ Transactions Elec Engng

Accurately‐labeled data is crucial for the training of machine learning models. For singing‐related tasks in the music information retrieval field, accurately‐labeled data is limited because annotating singing is time‐consuming. Several studies create vocal datasets using a two‐step annotation method which creates coarse labels first and then executes a manual calibration procedure. However, manually calibrating coarsely‐labeled singing data is expensive and time‐consuming. To address this problem, in this study we propose a singing‐label calibration framework, which aims to automatically calibrate the coarsely‐labeled singing data with higher accuracy. This framework contains a data augmentation method to generate training and testing data, a reasonable data preprocessing method to handle music audio and symbolic labels, a fully‐convolutional neural network to estimate the difference between coarse labels and accurate labels, and a novel calibration function to correct the coarse labels. Various experiments are conducted to examine the effect of our research. The results show that our model can highly reduce the cost time and slightly increase the labeling accuracy of the manual calibration process. © 2023 Institute of Electrical Engineers of Japan. Published by Wiley Periodicals LLC.