Interspeech 2019
DOI: 10.21437/interspeech.2019-1520

Acoustic Modeling for Automatic Lyrics-to-Audio Alignment

Abstract: Automatic alignment of lyrics to polyphonic audio is a challenging task, not only because the vocals are corrupted by background music, but also because there is a lack of annotated polyphonic corpora for effective acoustic modeling. In this work, we propose (1) using additional speech- and music-informed features and (2) adapting acoustic models trained on a large amount of solo singing vocals towards polyphonic music using a small amount of in-domain data. Incorporating additional information such as voicing and audit…

Cited by 17 publications (21 citation statements)
References 24 publications (43 reference statements)
“…Singing vocal extraction was then applied on the test data [7,9,14]. Such acoustic models can be adapted to a small set of extracted vocals to reduce the mismatch of acoustic models between training and testing [11]. Now that we have available a relatively large polyphonic lyrics annotated dataset (DALI) [15], we explore two approaches for acoustic modeling for the task of lyrics transcription and alignment: (1) to apply singing vocal extraction from the polyphonic audio as a pre-processing step, and train acoustic models with the extracted singing vocals, and (2) to train acoustic models using the lyrics annotated polyphonic dataset directly.…”
Section: Singing Vocal Extraction vs. Polyphonic Audio
confidence: 99%
“…Moreover, this requires a separate training setup for the singing voice separation system. In our latest work [11], we trained acoustic models on a large amount of solo singing vocals and adapted them towards polyphonic music using a small amount of in-domain data: extracted singing vocals, and polyphonic audio. We found that domain adaptation with polyphonic data outperforms that with extracted singing vocals.…”
Section: Introduction
confidence: 99%
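The adaptation recipe the excerpt above describes (pretrain on a large solo-singing corpus, then fine-tune on a small in-domain polyphonic set) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual system: the model class `SoloSingingAM`, its layer sizes, the label count, and the `polyphonic_loader` are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

class SoloSingingAM(nn.Module):
    """Hypothetical frame-level acoustic model: features -> phone posteriors."""
    def __init__(self, n_feats=40, n_phones=42):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_feats, 256), nn.ReLU(),
                                     nn.Linear(256, 256), nn.ReLU())
        self.classifier = nn.Linear(256, n_phones)

    def forward(self, x):
        return self.classifier(self.encoder(x))

model = SoloSingingAM()
# In the real setting, weights pretrained on solo singing would be loaded here:
# model.load_state_dict(torch.load("solo_singing_am.pt"))

# Dummy stand-in for a small in-domain set (extracted vocals or polyphonic audio):
polyphonic_loader = [(torch.randn(8, 40), torch.randint(0, 42, (8,)))]

# Adapt with a small learning rate so the small in-domain set nudges, rather
# than overwrites, what was learned from the large solo-singing corpus.
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
for feats, phone_targets in polyphonic_loader:
    opt.zero_grad()
    loss = loss_fn(model(feats), phone_targets)
    loss.backward()
    opt.step()
```

The key design choice is the reduced learning rate during adaptation; whether the source domain is extracted vocals or raw polyphonic audio is what the quoted comparison varies.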
“…Optimize M with the loss. Training Details. The training of the alignment model is shown in Algorithm 1 [13]. In Line 3, we begin training the model for e epochs.…”
Section: A Reproducibility, A.1 Details in Data Crawling
confidence: 99%
“…In our work, we set C to 40, and T to 100. To Line 17, we use DP to find the best splitting boundary and record the reward. From Line 18 to Line 22, we collect the best splitting boundary, starting from the …

Footnote 12: https://github.com/psf/requests
Footnote 13: We show the training process when the mini-batch size is 1 for simplicity.

Algorithm 2 (DP for Duration Extraction):
1: Input: alignment matrix A ∈ R^(T×S)
2: Output: phoneme durations D ∈ R^T
3: Initialize: reward matrix O ∈ R^(T×S) to zeros.…”
Section: A Reproducibility, A.1 Details in Data Crawling
confidence: 99%
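The quoted Algorithm 2 extracts phoneme durations from an alignment matrix by dynamic programming. Below is a minimal sketch of one standard way to do this: a monotonic Viterbi pass over A (T frames × S phonemes), where each frame is assigned to exactly one phoneme, assignments only move forward, and the total alignment score is maximized. This is an assumption-laden illustration, not necessarily the cited paper's exact reward or backtracking scheme.

```python
import numpy as np

def extract_durations(A):
    """Monotonic DP over alignment matrix A (T frames x S phonemes).

    Returns integer durations, one per phoneme, summing to T.
    """
    T, S = A.shape
    NEG = -1e9
    O = np.full((T, S), NEG)            # reward matrix: best score ending at (t, s)
    O[0, 0] = A[0, 0]                   # alignment must start at the first phoneme
    back = np.zeros((T, S), dtype=int)  # 0 = stayed on phoneme s, 1 = advanced from s-1
    for t in range(1, T):
        for s in range(S):
            stay = O[t - 1, s]
            adv = O[t - 1, s - 1] if s > 0 else NEG
            if adv > stay:
                O[t, s] = adv + A[t, s]
                back[t, s] = 1
            else:
                O[t, s] = stay + A[t, s]
    # Backtrack from the final frame on the last phoneme, counting frames per phoneme.
    durations = np.zeros(S, dtype=int)
    s = S - 1
    for t in range(T - 1, -1, -1):
        durations[s] += 1
        if t > 0 and back[t, s] == 1:
            s -= 1
    return durations

# Example: a diagonal alignment matrix assigns one frame to each phoneme.
print(extract_durations(np.eye(3)))  # → [1 1 1]
```

The forward pass fills the reward matrix O (initialized as in the quoted Line 3); the backtrack then reads off the splitting boundaries as durations.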