HMM-Based Speech Segmentation: Improvements of Fully Automatic Approaches

Brognaux, Sandrine; Drugman, Thomas

doi:10.1109/taslp.2015.2456421

Cited by 31 publications

(10 citation statements)

References 34 publications

“…The baseline is a 1-state monophone DNN/HSMM model. We use monophone model because our small dataset doesn't have enough phoneme instances for exploring the context-dependent triphones model, also Brognaux and Drugman [6] and Pakoci et al [10] argued that context-dependent model can't bring significant alignment improvement. It is convenient to apply 1-state model because each phoneme can be represented by a semi-Markovian state carrying a state occupancy time distribution.…”

Section: Baseline Methods (mentioning)

confidence: 99%

“…MFCCs) to the HMM states. Brognaux and Drugman [6] explored the forced alignment on a small dataset using supplementary acoustic features and initializing the silence model by voice activity detection algorithm. To predict the confidence measure of the aligned word boundaries and to fine-tune their time positions, Serriére et al [7] explored an alignment postprocessing method using a deep neural network (DNN).…”

Section: Related Work (mentioning)

confidence: 99%

Singing Voice Phoneme Segmentation by Hierarchically Inferring Syllable and Phoneme Onset Positions

Gong

Serra

2018

Interspeech 2018

In this paper, we tackle the singing voice phoneme segmentation problem in the singing training scenario by using language-independent information: onset and prior coarse duration. We propose a two-step method. In the first step, we jointly calculate the syllable and phoneme onset detection functions (ODFs) using a convolutional neural network (CNN). In the second step, the syllable and phoneme boundaries and labels are inferred hierarchically by using a duration-informed hidden Markov model (HMM). To achieve the inference, we incorporate the a priori duration model as the transition probabilities and the ODFs as the emission probabilities into the HMM. The proposed method is designed in a language-independent way such that no phoneme class labels are used. For the model training and algorithm evaluation, we collect a new jingju (also known as Beijing or Peking opera) solo singing voice dataset and manually annotate the boundaries and labels at phrase, syllable and phoneme levels. The dataset is publicly available. The proposed method is compared with a baseline method based on hidden semi-Markov model (HSMM) forced alignment. The evaluation results show that the proposed method outperforms the baseline by a large margin regarding both segmentation and onset detection tasks.

“…These features are the short-term energy, zero crossing rate and the singularity exponents calculated in each point of signal. While (Brognaux and Thomas, 2016) focuses on a particular case of hidden Markov model (HMM)-based forced alignment in which the models are directly trained on the corpus to align. Kamper et al (2017) introduces an approximation to a recent Bayesian model that still has a clear objective function but improves efficiency by using hard clustering and segmentation rather than full Bayesian inference.…”

Section: Word (N) (mentioning)

confidence: 99%

Speech Segmentation Using Dynamic Windows and Thresholds for Arabic and English Languages

Jazyah

2018

Journal of Computer Science

Segmentation of audio data such as human speech (splitting each word in separate audio file-.WAV file) has been a major concern when working with multimedia such as recordings from radio or TV. The main focus of the segmentation of boundaries of spoken language has been on using energy and zero crossing thresholds for endpoint detection. Errors in endpoint detection are still a main cause of low accuracy of segmentation systems. The goal of this research is to develop an efficient algorithm in order to segment the speech of human in both languages of English and Arabic in different speaking speed with high accuracy. Simulation results show that the developed algorithm achieved high accuracy when segmenting human speech in English language up to 91.6% in average, while it is 89.0% of Arabic language.

“…Since the phone set and G2P components for each language needs to be developed before script preparation and given that it is possible to perform automatic phonetic alignment with as few as 20 utterances [9,10], it may be worthwhile attempting to develop a tool that can automatically flag potentially significant divergences in pronunciation for manual inspection once a prototypical speaker has been identified.…”

Section: Observations and Comments (mentioning)

confidence: 99%

Rapid Development of TTS Corpora for Four South African Languages

Niekerk

Heerden

Davel et al.

2017

Interspeech 2017

This paper describes the development of text-to-speech corpora for four South African languages. The approach followed investigated the possibility of using low-cost methods including informal recording environments and untrained volunteer speakers. This objective and the additional future goal of expanding the corpus to increase coverage of South Africa's 11 official languages necessitated experimenting with multi-speaker and code-switched data. The process and relevant observations are detailed throughout. The latest version of the corpora are available for download under an open-source licence and will likely see further development and refinement in future.
