2016
DOI: 10.1109/taslp.2015.2456421

HMM-Based Speech Segmentation: Improvements of Fully Automatic Approaches

Abstract: Speech segmentation refers to the problem of determining the phoneme boundaries from an acoustic recording of an utterance together with its orthographic transcription. This paper focuses on a particular case of Hidden Markov Model (HMM) based forced alignment in which the models are directly trained on the corpus to align. The obvious advantage of this technique is that it is applicable to any language or speaking style and does not require manually-aligned data. Through a systematic step-by-step study, the ro…

Citations: cited by 31 publications (10 citation statements)
References: 34 publications
“…The baseline is a 1-state monophone DNN/HSMM model. We use monophone model because our small dataset doesn't have enough phoneme instances for exploring the context-dependent triphones model, also Brognaux and Drugman [6] and Pakoci et al [10] argued that context-dependent model can't bring significant alignment improvement. It is convenient to apply 1-state model because each phoneme can be represented by a semi-Markovian state carrying a state occupancy time distribution.…”
Section: Baseline Methods; citation type: mentioning
confidence: 99%
“…MFCCs) to the HMM states. Brognaux and Drugman [6] explored the forced alignment on a small dataset using supplementary acoustic features and initializing the silence model by voice activity detection algorithm. To predict the confidence measure of the aligned word boundaries and to fine-tune their time positions, Serriére et al [7] explored an alignment postprocessing method using a deep neural network (DNN).…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…These features are the short-term energy, zero crossing rate and the singularity exponents calculated in each point of signal. While (Brognaux and Thomas, 2016) focuses on a particular case of hidden Markov model (HMM)-based forced alignment in which the models are directly trained on the corpus to align. Kamper et al (2017) introduces an approximation to a recent Bayesian model that still has a clear objective function but improves efficiency by using hard clustering and segmentation rather than full Bayesian inference.…”
Section: Word (N); citation type: mentioning
confidence: 99%
“…Since the phone set and G2P components for each language needs to be developed before script preparation and given that it is possible to perform automatic phonetic alignment with as few as 20 utterances [9,10], it may be worthwhile attempting to develop a tool that can automatically flag potentially significant divergences in pronunciation for manual inspection once a prototypical speaker has been identified.…”
Section: Observations and Comments; citation type: mentioning
confidence: 99%