ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9053053

Phoneme Boundary Detection Using Learnable Segmental Features

Abstract: Phoneme boundary detection is an essential first step for a variety of speech processing applications such as speaker diarization, speech science, and keyword spotting. In this work, we propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection. First, we evaluated our model when the spoken phonemes were not given as input. Results on the TIMIT and Buckeye corpora suggest that the proposed model is superi…

Citations: cited by 17 publications (9 citation statements)
References: 22 publications (33 reference statements)
“…For phone segmentation on TIMIT, we simply hypothesize a boundary when a code changes. Our results do not rely on peak detection [24,22] or any segmentation algorithms [26]. We find that fewer codes leads to fewer transitions, showing good results on phone segmentation.…”
Section: Analysis Of Learned Codes (mentioning)
confidence: 70%
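The rule quoted above, hypothesizing a boundary wherever the code changes, can be illustrated with a minimal sketch. It assumes the input is a list of one discrete code per frame; the function name `boundaries_from_codes` and the example code sequence are hypothetical and not taken from the citing paper.

```python
def boundaries_from_codes(codes):
    """Return the frame indices at which the discrete code changes.

    Assumption: `codes` holds one integer code per frame. A boundary is
    placed at index i whenever codes[i] differs from codes[i - 1].
    """
    return [i for i in range(1, len(codes)) if codes[i] != codes[i - 1]]


if __name__ == "__main__":
    # Hypothetical frame-level codes, for illustration only.
    frame_codes = [3, 3, 3, 7, 7, 1, 1, 1, 1, 7]
    print(boundaries_from_codes(frame_codes))
```

With a rule like this, a smaller codebook gives fewer distinct neighbours and hence fewer hypothesized transitions, which is consistent with the observation in the quoted statement.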
“…Following [5], we also evaluate phone cluster purity on WSJ si284. For phone segmentation on TIMIT, we do not follow the setting in other studies [24,22]. The published results themselves are not entirely comparable due to subtle differences, such as data set split and preprocessing protocols.…”
Section: Methods (mentioning)
confidence: 99%
“…Although the simple DGP does not capture all dynamics of the speech signal, 7 out of 9 phoneme boundaries were correctly identified, with a time tolerance of 20 ms. A baseline detector that predicts segment boundaries from a uniform distribution was as good or better only in 69 out of 10000 runs (< 1%). This minimal experiment suggests that relaxed segmented models, when combined with more powerful DGPs, may be useful for discrete representation learning [43,46,14], in particular for learning segmental embeddings [27,48,9,29]. We consider this a fruitful direction for future work.…”
Section: Phoneme Segmentation (mentioning)
confidence: 98%
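Counting boundaries as correct within a time tolerance, as in the 20 ms criterion quoted above, might be sketched as follows. This is an illustrative one-to-one greedy matching, not the scoring protocol of the citing paper, which the excerpt does not specify. The name `count_hits` and the example times (in integer milliseconds) are assumptions.

```python
def count_hits(reference_ms, predicted_ms, tolerance_ms=20):
    """Count reference boundaries matched by a prediction within the tolerance.

    Assumption: each predicted boundary may match at most one reference
    boundary. Both lists are sorted and scanned once with two pointers.
    """
    reference = sorted(reference_ms)
    predicted = sorted(predicted_ms)
    hits = 0
    j = 0
    for ref in reference:
        # Skip predictions that fall too early to match this reference.
        while j < len(predicted) and predicted[j] < ref - tolerance_ms:
            j += 1
        if j < len(predicted) and abs(predicted[j] - ref) <= tolerance_ms:
            hits += 1
            j += 1  # consume the prediction so it cannot match again
    return hits


if __name__ == "__main__":
    # Hypothetical boundary times in milliseconds.
    reference = [100, 250, 400]
    predicted = [110, 300, 395, 405]
    hits = count_hits(reference, predicted)
    print("hits:", hits)
    print("recall:", hits / len(reference))
    print("precision:", hits / len(predicted))
```

Dividing the hit count by the number of reference boundaries gives recall, and dividing by the number of predictions gives precision.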
“…When the orthographic or phonetic information is not given (and the speech recognition is not employed), we speak about unsupervised speech segmentation [37,38] or phone/phoneme boundary detection [39,40]. This task usually considers phones/phonemes as segments with nearly invariable acoustic features, whereas the boundaries between them are points where these features are significantly changed [41]. The resulting segmentation may contain artificial phone-like units that may not correspond to actual phones.…”
Section: Related Work (mentioning)
confidence: 99%
“…The model can be used directly or after adaptation to the new language [56,57]. The cross-lingual modeling is also important for under-resourced languages, for example, Vietnamese [58] or Hebrew [40], where a cross-lingual system often performs better than a simple monolingual one.…”
Section: Related Work (mentioning)
confidence: 99%