2019
DOI: 10.3389/frobt.2019.00092
Unsupervised Phoneme and Word Discovery From Multiple Speakers Using Double Articulation Analyzer and Neural Network With Parametric Bias

Abstract: This paper describes a new unsupervised machine learning method for simultaneous phoneme and word discovery from multiple speakers. Human infants can acquire knowledge of phonemes and words from interactions with their mother as well as with others around them. From a computational perspective, phoneme and word discovery from multiple speakers is a more challenging problem than discovery from a single speaker because the speech signals from different speakers exhibit different acoustic features. This paper pro…


Cited by 7 publications (5 citation statements)
References 45 publications
“…I framed both as instances of a duration-penalized dynamic programming (DPDP) procedure, where a self-supervised neural scoring function is combined with a penalty term that encourages longer segments. I chained the two models to do word segmentation from speech and compared this [6] to existing direct and joint self-supervised approaches on standard benchmarks.…”
[6] To get the recall for a particular word type, the number of correctly segmented word tokens (where both boundaries are correct without an intermediate prediction) is divided by the total number of tokens of this type.
Section: Conclusion, Discussion, and Future Work
confidence: 99%
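The footnote's per-type recall metric can be sketched in a few lines. This is an illustrative reconstruction, not the cited author's code; the function name, the `(word, start, end)` token encoding, and the toy data are all assumptions:

```python
from collections import Counter

def type_recall(ref_tokens, pred_segments):
    """Recall per word type: tokens whose exact (start, end) span appears
    in the predicted segmentation (both boundaries correct, hence no
    intermediate boundary) divided by all tokens of that type."""
    total = Counter(w for w, _, _ in ref_tokens)
    correct = Counter(w for w, s, e in ref_tokens if (s, e) in pred_segments)
    return {w: correct[w] / total[w] for w in total}

# Toy reference tokens (word, start, end) and a predicted segmentation
# that wrongly splits "time" at position 7.
ref = [("a", 0, 1), ("long", 1, 5), ("time", 5, 9), ("a", 9, 10)]
pred = {(0, 1), (1, 5), (5, 7), (7, 9), (9, 10)}
print(type_recall(ref, pred))  # → {'a': 1.0, 'long': 1.0, 'time': 0.0}
```

Representing the prediction as a set of spans makes the "no intermediate prediction" condition automatic: a token's span can only be in the set if no predicted boundary falls inside it.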
“…A segmentation S can be specified as a sequence of (start, end) tuples. The red solid path in Figure 1 corresponds to the segmentation S = ((1, 3), (4, 4), (5, 6), (7, 7), (8, 9)), giving the result "alo n gt i me". The blue dashed path is a different segmentation with a different cost.…”
Section: Duration-Penalized Dynamic Programming (DPDP)
confidence: 99%
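Choosing the cheapest such path is what a duration-penalized DP minimizes: the cumulative segment cost plus a fixed per-segment penalty, so fewer (longer) segments are favoured when frame costs tie. A minimal sketch under assumed conventions (the function name, the score-table interface, and the tie-breaking are illustrative, not the cited work's implementation):

```python
def dpdp_segment(scores, penalty):
    """Duration-penalized DP. scores[(i, j)] is a lower-is-better cost of
    treating positions i..j (inclusive, 1-indexed) as one segment; a fixed
    per-segment `penalty` is added for every segment used, which favours
    segmentations with fewer, longer segments. Returns the cheapest
    segmentation as a list of (start, end) tuples."""
    n = max(j for _, j in scores)
    best = {0: (0.0, None)}  # best[j] = (min cost to cover 1..j, start of last segment)
    for j in range(1, n + 1):
        best[j] = min(
            (best[i - 1][0] + scores[(i, j)] + penalty, i)
            for i in range(1, j + 1)
            if (i, j) in scores and (i - 1) in best
        )
    segs, j = [], n  # backtrack along the stored segment starts
    while j > 0:
        _, i = best[j]
        segs.append((i, j))
        j = i - 1
    return list(reversed(segs))

# With zero frame costs, the penalty dominates: one long segment wins.
full = {(i, j): 0.0 for i in range(1, 10) for j in range(i, 10)}
print(dpdp_segment(full, penalty=1.0))   # → [(1, 9)]
# With quadratic segment costs and no penalty, splitting is always cheaper.
quad = {(i, j): (j - i + 1) ** 2 for i in range(1, 10) for j in range(i, 10)}
print(dpdp_segment(quad, penalty=0.0))   # → nine single-position segments
```

In the cited work the per-segment score comes from a self-supervised neural model; here any cost table works, which is what makes the two chained models instances of the same procedure.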
“…The language model represents the transition probabilities between words, and the acoustic model represents the relationship between each phoneme and the acoustic features. Acoustic features are obtained by dimensional compression of the speech spectrum, either by conversion to Mel-Frequency Cepstral Coefficients (MFCC) or by a deep sparse autoencoder with parametric bias in the hidden layer (DSAE-PBHL) [16].…”
Section: B. Existing Computational Models for Double Articulation Analysis
confidence: 99%
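As one concrete illustration of the MFCC-style dimensional compression mentioned above, here is a toy single-frame pipeline in plain NumPy (power spectrum → triangular mel filterbank → log → DCT-II). All names and constants are illustrative; this is neither the paper's DSAE-PBHL nor a production MFCC implementation:

```python
import numpy as np

def mfcc_like(frame, sr=16000, n_mels=26, n_ceps=13):
    """Toy MFCC-style compression of one windowed speech frame:
    power spectrum -> triangular mel filterbank -> log -> DCT-II,
    keeping the first n_ceps coefficients."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2                      # power spectrum
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Filter centre frequencies equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):                              # triangular filters
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(fbank @ spec + 1e-10)                       # log mel energies
    k, m = np.arange(n_ceps), np.arange(n_mels)
    dct_mat = np.cos(np.pi * np.outer(k, m + 0.5) / n_mels)     # DCT-II basis
    return dct_mat @ logmel                                     # shape (n_ceps,)

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # 440 Hz test tone
coeffs = mfcc_like(frame)
print(coeffs.shape)  # → (13,)
```

The point of both MFCC and DSAE-PBHL in the quoted passage is the same: mapping a high-dimensional spectrum to a compact feature vector, with DSAE-PBHL additionally learning to factor out speaker identity via its parametric bias.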