Tackling Perception Bias in Unsupervised Phoneme Discovery Using DPGMM-RNN Hybrid Model and Functional Load

Wu, Bin; Sakti, Sakriani; Zhang, Jinsong; Nakamura, Satoshi

doi:10.1109/taslp.2020.3042016

Cited by 1 publication

(15 citation statements)

References 56 publications

“…Such sensitivity makes DPGMM clustering uncertain for assigning clusters to frames and creates small, random cluster segments inside a phoneme. This is DPGMM's "fragmentation problem" [49].…”

Section: Modeling Unsupervised Empirical Adaptation By DPGMM-RNN Hybr... (mentioning)

confidence: 99%
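As a rough illustration of what fragmentation looks like at the frame level, one can count how many contiguous cluster segments fall inside a single phoneme span. This is a minimal sketch: the label sequences are made up, and `count_segments` and `segment_lengths` are hypothetical helpers, not functions from the cited papers.

```python
from itertools import groupby


def count_segments(frame_labels):
    """Count contiguous runs of identical cluster labels."""
    return sum(1 for _ in groupby(frame_labels))


def segment_lengths(frame_labels):
    """Return the length (in frames) of each contiguous run."""
    return [len(list(run)) for _, run in groupby(frame_labels)]


# Hypothetical frame-level cluster labels for ONE phoneme spanning 10 frames.
stable = [7] * 10                               # one cluster for the whole phoneme
fragmented = [7, 7, 3, 7, 12, 12, 7, 3, 3, 7]   # labels flip between clusters

print(count_segments(stable))
print(count_segments(fragmented))
print(segment_lengths(fragmented))
```

A single run means the phoneme maps to one stable cluster; many short runs are the small, random segments that the quoted statement describes.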

“…In unsupervised phoneme discovery, DPGMM tends to suffer from a fragmentation problem when the model encounters the frames from such acoustically complex phonemes as a fricative with noise-like high frequencies or a vowel with rapid formant transitions [49], [50]. DPGMM tends to generate more clusters than the number of phonemes in any human language [30], [50] when it struggles to discriminate between complex phonemes with higher resolution.…”

Section: Modeling Unsupervised Empirical Adaptation By DPGMM-RNN Hybr... (mentioning)

confidence: 99%

“…We propose to use the DPGMM-RNN hybrid model [49], which enhances DPGMM, to model unsupervised empirical adaptation to improve ASR. The DPGMM-RNN hybrid model 1) improves temporal modeling and 2) relieves fragmentation problems of DPGMM with RNN to relearn the connection between acoustic features and DPGMM cluster labels or posterior vectors by listening to feature chunks instead of concentrating on trivial details at the frame level like DPGMM.…”

Section: Modeling Unsupervised Empirical Adaptation By DPGMM-RNN Hybr... (mentioning)

confidence: 99%

“…The DPGMM-RNN hybrid model relieved the fragmentation problem and decreased the fragmental level measured by the conditional perplexity [51] and the v-measure [52]. It also reduced the number of clusters of DPGMM [49] and outperformed DPGMM in unsupervised phoneme discovery on datasets from Zerospeech 2019 with an ABX discrimination test at a moderate bitrate [49].…”

Section: Modeling Unsupervised Empirical Adaptation By DPGMM-RNN Hybr... (mentioning)

confidence: 99%
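The v-measure mentioned in the statement above is the harmonic mean of homogeneity and completeness. The sketch below is one plain-Python way to compute it from frame-level labels, assuming equal weighting of the two terms; the function names and the toy labels are illustrative and do not come from the cited papers.

```python
import math
from collections import Counter


def _entropy(counts, n):
    """Entropy in bits of a distribution given by raw counts."""
    return -sum((k / n) * math.log2(k / n) for k in counts if k > 0)


def _cond_entropy(joint, given, n):
    """H(X|Y) in bits from joint counts {(x, y): k} and marginal counts {y: k}."""
    return -sum((k / n) * math.log2(k / given[y]) for (_, y), k in joint.items())


def v_measure(truth, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting assumed)."""
    n = len(truth)
    t_counts, c_counts = Counter(truth), Counter(clusters)
    h_t = _entropy(t_counts.values(), n)
    h_c = _entropy(c_counts.values(), n)
    h_t_given_c = _cond_entropy(Counter(zip(truth, clusters)), c_counts, n)
    h_c_given_t = _cond_entropy(Counter(zip(clusters, truth)), t_counts, n)
    homogeneity = 1.0 if h_t == 0 else 1.0 - h_t_given_c / h_t
    completeness = 1.0 if h_c == 0 else 1.0 - h_c_given_t / h_c
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy data: two phonemes, four frames each (hypothetical labels).
truth = ["a"] * 4 + ["s"] * 4
one_cluster_each = [0, 0, 0, 0, 1, 1, 1, 1]
two_clusters_each = [0, 0, 1, 1, 2, 2, 3, 3]   # each phoneme split in two

print(v_measure(truth, one_cluster_each))
print(v_measure(truth, two_clusters_each))
```

In the split case every cluster is still pure, so homogeneity stays at 1, but completeness drops to 0.5, which pulls the v-measure down. That is the sense in which a fragmented clustering scores worse on this metric.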

“…We analyze the fragmental level of the generated representations from the DPGMM or DPGMM-RNN hybrid model with the conditional perplexity of cluster given phoneme [49], which reflects the average number of cluster types corresponding to one phoneme type. We define the conditional perplexity as the exponential, with base 2, of the conditional entropy [51] of the cluster representation (C) given the phoneme truth (T): $2^{H(C \mid T)}$ with $H(C \mid T) = -\sum_{t} \sum_{c} \frac{n_{ct}}{n} \log_2 \frac{n_{ct}}{n_t}$, where $n$ is the number of frames, $n_t$ is the number of frames of phoneme truth $t$, and $n_{ct}$ is the number of frames annotated as phoneme $t$ and clustered as cluster $c$. We analyze the discriminability (D) between the representations of a phoneme pair ($t_1$ and $t_2$) by the KL divergence between the conditional distributions of cluster representation (C) given the phoneme.…”

Section: Conditional Perplexity and KL Divergence (mentioning)

confidence: 99%
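The two quantities in this statement can be computed directly from frame-level counts. The following is a minimal sketch using the quoted definitions of n, n_t, and n_ct. The quote does not say how empty clusters are handled or which log base the KL divergence uses, so the additive smoothing constant `eps`, the base-2 logarithm in the KL term, and all function names here are assumptions made for illustration.

```python
import math
from collections import Counter


def conditional_perplexity(truth, clusters):
    """2 ** H(C|T), with H(C|T) = -sum_{c,t} (n_ct / n) * log2(n_ct / n_t)."""
    n = len(truth)
    n_t = Counter(truth)                    # frames per phoneme
    n_ct = Counter(zip(clusters, truth))    # frames per (cluster, phoneme) pair
    h = -sum((k / n) * math.log2(k / n_t[t]) for (_, t), k in n_ct.items())
    return 2.0 ** h


def cluster_distribution(truth, clusters, phoneme, support, eps):
    """Smoothed P(C | T = phoneme) over a fixed cluster support (eps is an assumption)."""
    counts = Counter(c for c, t in zip(clusters, truth) if t == phoneme)
    total = sum(counts.values()) + eps * len(support)
    return {c: (counts[c] + eps) / total for c in support}


def kl_divergence(truth, clusters, t1, t2, eps=1e-6):
    """KL( P(C|t1) || P(C|t2) ) in bits; direction and base are assumptions."""
    support = sorted(set(clusters))
    p = cluster_distribution(truth, clusters, t1, support, eps)
    q = cluster_distribution(truth, clusters, t2, support, eps)
    return sum(p[c] * math.log2(p[c] / q[c]) for c in support)


# Toy data: two phonemes, four frames each (hypothetical labels).
truth = ["a"] * 4 + ["s"] * 4
compact = [0, 0, 0, 0, 1, 1, 1, 1]       # one cluster per phoneme
fragmented = [0, 0, 1, 1, 2, 2, 3, 3]    # two clusters per phoneme

print(conditional_perplexity(truth, compact))
print(conditional_perplexity(truth, fragmented))
print(kl_divergence(truth, fragmented, "a", "s"))
```

On the toy data the perplexity equals the number of clusters each phoneme is spread over (1 and 2), matching the reading of the metric as the average number of cluster types per phoneme type. A larger KL value for a phoneme pair means their cluster distributions overlap less.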

Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR

Sakti, Zhang, et al. 2022

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

Speech feature extraction is critical for ASR systems. Such successful features as MFCC and PLP use filterbank techniques to model log-scaled speech perception but fail to model the adaptation of human speech perception by hearing experiences. Infant perception that is adapted by hearing speech without text may cause permanent brain state modifications (engrams) that serve as a physical fundamental basis for lifetime speech perception formation. This realization motivates us to propose to model such an unsupervised adaptation process, where adaptation denotes perception that is affected or changed by the history of experiences, with the Dirichlet Process Gaussian Mixture Model (DPGMM) and the DPGMM-RNN hybrid model to extract perceptual features to improve ASR. Our proposed features extend MFCC features with posteriorgrams extracted from the DPGMM algorithm or the DPGMM-RNN hybrid model. Our analysis shows that the DPGMM and DPGMM-RNN model perplexities agree with infant auditory perplexity to support that the proposed features are perceptual. Our ASR results verify the effectiveness of the proposed unsupervised features in such tasks as LVCSR on WSJ and ASR on noisy low-resource telephone conversations, compared with the supervised bottleneck features from Kaldi in ASR performance.
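The abstract says the proposed features extend MFCC features with posteriorgrams. A minimal sketch of one way to read that, assuming "extend" means frame-wise concatenation, is shown below; the dimensions, values, and the helper name `extend_features` are hypothetical and not taken from the paper.

```python
def extend_features(mfcc_frames, posterior_frames):
    """Concatenate each MFCC vector with the posterior vector of the same frame."""
    if len(mfcc_frames) != len(posterior_frames):
        raise ValueError("both inputs must cover the same number of frames")
    return [list(m) + list(p) for m, p in zip(mfcc_frames, posterior_frames)]


# Hypothetical input: 3 frames, 4-dim MFCC, posteriors over 3 clusters.
mfcc = [
    [1.2, -0.4, 0.3, 0.0],
    [1.1, -0.5, 0.2, 0.1],
    [0.9, -0.1, 0.4, 0.2],
]
posteriors = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]

features = extend_features(mfcc, posteriors)
print(len(features), len(features[0]))
```

Each output frame keeps its acoustic MFCC part unchanged and gains one extra dimension per cluster.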
