A segmental framework for fully-unsupervised large-vocabulary speech recognition

Kamper, Herman; Jansen, Aren; Goldwater, Sharon

doi:10.1016/j.csl.2017.04.008

Cited by 90 publications

(134 citation statements)

References 53 publications

(162 reference statements)

Supporting

Mentioning

132

Contrasting


“…A recently emerged area of speech technology research is the so-called zero-resource speech processing (ZS) initiative where the aim is to create systems capable of learning structural representations of speech input in the absence of any data labeling [1][2][3], providing both scalability towards under-resourced domains and illuminating how human infants may learn spoken languages. A number of the existing ZS systems, including the best performing system at the word-level [1] in the Interspeech-2015 Zerospeech challenge and the state-of-the-art system in [2] are based on clustering and temporal grouping of syllable-like rhythmic units.…”

Section: Introduction (mentioning)

confidence: 99%

“…A number of the existing ZS systems, including the best performing system at the word-level [1] in the Interspeech-2015 Zerospeech challenge and the state-of-the-art system in [2] are based on clustering and temporal grouping of syllable-like rhythmic units. The system in [1] first segments speech into syllable-like chunks, clusters the resulting tokens into categories using K-means, and decodes words as recurring n-grams over the syllabic clusters in the data.…”

Section: Introduction (mentioning)

confidence: 99%

“…The system in [1] first segments speech into syllable-like chunks, clusters the resulting tokens into categories using K-means, and decodes words as recurring n-grams over the syllabic clusters in the data. The work in [2] extends this method by creating a Bayesian segmental model that jointly optimizes word category identities (clustering, using a Bayesian GMM) and boundaries chosen from the syllable-like chunks (segmentation pruning). However, both systems used a heuristically set number of clusters for the data.…”

Section: Introduction (mentioning)

confidence: 99%
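The last step described in the statement above, decoding words as recurring n-grams over syllabic cluster labels, can be illustrated with a minimal sketch. It assumes segmentation and K-means clustering have already produced one list of integer cluster IDs per utterance; the function name `recurring_ngrams`, its parameters, and the toy label sequences are hypothetical and are not taken from the cited system.

```python
from collections import Counter


def recurring_ngrams(utterances, max_n=3, min_count=2):
    """Count n-grams of cluster labels and keep those seen at least min_count times.

    utterances: list of lists of cluster IDs (assumed output of a clustering step).
    """
    counts = Counter()
    for labels in utterances:
        for n in range(1, max_n + 1):
            # Slide a window of length n over the label sequence.
            for i in range(len(labels) - n + 1):
                counts[tuple(labels[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


# Hypothetical cluster IDs, one list per utterance.
utterances = [[4, 7, 2, 9], [1, 4, 7, 2], [9, 9, 3]]
found = recurring_ngrams(utterances, max_n=3, min_count=2)
```

Here the label sequence (4, 7, 2) occurs in two utterances and is kept as a word candidate, while (2, 9) occurs only once and is dropped. A real system would also need a rule for choosing among overlapping candidates, which this sketch leaves out.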

“…In addition, the fixed-dimensional spectral representations such as those used in [1,2] could also be modelled using some other parametric distribution than GMM. One candidate is the cosine distance -based von Mises-Fisher mixture model (here: VMM) [14] that is more suited for high-dimensional density estimation than GMMs [15] as long as the feature vectors can be unit normalised before clustering (see, e.g., [2]).…”

Section: Introduction (mentioning)

confidence: 99%

“…One candidate is the cosine distance -based von Mises-Fisher mixture model (here: VMM) [14] that is more suited for high-dimensional density estimation than GMMs [15] as long as the feature vectors can be unit normalised before clustering (see, e.g., [2]). …”

Section: Introduction (mentioning)

confidence: 99%
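The unit-normalisation requirement mentioned above can be made concrete with a small sketch. A von Mises-Fisher density depends on a data point only through its dot product with the mean direction, so it is defined on unit vectors; for such vectors, squared Euclidean distance and cosine similarity are tied by ||a - b||² = 2 - 2·cos(a, b). The helper names and the two toy vectors below are illustrative assumptions, not part of the cited work.

```python
import math


def unit_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Two toy feature vectors (hypothetical values).
a = unit_normalise([3.0, 4.0])
b = unit_normalise([4.0, 3.0])

cos_ab = dot(a, b)
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
```

After normalisation, `sq_dist` agrees with `2 - 2 * cos_ab` up to floating-point error, which is why clustering unit vectors by Euclidean distance and by cosine distance ranks neighbours identically.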


Comparison of Non-Parametric Bayesian Mixture Models for Syllable Clustering and Zero-Resource Speech Processing

Seshadri, Remes, Räsänen

Interspeech 2017


Zero-resource speech processing (ZS) systems aim to learn structural representations of speech without access to labeled data. A starting point for these systems is the extraction of syllable tokens utilizing the rhythmic structure of a speech signal. Several recent ZS systems have therefore focused on clustering such syllable tokens into linguistically meaningful units. These systems have so far used a heuristically set number of clusters, which can, however, be highly dataset dependent and cannot be optimized in actual unsupervised settings. This paper focuses on improving the flexibility of ZS systems using Bayesian non-parametric (BNP) mixture models that are capable of simultaneously learning the cluster models as well as their number based on the properties of the dataset. We also compare different model design choices, namely priors over the weights and the cluster component models, as the impact of these choices is rarely reported in previous studies. Experiments are conducted using conversational speech from several languages. The models are first evaluated in a separate syllable clustering task and then as a part of a full ZS system in order to examine the potential of BNP methods and illuminate the relative importance of different model design choices.
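To illustrate how a Bayesian non-parametric prior lets the number of clusters grow with the data rather than being fixed in advance, the sketch below samples cluster assignments from a Chinese restaurant process, one standard construction of this kind. The abstract does not say which priors the paper uses, so this is an assumption chosen for illustration; the function name `sample_crp`, the concentration value, and the seed are hypothetical, and the sketch covers only the prior, not inference over acoustic features.

```python
import random


def sample_crp(n_items, alpha, seed=0):
    """Draw cluster assignments from a Chinese restaurant process prior.

    Each item joins existing cluster k with probability proportional to its
    current size, or opens a new cluster with probability proportional to alpha.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []  # assignments[i] = cluster index of item i
    for _ in range(n_items):
        weights = counts + [alpha]  # last entry stands for "open a new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


assignments, counts = sample_crp(200, alpha=1.0)
```

The number of clusters, `len(counts)`, is an outcome of the sampling rather than an input. Larger `alpha` values tend to produce more clusters, so the heuristic choice of a cluster count is replaced by a choice of concentration parameter.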

