2013 IEEE Workshop on Automatic Speech Recognition and Understanding
DOI: 10.1109/asru.2013.6707761
A hierarchical system for word discovery exploiting DTW-based initialization

Cited by 30 publications (60 citation statements)
References 12 publications
“…While useful, the discovered patterns are typically isolated segments spread out over the data, leaving much speech as background. This has prompted several studies on full-coverage approaches, where the entire speech input is segmented and clustered into word-like units [17][18][19][20][21].…”
Section: Introduction
confidence: 99%
“…Figure 6 shows an example of a two level hierarchical representation of a speech signal. On the first hierarchical level the aim is to discover the acoustic building blocks of speech, the phonemes, and to learn a statistical model for each of them, the acoustic model [11,56,53,47]. In speech recognition, the acoustic model usually consists of Hidden Markov Models (HMMs), where each HMM emits a time series of vectors of cepstral coefficients.…”
Section: Representation Learning From Sequential Data
confidence: 99%
“…In the related task of acoustic pattern discovery, DTW can be allowed to consider multiple local alignments between speech signals during the overall search [8]. In this way DTW can find similar segment pairs in speech audio, followed by a clustering step [9]. The resulting cluster labels are used to train hidden Markov models (HMMs).…”
Section: Introduction
confidence: 99%
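The excerpt above describes DTW aligning pairs of speech feature sequences before clustering. As a minimal sketch of the underlying idea, the following is a textbook single-path DTW between two feature sequences under a Euclidean local cost; it is an illustration only, not the multiple-local-alignment variant the cited work [8] uses, and the function name and shapes are assumptions for this example.

```python
import numpy as np

def dtw_distance(x, y):
    """Cumulative cost of the best monotonic alignment between two
    feature sequences x (n, d) and y (m, d), Euclidean local cost.
    Illustrative sketch, not the multi-alignment search in [8]."""
    n, m = len(x), len(y)
    # Pairwise Euclidean local costs between all frame pairs.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Accumulated-cost matrix with an extra boundary row/column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # stretch x (insertion)
                acc[i, j - 1],      # stretch y (deletion)
                acc[i - 1, j - 1],  # match both frames
            )
    return acc[n, m]

# A sequence aligned against a time-stretched copy of itself
# incurs zero cost, which is the property DTW-based pattern
# discovery exploits when matching repeated spoken words.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
print(dtw_distance(a, b))  # 0.0
```

In full-coverage pattern discovery, such pairwise alignment costs feed a clustering step, and the resulting cluster labels then serve as transcripts for training the HMMs mentioned in the excerpt.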