Random points on the boundary of smooth convex bodies

et al. 2017

Parallel inference of dirichlet process Gaussian mixture models for unsupervised acoustic modeling: a feasibility study

Chen¹,

Leung²,

et al. 2015

Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition

Chen

Mak

et al. 2014

An acoustic segment modeling approach to query-by-example spoken term detection

Wang

Lee

et al. 2012

Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection

Chen

et al. 2016

We propose a framework which ports Dirichlet Gaussian mixture model (DPGMM) based labels to deep neural network (DNN). The DNN trained using the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named as unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNF in queryby-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNF performs comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in terms of mean average precision (MAP) comparing with the cross-lingual BNFs. We also study the performance of the framework with different input features and different lengths of temporal context.

show abstract

Learning Acoustic Word Embeddings with Temporal Context for Query-by-Example Speech Search

Yuan

et al. 2018

We propose to learn acoustic word embeddings with temporal context for query-by-example (QbE) speech search. The temporal context includes the leading and trailing word sequences of a word. We assume that there exist spoken word pairs in the training database. We pad the word pairs with their original temporal context to form fixed-length speech segment pairs. We obtain the acoustic word embeddings through a deep convolutional neural network (CNN) which is trained on the speech segment pairs with a triplet loss. By shifting a fixed-length analysis window through the search content, we obtain a running sequence of embeddings. In this way, searching for the spoken query is equivalent to the matching of acoustic word embeddings. The experiments show that our proposed acoustic word embeddings learned with temporal context are effective in QbE speech search. They outperform the state-of-the-art frame-level feature representations and reduce run-time computation since no dynamic time warping is required in QbE speech search. We also find that it is important to have sufficient speech segment pairs to train the deep CNN for effective acoustic word embeddings. Index Terms: acoustic word embeddings, word pairs, temporal context, triplet loss, query-by-example spoken term detection

show abstract

Shifted-Delta MLP Features for Spoken Language Recognition

Wang

IEEE Signal Process. Lett.

Lee

et al. 2013