Acoustic Segment Modeling with Spectral Clustering Methods

Wang, Haipeng; Lee, Tan; Leung, Cheung-Chi; Ma, Bin; Li, Haizhou

doi:10.1109/taslp.2014.2387382

Cited by 53 publications

(43 citation statements)

References 54 publications

(88 reference statements)


“…To overcome these limitations, model-based approaches have been investigated [24], [33], [34]. These approaches primarily rely on acoustic units discovered in an unsupervised manner.…”

Section: Prior Work (mentioning)

confidence: 99%

Sparse Subspace Modeling for Query by Example Spoken Term Detection

Ram

Asaei

Bourlard

2018

IEEE/ACM Trans. Audio Speech Lang. Process.


This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. Current state-of-the-art approaches to tackle this problem rely on dynamic programming based template matching techniques using phone posterior features extracted at the output of a deep neural network (DNN). Previously, it has been shown that the space of phone posteriors is highly structured, as a union of low-dimensional subspaces. To exploit the temporal and sparse structure of the speech data, we investigate here three different QbE-STD systems based on sparse model recovery. More specifically, we use query examples to model the query subspace using a dictionary for sparse coding. Reconstruction errors calculated using sparse representation of feature vectors are then used to characterize the underlying subspaces. The first approach uses these reconstruction errors in a dynamic programming framework to detect the spoken query, resulting in a much faster search compared to standard template matching. The other two methods aim at merging template matching and sparsity based approaches to further improve the performance. The first one proposes to regularize the template matching local distances using sparse reconstruction errors. The second approach aims at using the sparse reconstruction errors to rescore (improve) the template matching likelihood. Experiments on two different databases (AMI and MediaEval) show that the proposed hybrid systems perform better than a highly competitive QbE-STD baseline system.


“…Open-source tools [13] are used to train FHVAEs. 1 "speakers-R/-L" denotes speakers with rich/limited speech data. In our preliminary experiments, the ABX performance of z1 was found to be sensitive to the input segment length l. This could be explained as: a too large l would reduce the capability of z1 in modeling linguistic content at subword level; a too small l would restrict the FHVAE from capturing sufficient temporal dependencies which are essential in modeling speech.…”

Section: FHVAE Setup and Parameter Tuning (mentioning)

confidence: 99%

“…UAM is a challenging problem with significant practical impact in speech as well as linguistics and cognitive science communities. It has been studied in applications such as ASR for low-resource languages [1], language identification [2] and query-by-example spoken term detection [3]. This problem is also relevant to endangered language protection [4] and understanding infants' language acquisition mechanism [5].…”

Section: Introduction (mentioning)

confidence: 99%

Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation

Feng

Lee

2019

Interspeech 2019

Self Cite


This study tackles unsupervised subword modeling in the zero-resource scenario, learning frame-level speech representation that is phonetically discriminative and speaker-invariant, using only untranscribed speech for target languages. Frame label acquisition is an essential step in solving this problem. High quality frame labels should be in good consistency with golden transcriptions and robust to speaker variation. We propose to improve frame label acquisition in our previously adopted deep neural network-bottleneck feature (DNN-BNF) architecture by applying the factorized hierarchical variational autoencoder (FHVAE). FHVAEs learn to disentangle linguistic content and speaker identity information encoded in speech. By discarding or unifying speaker information, speaker-invariant features are learned and fed as inputs to DPGMM frame clustering and DNN-BNF training. Experiments conducted on ZeroSpeech 2017 show that our proposed approaches achieve 2.4% and 0.6% absolute ABX error rate reductions in across- and within-speaker conditions, compared to the baseline DNN-BNF system without applying FHVAEs. Our proposed approaches significantly outperform vocal tract length normalization in improving frame labeling and subword modeling.


“…Unsupervised spoken term detection techniques, which aim at automatically discovering acoustic patterns (e.g., for training acoustic models) for languages for which manual transcriptions and linguistic knowledge are scarce, have been also investigated [34,35]. These techniques can also be employed for building language-independent QbE STD systems, since prior knowledge of the language is not necessary.…”

Section: Introduction (mentioning)

confidence: 99%

Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations

Tejedor

Toledano

López-Otero

et al. 2016

J AUDIO SPEECH MUSIC PROC.


Query-by-example spoken term detection (QbE STD) aims at retrieving data from a speech repository given an acoustic query containing the term of interest as input. Nowadays, it is receiving much interest due to the large volume of multimedia information. This paper presents the systems submitted to the ALBAYZIN QbE STD 2014 evaluation held as a part of the ALBAYZIN 2014 Evaluation campaign within the context of the IberSPEECH 2014 conference. This is the second QbE STD evaluation in Spanish, which allows us to evaluate the progress in this technology for this language. The evaluation consists in retrieving the speech files that contain the input queries, indicating the start and end times where the input queries were found, along with a score value that reflects the confidence given to the detection of the query. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from workshops, which amount to about 7 h of speech. We present the database, the evaluation metric, the systems submitted to the evaluation, the results, and compare this second evaluation with the first ALBAYZIN QbE STD evaluation held in 2012. Four different research groups took part in the evaluations held in 2012 and 2014. In 2014, new multi-word and foreign queries were added to the single-word and in-language queries used in 2012. Systems submitted to the second evaluation are hybrid systems which integrate letter transcription- and template matching-based systems. Despite the significant improvement obtained by the systems submitted to this second evaluation compared to those of the first evaluation, results still show the difficulty of this task and indicate that there is still room for improvement.
