Interspeech 2018
DOI: 10.21437/interspeech.2018-1010
Learning Acoustic Word Embeddings with Temporal Context for Query-by-Example Speech Search

Abstract: We propose to learn acoustic word embeddings with temporal context for query-by-example (QbE) speech search. The temporal context includes the leading and trailing word sequences of a word. We assume that there exist spoken word pairs in the training database. We pad the word pairs with their original temporal context to form fixed-length speech segment pairs. We obtain the acoustic word embeddings through a deep convolutional neural network (CNN) which is trained on the speech segment pairs with a triplet loss…
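The abstract describes training a deep CNN on speech segment pairs with a triplet loss. A minimal sketch of such a loss, applied to precomputed embedding vectors rather than the CNN itself (the function name and margin value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet margin loss on embedding vectors: pull the anchor toward the
    positive (a segment of the same word) and push it away from the negative
    (a segment of a different word) by at least `margin`.

    All three arguments are 1-D numpy arrays of equal length; `margin` is an
    assumed hyperparameter, not a value reported in the paper.
    """
    d_pos = np.linalg.norm(anchor - positive)  # distance to same-word segment
    d_neg = np.linalg.norm(anchor - negative)  # distance to different-word segment
    # Hinge: zero loss once the negative is at least `margin` farther away.
    return max(0.0, margin + d_pos - d_neg)
```

In training, the gradient of this loss would be backpropagated through the CNN producing the embeddings; the sketch only shows the objective being minimised.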

Cited by 28 publications (27 citation statements). References 22 publications.
“…Several supervised and unsupervised acoustic embedding methods have been proposed. Supervised methods include convolutional [11][12][13] and recurrent neural network (RNN) models [14][15][16][17], trained with discriminative classification and contrastive losses. Unsupervised methods include using distances to a fixed reference set [10] and unsupervised autoencoding RNNs [18][19][20].…”
Section: Introduction (mentioning)
confidence: 99%
“…A less direct approach consists of replicating the standard approach of natural language processing (NLP) of representing a word with a fixed-length vector (embedding). In [120,[132][133][134], this is extended by obtaining the word embedding directly from the audio. Once the embeddings are obtained, matching words is trivial and can be done using nearest neighbours [132].…”
Section: Query-by-example Spoken Term Detection (mentioning)
confidence: 99%
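The excerpt above notes that once fixed-length word embeddings are obtained, matching words reduces to nearest-neighbour search. A minimal sketch of that lookup using cosine similarity over unit-normalised embeddings (the function name and the choice of cosine similarity are assumptions for illustration, not details from the cited works):

```python
import numpy as np

def nearest_word(query_emb, index_embs):
    """Return the row index of the stored embedding most similar to the query.

    query_emb:  1-D array, the embedding of the spoken query word.
    index_embs: 2-D array, one row per indexed word embedding.
    Similarity is cosine: both sides are L2-normalised, then a dot product
    ranks candidates (larger = more similar).
    """
    q = query_emb / np.linalg.norm(query_emb)
    X = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = X @ q                 # cosine similarity to every indexed word
    return int(np.argmax(sims))  # best match wins
```

A brute-force scan like this is linear in the index size; at scale, approximate nearest-neighbour structures would replace the argmax, but the matching principle is the same.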
“…We use the same model in this work and expand it towards learning acoustic embeddings. [5,6,8,7,26,9] all explore ways to learn acoustic word embeddings. All above methods except [7] use unsupervised learning based methods to obtain these embeddings where they do not use the transcripts or do not perform speech recognition.…”
Section: Related Work (mentioning)
confidence: 99%