Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval

Lee, Lin-Shan; Glass, James; Lee, Hung-yi; Chan, Chin-Feng

doi:10.1109/taslp.2015.2438543

Cited by 100 publications

(58 citation statements)

References 289 publications

(299 reference statements)

“…All the queries are distributed across all the speakers such that at least one speaker contains at least one query. The performance of QbE-STD is measured in terms of precision@N (p@N) and Mean Average Precision (MAP) [2]. The value of N varies according to the query (from 7 to 20).…”

Section: A Experimental Dataset (mentioning)

confidence: 99%
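The two metrics named in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' evaluation code: the function names and the representation of a ranked result list as booleans (True for a correct detection) are assumptions made for the example.

```python
def precision_at_n(ranked_hits, n):
    """Fraction of the top-n ranked detections that are relevant.

    ranked_hits: list of booleans ordered by decreasing detection score
    (hypothetical input format, assumed for this sketch).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(ranked_hits[:n]) / n


def average_precision(ranked_hits, num_relevant):
    """Average of the precision values at each rank where a relevant item occurs."""
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_hits, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / num_relevant


def mean_average_precision(per_query):
    """Mean of average precision over queries.

    per_query: list of (ranked_hits, num_relevant) pairs, one per query.
    """
    if not per_query:
        raise ValueError("at least one query is required")
    return sum(average_precision(h, r) for h, r in per_query) / len(per_query)
```

Because N is passed per call, it can differ from query to query, as the statement describes; how N is chosen for each query is not specified in the excerpt and is left to the caller here.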

“…QbE-STD directly exploits the acoustic-level information for matching between spoken documents and a spoken query without transcribing them into phonemes or words. QbE-STD is important for low-resourced languages and under non-mainstream conditions and hence, it was also called an unsupervised STD [1], [2]. As a part of the MediaEval campaign, the Spoken Web Search (SWS) was started in 2011 [3].…”

Section: Introduction (mentioning)

confidence: 99%

VTLN-warped Gaussian posteriorgram for QbE-STD

Madhavi

Patil

2017

2017 25th European Signal Processing Conference (EUSIPCO)

Abstract: Vocal Tract Length Normalization (VTLN) is a very important speaker normalization technique for speech recognition tasks. In this paper, we propose the use of Gaussian posteriorgram of VTLN-warped spectral features for a Query-by-Example Spoken Term Detection (QbE-STD). This paper presents the use of a Gaussian Mixture Model (GMM) framework for estimation of VTLN warping factor. This GMM framework does not require phoneme-level transcription and hence, it can be useful for unsupervised tasks. We propose the iterative approach for VTLN warping factor estimation with two GMM training approaches, namely, Expectation-Maximization (EM) and Deterministic Annealing-Expectation Maximization (DAEM). The VTLN-warped Gaussian posteriorgram gave the better QbE-STD performance. The performance of TIMIT QbE-STD was investigated with different evaluation factors, such as a number of Gaussian components in GMM, various local constraints, and a number of iterations in VTLN warping factor estimation. VTLN-warped Gaussian posteriorgram reduces the speaker-specific variation in Gaussian posteriorgram and hence, it is expected to give better performance than Gaussian posteriorgram.

“…These systems can be broadly classified into following categories [2]. Cascaded ASR with text information retrieval: The spoken content is converted into word or sub word sequences or lattices using ASR and then text retrieval techniques are applied.…”

Section: Overview (mentioning)

confidence: 99%

Usage of acoustic cues in spoken term detection keyword spotting for zero low resource languages

Sreedhar

Suryakanth

2017

Fifth International Conference on Advances in Computing, Communication and Information Technology - CCIT 2017

Abstract: The proposed work exploits acoustic cues at various levels and incorporates them into the present Spoken Term Detection (STD) framework. A recently proposed syllabification method [1] for speech signals is used for STD. In STD, a query and a reference speech signal are provided; both are syllabified using the new method, and features such as Mel-frequency cepstral coefficients (MFCC) and posteriorgrams are extracted. These features are then matched using template-based matching techniques such as dynamic time warping (DTW) at the syllable level instead of the usual frame level. This reduces the unwanted matching done at the frame level.
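The DTW matching step mentioned in this abstract can be illustrated with a textbook formulation. This is a minimal sketch, not the authors' system: the function names, the Euclidean local distance, and the symmetric step pattern are assumptions. Each sequence element stands for one feature vector, which could be a frame-level or a syllable-level unit.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(query, reference, dist=euclidean):
    """Accumulated cost of the best monotonic alignment of two sequences.

    query, reference: non-empty lists of feature vectors (lists of floats).
    Uses the basic step pattern (insertion, deletion, match) with no
    slope constraints or path-length normalization.
    """
    n, m = len(query), len(reference)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # cost[i][j] = best accumulated cost aligning query[:i] with reference[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # reference unit skipped
                                 cost[i][j - 1],      # query unit stretched
                                 cost[i - 1][j - 1])  # one-to-one match
    return cost[n][m]
```

This version aligns the whole query against the whole reference. Detecting a term inside a longer utterance would need a subsequence variant, which the abstract does not describe and which is not shown here.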

“…Nowadays, it is receiving much importance due to the large volume of multimedia information. Research and technology improvements in automated speech recognition successfully achieved the information retrieval by using the transcribed textual form of the spoken contents [1]. Similarly, due to the exponential growth of internet and multimedia contents, the STD methods have been achieving much popularity.…”

Section: Introduction (mentioning)

confidence: 99%

An Intelligent System for Spoken Term Detection That Uses Belief Combination

Khan

Kuru

2017

IEEE Intell. Syst.

Spoken Term Detection (STD) can be considered as a sub-part of the automatic speech recognition which aims to extract the partial information from speech signals in the form of query utterances. A variety of STD techniques available in the literature employ a single source of evidence for the query utterance match/mismatch determination. In this manuscript, we develop an acoustic signal processing based approach for STD that incorporates a number of techniques for silence removal, dynamic noise filtration, and evidence combination using Dempster-Shafer Theory (DST). A "spectral-temporal features based voiced segment detection" and "energy and zero cross rate based unvoiced segment detection" are built to remove the silence segments in the speech signal. Comprehensive experiments have been performed on large speech datasets and consequently satisfactory results have been achieved with the proposed approach. Our approach improves the existing speaker dependent STD approaches, specifically the reliability of query utterance spotting by combining the evidences from multiple belief sources.

Keywords: Spoken term detection, Acoustic keyword spotting, Query-by-example, Dempster-Shafer's theory, Speech recognition, Speech processing.

Acknowledgement: A special gratitude we give to Prof. Daniel Neagu, University of Bradford, whose contribution in stimulating suggestions helped us to coordinate this research in terms of statistical analysis, and performance evaluation methods.
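The evidence combination named in this abstract rests on Dempster's rule of combination, which can be sketched in general form. This is an illustration of the standard rule only, not the paper's system: the function name, the two-element frame {"match", "mismatch"}, and the mass values are hypothetical.

```python
def combine_dempster(m1, m2):
    """Combine two mass functions with Dempster's rule.

    m1, m2: dicts mapping frozenset hypotheses to masses that sum to 1.
    Mass falling on empty intersections is treated as conflict and
    normalized out.
    """
    combined = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            intersection = b & c
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c
    if 1.0 - conflict <= 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {h: v / (1.0 - conflict) for h, v in combined.items()}


# Hypothetical example: two belief sources judging whether a query matches.
MATCH = frozenset({"match"})
MISMATCH = frozenset({"mismatch"})
EITHER = MATCH | MISMATCH  # mass here expresses ignorance

source_a = {MATCH: 0.6, EITHER: 0.4}
source_b = {MATCH: 0.7, EITHER: 0.3}
fused = combine_dempster(source_a, source_b)
```

In this example the two sources do not conflict, so the fused mass on "match" is higher than either source assigned alone, which is the behavior that motivates combining several belief sources.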
