Abstract: This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. Current state-of-the-art approaches to this problem rely on dynamic-programming-based template matching using phone posterior features extracted at the output of a deep neural network (DNN). Previously, it has been shown that the space of phone posteriors is highly structured, as a union of low-dimensional subspaces. To exploit the temporal and sparse structure of the spee…
“…This observation is opposite to the results previously obtained on a clean (simple) database [27], where incorporating more examples of the query was found to be more effective for the sparse method than for the baseline DTW system. This issue can be attributed to the large variability and overlap present in the utterances of the AMI corpus.…”
Section: QbE-STD Performance (contrasting)
confidence: 99%
“…As an alternative to the average reconstruction error over the background phone dictionaries, their minimum can also be used as the background score [16, 27]. However, we found that the average score yields better detection performance.…”
Section: Subspace Modeling and Detection (mentioning)
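The two scoring strategies contrasted in the snippet above can be sketched as follows. This is a minimal illustration: the error values are hypothetical and `background_score` is an illustrative helper, not code from the paper.

```python
import numpy as np

def background_score(errors, mode="average"):
    """Combine per-phone-dictionary reconstruction errors into a single
    background score. The snippet above reports that averaging yielded
    better detection performance than taking the minimum [16, 27]."""
    errors = np.asarray(errors, dtype=float)
    if mode == "average":
        return errors.mean()
    elif mode == "minimum":
        return errors.min()
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical reconstruction errors of one frame against three phone dictionaries
errs = [0.42, 0.10, 0.35]
print(round(background_score(errs, "average"), 2))  # 0.29
print(background_score(errs, "minimum"))            # 0.1
```

The minimum is dominated by the single best-fitting phone dictionary, whereas the average reflects how well the background model as a whole explains the frame, which the authors found more robust.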
We cast the query by example spoken term detection (QbE-STD) problem as subspace detection, where query and background subspaces are modeled as a union of low-dimensional subspaces. The speech exemplars used for subspace modeling are class-conditional posterior probabilities estimated using a deep neural network (DNN). The query and background training exemplars are exploited to model the underlying low-dimensional subspaces through dictionary learning for sparse representation. Given the dictionaries characterizing the query and background subspaces, QbE-STD is performed based on the ratio of the two corresponding sparse representation reconstruction errors. The proposed subspace detection method can be formulated as the generalized likelihood ratio test for composite hypothesis testing. The experimental evaluation demonstrates that the proposed method is able to detect the query given a single example and performs significantly better than a highly competitive QbE-STD baseline system based on dynamic time warping (DTW) for exemplar matching.
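The reconstruction-error-ratio test described in the abstract can be sketched as below. As a simplification, plain least squares stands in for the paper's sparse coding step, and the dictionaries and test frame are randomly generated stand-ins:

```python
import numpy as np

def recon_error(D, x):
    """Reconstruction error of x over the span of dictionary D
    (least-squares stand-in for the sparse coding step in the paper)."""
    a, *_ = np.linalg.lstsq(D, x, rcond=None)
    return np.linalg.norm(x - D @ a)

def detection_score(D_query, D_background, x):
    """Likelihood-ratio-style score: large when the query dictionary
    reconstructs x much better than the background dictionary."""
    eps = 1e-12  # guard against division by zero
    return recon_error(D_background, x) / (recon_error(D_query, x) + eps)

rng = np.random.default_rng(0)
D_q = rng.random((10, 4))   # hypothetical query dictionary atoms
D_b = rng.random((10, 4))   # hypothetical background dictionary atoms
x = D_q @ rng.random(4)     # frame lying in the query subspace
print(detection_score(D_q, D_b, x) > 1.0)  # True: query dictionary fits better
```

Thresholding this ratio corresponds to the generalized likelihood ratio test mentioned in the abstract: a frame is attributed to the query hypothesis when the query subspace explains it substantially better than the background subspace.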
“…Recent exemplar-based speech processing offers high flexibility in speech applications, partly attributed to the lack of complex statistical assumptions, which facilitates exploiting the "data deluge" with no prejudice on expected answers. Deep neural network (DNN) based class-conditional posterior probabilities (hereafter referred to as posteriors) have been found to be one of the best speech representations for enabling exemplar-based speech recognition [4] and spoken query detection [5, 6, 7]. In theory, if an infinite number of exemplars of the continuous probability density functions is provided, a simple nearest-neighbor rule leads to optimal classification [8].…”
Section: State-of-the-art Solutions and Challenges (mentioning)
confidence: 99%
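The nearest-neighbor rule referenced above [8] is straightforward to illustrate on posterior exemplars. The exemplars, labels, and phone names below are hypothetical:

```python
import numpy as np

def nearest_neighbor_label(exemplars, labels, x):
    """1-NN rule: assign x the label of its closest exemplar. With an
    unbounded supply of exemplars, this rule approaches optimal
    classification error, which motivates exemplar-based approaches."""
    d = np.linalg.norm(exemplars - x, axis=1)
    return labels[int(np.argmin(d))]

# Tiny hypothetical 3-class posterior exemplars (each row sums to 1)
E = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.7, 0.2],
              [0.1, 0.2, 0.7]])
y = ["aa", "iy", "uw"]
print(nearest_neighbor_label(E, y, np.array([0.75, 0.15, 0.10])))  # aa
```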
“…In addition, the low-dimensional subspaces can be modeled through dictionary learning for sparse coding to enable unsupervised adaptation and enhanced acoustic modeling for speech recognition [10, 12]. Sparse subspace modeling of the posterior exemplars has also been found promising for query-by-example spoken term detection (QbE-STD) [7, 11, 13].…”
Section: State-of-the-art Solutions and Challenges (mentioning)
confidence: 99%
“…We use max-sum dynamic programming to obtain a region of occurrence and the corresponding area under the curve is used as the score for query detection [7]. This procedure is illustrated in Fig.…”
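The max-sum dynamic programming step mentioned above amounts to finding the contiguous run of frame-level scores with the largest sum (a Kadane-style scan), whose sum then serves as the detection score. A minimal sketch, with hypothetical per-frame scores:

```python
def best_region(frame_scores):
    """Max-sum dynamic programming: return the contiguous region of
    frame-level scores with the largest sum, and that sum, which acts
    as the area-under-the-curve detection score."""
    best_sum, best_span = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(frame_scores):
        if cur_sum <= 0:          # restart the candidate region here
            cur_sum, cur_start = s, i
        else:                     # extend the current candidate region
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, i)
    return best_span, best_sum

# Hypothetical per-frame query/background log-ratio scores
scores = [-1.0, 0.5, 2.0, 1.5, -0.5, 1.0, -3.0, 0.2]
print(best_region(scores))  # ((1, 5), 4.5)
```

The returned span marks the hypothesized region of occurrence of the query within the test utterance.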
State-of-the-art query by example spoken term detection (QbE-STD) systems in zero-resource conditions rely on representing speech as sequences of class-conditional posterior probabilities estimated by a deep neural network (DNN). The posteriors are often used for pattern matching or dynamic time warping (DTW). Exploiting posterior probabilities as a speech representation offers diverse advantages in a classification system. One key property of the posterior representations is that they admit a highly effective hashing strategy that enables indexing a large audio archive in divisions for reducing the search complexity. Moreover, posterior indexing leads to a compressed representation and enables pronunciation dewarping and partial detection with no need for DTW. We exploit these characteristics of the posterior space in the context of redundant hash addressing for query-by-example spoken term detection (QbE-STD). We evaluate the QbE-STD system on the AMI corpus and demonstrate that tremendous speedup and superior accuracy are achieved compared to the state-of-the-art pattern matching solution based on DTW. The system has the potential to enable massively large-scale spoken query detection.
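The hashing idea in this abstract can be illustrated with a much-simplified scheme: bucket each frame by the indices of its most probable phone classes, so that a query frame only needs to be compared against frames in the same bucket. This is an assumption-laden sketch, not the paper's actual redundant hash addressing method, and the Dirichlet-sampled archive is synthetic:

```python
import numpy as np
from collections import defaultdict

def posterior_hash(frame, k=2):
    """Hash a posterior frame by the indices of its k most probable
    phone classes (a toy stand-in for redundant hash addressing)."""
    return tuple(sorted(np.argsort(frame)[-k:]))

def build_index(archive):
    """Bucket archive frame positions by posterior hash, so search is
    restricted to one bucket instead of the whole archive."""
    index = defaultdict(list)
    for t, frame in enumerate(archive):
        index[posterior_hash(frame)].append(t)
    return index

rng = np.random.default_rng(1)
archive = rng.dirichlet(np.ones(5), size=100)  # 100 synthetic posterior frames
index = build_index(archive)
query = archive[42]                            # query frame from the archive
print(42 in index[posterior_hash(query)])      # True
```

Because the number of buckets is fixed, lookup cost is independent of archive length, which is the source of the speedup over frame-by-frame DTW matching claimed in the abstract.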