Abstract. Referring expressions usually describe an object using properties of the object and its relationships with other objects. We propose a technique that integrates context between objects to understand referring expressions. Our approach uses an LSTM to learn the probability of a referring expression, with input features from a region and a context region. The context regions are discovered using multiple-instance learning (MIL), since annotations for context objects are generally not available for training. We use max-margin-based MIL objective functions to train the LSTM. Experiments on the Google RefExp and UNC RefExp datasets show that modeling context between objects provides better performance than modeling only object properties. We also show qualitatively that our technique can ground a referring expression to its referred region along with the supporting context region.
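As a rough illustration of the max-margin MIL training idea described in the abstract, the sketch below (our own, in PyTorch; `score_expression` and the candidate context set are hypothetical stand-ins for the paper's LSTM scorer and region proposals) treats the context region as a latent variable and takes the maximum score over the bag of candidate contexts.

```python
# Minimal sketch (not the authors' code) of a max-margin MIL objective for
# referring-expression scoring. `score_expression(expr, region, context)` is
# assumed to return the LSTM log-probability of the expression given a
# candidate region and one candidate context region.
import torch

def mil_max_margin_loss(score_expression, expr, true_region, neg_region,
                        context_candidates, margin=1.0):
    """Max-margin MIL loss: the context region is latent, so we take the
    max score over the bag of candidate context regions."""
    # Best-supported score for the true (referred) region.
    pos = torch.stack([score_expression(expr, true_region, c)
                       for c in context_candidates]).max()
    # Best-supported score for a negative (non-referred) region.
    neg = torch.stack([score_expression(expr, neg_region, c)
                       for c in context_candidates]).max()
    # Hinge loss: the true region should outscore the negative by a margin.
    return torch.clamp(margin - pos + neg, min=0.0)
```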
In this paper we investigate a time delay neural network (TDNN) for a keyword spotting task that requires low CPU, memory, and latency. The TDNN is trained with transfer learning and multi-task learning. Temporal subsampling enabled by the time delay architecture reduces computational complexity. We propose applying singular value decomposition (SVD) to further reduce TDNN complexity. This allows us to first train a larger full-rank TDNN model that is not limited by CPU/memory constraints; the larger TDNN usually achieves better performance. Its size can then be compressed by SVD to meet the budget requirements. Hidden Markov models (HMMs) are used in conjunction with the networks to perform keyword detection, and performance is measured in terms of area under the curve (AUC) for detection error tradeoff (DET) curves. Our experimental results on a large in-house far-field corpus show that the full-rank TDNN achieves a 19.7% DET AUC reduction compared to a similar-size deep neural network (DNN) baseline. If we first train a larger full-rank TDNN and then reduce it via SVD to a size comparable to the DNN, we obtain a 37.6% reduction in DET AUC compared to the DNN baseline.
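The SVD compression step can be illustrated with a short sketch (ours, not the paper's implementation): a weight matrix of a trained full-rank layer is factored and truncated to rank r, replacing one m x n layer with two layers of sizes m x r and r x n.

```python
# Minimal sketch, under our own assumptions, of SVD-based compression of one
# fully connected layer: W (m x n) is approximated by A @ B with rank r,
# reducing the parameter count from m*n to r*(m + n).
import numpy as np

def svd_compress(W, rank):
    """Factor W into A (m x rank) and B (rank x n) with W ~= A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

# Toy example: compress a 512x512 layer to rank 64 (~4x fewer parameters).
W = np.random.randn(512, 512).astype(np.float32)
A, B = svd_compress(W, rank=64)
print(W.size, A.size + B.size)   # 262144 vs. 65536
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

In practice the two factors would replace the original layer and the compressed network would typically be fine-tuned afterwards.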
In this paper we present two approaches to improve the computational efficiency of a keyword spotting system running on a resource-constrained device. This embedded keyword spotting system detects a pre-specified keyword in real time at low CPU and memory cost. Our system is a two-stage cascade. The first stage extracts keyword hypotheses from input audio streams. After the first stage is triggered, hand-crafted features are extracted from the keyword hypothesis and fed to a support vector machine (SVM) classifier in the second stage. This paper focuses on improving the computational efficiency of the second-stage SVM classifier. More specifically, we select a subset of feature dimensions and merge support vectors to reduce the SVM classifier to a smaller size, while maintaining keyword spotting performance. Experimental results indicate that we can remove more than 36% of the non-discriminative SVM features and reduce the number of support vectors by more than 60% without significant performance degradation. This results in more than 15% relative reduction in CPU utilization.
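A hedged sketch of the second-stage reduction (using scikit-learn and synthetic data; the per-dimension selection score and the clustering-based merging rule below are placeholders, not the paper's criteria) shows the general idea: drop non-discriminative feature dimensions, then merge support vectors into fewer prototypes so the classifier evaluates fewer kernel terms per keyword hypothesis.

```python
# Illustrative sketch only: (1) keep the more discriminative feature
# dimensions, (2) merge support vectors by clustering them into prototypes
# and retraining a smaller SVM on the reduced feature set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X = np.random.randn(1000, 100)                 # toy hand-crafted features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy labels

svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

# (1) Simple class-separation score per dimension; keep 64 of 100 dims.
score = np.abs(X[y == 1].mean(0) - X[y == 0].mean(0)) / (X.std(0) + 1e-8)
keep = np.argsort(score)[-64:]

# (2) Cluster support vectors per class into fewer prototypes (~60% fewer).
sv = svm.support_vectors_[:, keep]
sv_labels = y[svm.support_]
prototypes, proto_labels = [], []
for c in (0, 1):
    k = max(1, int(0.4 * (sv_labels == c).sum()))
    km = KMeans(n_clusters=k, n_init=10).fit(sv[sv_labels == c])
    prototypes.append(km.cluster_centers_)
    proto_labels.append(np.full(k, c))

small_svm = SVC(kernel="rbf", gamma="scale").fit(
    np.vstack(prototypes), np.concatenate(proto_labels))
```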