2013 IEEE International Conference on Acoustics, Speech and Signal Processing
DOI: 10.1109/icassp.2013.6638949

Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets

Abstract: While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training of these networks is slow. One reason is that DNNs are trained with a large number of parameters (i.e., 10-50 million). Because networks are trained with a large number of output targets to achieve good performance, the majority of these parameters are in the final weight layer. In this paper, we propose a low-rank matrix factorization of the final weight layer.…
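
As a rough, hedged illustration of the idea in the abstract (hypothetical layer sizes, not figures from the paper), the Python/NumPy sketch below replaces a full final weight layer of size h x o with the product of two smaller matrices of rank r, which is the kind of low-rank factorization being proposed:

    import numpy as np

    # Hypothetical sizes: h hidden units, o output targets, chosen rank r.
    # These numbers are illustrative only, not the paper's configurations.
    h, o, r = 1024, 10000, 128

    # Full-rank final layer: h * o parameters.
    W_full = 0.01 * np.random.randn(h, o)
    full_params = W_full.size                  # 1024 * 10000 = 10,240,000

    # Low-rank factorization of the final layer: W is approximated by A @ B,
    # with A of size h x r and B of size r x o.
    A = 0.01 * np.random.randn(h, r)
    B = 0.01 * np.random.randn(r, o)
    low_rank_params = A.size + B.size          # 131,072 + 1,280,000 = 1,411,072

    print("reduction: %.1f%%" % (100.0 * (1.0 - low_rank_params / full_params)))

    # Forward pass through the factored layer for a batch of hidden activations.
    x = np.random.randn(32, h)                 # 32 hidden-layer output vectors
    logits = (x @ A) @ B                       # same shape as x @ W_full

Because h * r + r * o is much smaller than h * o when r is small, most of the parameters in the output layer disappear, which is where the training speed-up described in the abstract comes from.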

Cited by 490 publications (279 citation statements)
References 9 publications

“…test ER (%): Gaussian mixture model (GMM) [23] 26.3; SVM [23] 22.4; Hierarchical GMM [22] 21.0; Discriminative hierarchical GMM [24] 16.8; SVM with deep scattering spectrum [25] 15.9; our CNN ensemble 15.0 … the softmax [21]. Each frame is labeled with its segment label and one additional label from a neighboring segment.…”
Section: Other Expensive Featuresmentioning
confidence: 99%
“…On the other hand, all layers can be compressed using Xue et al.'s approach, which accelerates recognition but makes training even more expensive. The value of the linear bottleneck structure as a regularization method has not been identified in [15,16]. Our experiments show that using MN-SGD allows factorization of all layers when training from scratch.…”
Section: Improving Generalization Performancementioning
confidence: 77%
“…Recently, a more sophisticated approach for reducing the size of DNNs has been proposed. Sainath et al. [15] factored the weight matrices into the product of two smaller matrices. This is equivalent to inserting a linear bottleneck between two layers of the network.…”
Section: Improving Generalization Performancementioning
confidence: 99%
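
To make the equivalence described in the excerpt above concrete, here is a minimal sketch (PyTorch, hypothetical dimensions; not code from either cited paper) in which a single large linear layer is replaced by two stacked linear layers with no nonlinearity in between, i.e., a linear bottleneck whose composition is a rank-r weight matrix:

    import torch
    import torch.nn as nn

    # Hypothetical sizes: 1024 hidden units, 10000 output targets, bottleneck r = 128.
    h, o, r = 1024, 10000, 128

    # Original final layer: a single h -> o affine map.
    full_layer = nn.Linear(h, o)

    # Factored version: two linear maps with NO activation between them.
    # Their composition is the product of two small matrices (plus a bias),
    # which is the "linear bottleneck" view of the low-rank factorization.
    factored_layer = nn.Sequential(
        nn.Linear(h, r, bias=False),   # h -> r: the rank-r bottleneck projection
        nn.Linear(r, o),               # r -> o: maps the bottleneck to the output targets
    )

    x = torch.randn(32, h)
    assert full_layer(x).shape == factored_layer(x).shape  # both (32, o)

Because the two factors are learned jointly and no nonlinearity is added between them, only the parameter count of that layer changes; the input and output dimensions of the network stay the same.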
“…A sparse A has computational benefits such as low storage and computational complexity. Consequently, this work could be useful in sparse low-rank matrix factorization, which has numerous applications in machine learning, including learning [7], deep neural networks (deep learning) [8], and autoencoding. This work is also related to optimizing projection matrices, introduced in [9].…”
Section: Introductionmentioning
confidence: 99%