Architecture for Low Power Large Vocabulary Speech Recognition

Chandra, D.; Pazhayaveetil, U.; Franzon, Paul D.

doi:10.1109/socc.2006.283836

Cited by 7 publications

(3 citation statements)

References 7 publications

“…We chose MFCC in this paper according to the demand of recognition performance and limit of hardware resource. Storage space and recognition time are increased along with the increase of the characteristic dimension, therefore needing to choose the proper characteristic parameters for the sake of good real-time performance in the embedded recognition system, rather than using 39 dimensions feature composed by MFCC [14], [15], first-order difference, second-order difference, normalized transient energy and differential power. We identified the contribution of feature components to recognition performance, and chose the 27 dimensions feature components with biggest contribution, effectively reducing the SRAM area resources consumption by storage characteristic parameters.…”

Section: Implementation of SoC (mentioning)

confidence: 99%

The Implementation of Chinese and English Bilingual Speech Recognition System-on-Chip

Ding

2013

IJEEEE

This paper presents a high-performance embedded, speaker-independent ("non-specific"), medium-vocabulary Chinese-English bilingual speech recognition system that uses continuous-density hidden Markov models and a two-pass search strategy on a 16-bit fixed-point digital signal processor (DSP). The system uses MFCC parameters as its recognition features. Real-time performance is improved through dedicated hardware: by abstracting the computation-critical paths of the recognition algorithm, we extract a specialized co-processing circuit that greatly enhances overall performance at little chip cost. Experimental results show a recognition rate of 97.6% with a vocabulary of 600 entries. Feature storage space was reduced by 31%, and the real-time rate of the two-stage recognition is 0.7
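The dimension pruning described in the citation statement and the 31% storage figure in the abstract can be related by simple arithmetic. The sketch below is illustrative only: it assumes feature storage scales linearly with the number of dimensions, and the `contribution` scores and the `select_top_dims` helper are hypothetical, since the source does not say how each component's contribution was measured.

```python
def storage_reduction(full_dims: int, kept_dims: int) -> float:
    """Fraction of feature storage saved, assuming storage is linear in dimensions."""
    return 1.0 - kept_dims / full_dims


def select_top_dims(contribution, keep):
    """Indices of the `keep` highest-scoring dimensions, in ascending index order.

    `contribution` is a hypothetical per-dimension score; the paper's actual
    ranking criterion is not given in the quoted text.
    """
    ranked = sorted(range(len(contribution)),
                    key=lambda i: contribution[i], reverse=True)
    return sorted(ranked[:keep])


# Keeping 27 of 39 dimensions saves about 30.8% of the storage.
print(f"{storage_reduction(39, 27):.1%}")  # 30.8%
```

Under that linear-storage assumption, going from 39 to 27 dimensions saves roughly 30.8%, which is in line with the 31% reported in the abstract.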

“…Systems for natural language speech recognition typically utilize three main processing stages (Fig 1) [1]. After the incoming utterance is sampled and digitized in the DSP stage (Phase 1), the generated feature vector enters the Acoustic Modeling stage (Phase 2), where it is compared to a list of senones in the library.…”

Section: Introduction (mentioning)

confidence: 99%

HW/SW architecture for speech recognition acceleration

Fastow, Rosner, Natarajan, et al. 2013

2013 IEEE International Conference on Consumer Electronics (ICCE)

“…A hardware co-processor is proposed in [60] to boost the performance of the GMM computation in Sphinx3. Reducing the size of the mantissa from 23-bits to 15-bits and 12-bits is proposed in [22] to reduce the acoustic model size, providing a compression ratio of 1.39x and 1.6x respectively. The technique is evaluated using a small vocabulary size of 5000 words, whereas we propose a novel clustering technique to achieve 8x reduction in acoustic model size with a 130k words vocabulary size.…”
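The mantissa reduction mentioned in this statement can be sketched as bit masking on IEEE 754 single-precision values. This is a minimal illustration, not the cited design: the function names are hypothetical, and the storage layout used in `compression_ratio` (8 exponent bits and no stored sign bit) is an assumption chosen because it happens to reproduce the quoted 1.39x and 1.6x ratios. The quote itself does not state the storage format.

```python
import struct


def truncate_mantissa(x: float, keep_bits: int) -> float:
    """Zero the low (23 - keep_bits) mantissa bits of x as an IEEE 754 single."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    drop = 23 - keep_bits
    mask = (0xFFFFFFFF >> drop) << drop  # clears the lowest `drop` bits
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y


def compression_ratio(keep_bits: int, exponent_bits: int = 8,
                      sign_bits: int = 0) -> float:
    """Size of a 32-bit float divided by the reduced word size.

    The default of no sign bit is an assumption, not taken from the source.
    """
    return 32 / (sign_bits + exponent_bits + keep_bits)


print(round(compression_ratio(15), 2))  # 1.39
print(compression_ratio(12))            # 1.6
```

If a sign bit were stored as well, the same arithmetic would give 32/24 and 32/21 instead, so the match above should be read as one plausible interpretation only.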

Section: Hardware Solutions (mentioning)

confidence: 99%

Low-power architectures for automatic speech recognition

Tabani

Automatic Speech Recognition (ASR) is one of the most important applications in the area of cognitive computing. Fast and accurate ASR is emerging as a key application for mobile and wearable devices. These devices, such as smartphones, have incorporated speech recognition as one of the main interfaces for user interaction. This trend towards voice-based user interfaces is likely to continue in the coming years, changing the way humans interact with machines. Effective speech recognition systems require real-time recognition, which is challenging for mobile devices because of the compute-intensive nature of the problem and the power constraints of such systems, and which is very demanding for CPU architectures. GPU architectures offer parallelization capabilities that can be exploited to increase the performance of speech recognition systems. However, efficiently utilizing GPU resources for speech recognition is also challenging, as the software implementations exhibit irregular and unpredictable memory accesses and poor temporal locality. The purpose of this thesis is to study the characteristics of ASR systems running on low-power mobile devices in order to propose different techniques to improve performance and energy consumption.

We propose several software-level optimizations driven by the power/performance analysis. Unlike previous proposals that trade accuracy for performance by reducing the number of Gaussians evaluated, we maintain accuracy and improve performance by effectively using the underlying CPU microarchitecture. We use a refactored implementation of the GMM evaluation code to ameliorate the impact of branches. Then, we exploit the vector unit available on most modern CPUs to boost GMM computation, introducing a novel memory layout for storing the means and variances of the Gaussians in order to maximize the effectiveness of vectorization. In addition, we compute the Gaussians for multiple frames in parallel, significantly reducing memory bandwidth usage. Our experimental results show that the proposed optimizations provide a 2.68x speedup over the baseline Pocketsphinx decoder on a high-end Intel Skylake CPU, while achieving 61% energy savings. On a modern ARM Cortex-A57 mobile processor our techniques improve performance by 1.85x, while providing 59% energy savings without any loss in the accuracy of the ASR system.

Secondly, we propose a register renaming technique that exploits register reuse to reduce the pressure on the register file. Our technique leverages physical register sharing by introducing minor changes in the register map table and the issue queue. We evaluated our renaming technique on top of a modern out-of-order processor. The proposed scheme supports precise exceptions, and we show that it results in 9.5% performance improvements for GMM evaluation. Our experimental results show that the proposed register renaming scheme provides a 6% speedup on average for the SPEC2006 benchmarks. Alternatively, our renaming scheme achieves the same performance while reducing the number of physical registers by 10.5%.

Finally, we propose a hardware accelerator for GMM evaluation that reduces energy consumption by three orders of magnitude compared to solutions based on CPUs and GPUs. The proposed accelerator implements a lazy evaluation scheme in which Gaussians are computed on demand, avoiding 50% of the computations. Furthermore, it employs a novel clustering scheme to reduce the size of the GMM parameters, which results in 8x memory bandwidth savings with a negligible impact on accuracy. It also includes a novel memoization scheme that avoids 74.88% of floating-point operations. The end design provides a 164x speedup and 3532x energy reduction when compared with a highly tuned implementation running on a modern mobile CPU.
Compared to a state-of-the-art mobile GPU, the GMM accelerator achieves 5.89x speedup over a highly optimized CUDA implementation, while reducing energy by 241x.
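To make the GMM evaluation and on-demand ("lazy") scoring mentioned in this abstract more concrete, here is a minimal software sketch. It is not the thesis's accelerator or its memoization scheme: it assumes diagonal-covariance Gaussians, and the `LazyScorer` class, its per-(frame, senone) cache, and all names are hypothetical illustrations of the general idea of scoring a senone only when the search asks for it.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples.

    Uses log-sum-exp so that small component likelihoods do not underflow.
    """
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


class LazyScorer:
    """Scores a senone for a frame only on request and caches the result."""

    def __init__(self, gmms):
        self.gmms = gmms          # senone id -> list of (weight, mean, var)
        self.cache = {}           # (frame id, senone id) -> score
        self.evaluations = 0      # number of GMMs actually evaluated

    def score(self, frame_id, frame, senone_id):
        key = (frame_id, senone_id)
        if key not in self.cache:
            self.evaluations += 1
            self.cache[key] = gmm_log_likelihood(frame, self.gmms[senone_id])
        return self.cache[key]
```

With this structure, senones that the search never requests for a given frame are never evaluated, and repeated requests for the same (frame, senone) pair cost a dictionary lookup rather than a fresh GMM evaluation.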
