2017
DOI: 10.48550/arxiv.1701.02720
Preprint

Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks

Cited by 45 publications (41 citation statements)
References 16 publications
“…During the last decade, deep neural networks (DNNs) have achieved wide success in automatic speech recognition. Many architectures, such as recurrent (RNN) [34,15,1,31,13], time-delay (TDNN) [39,28], and convolutional neural networks (CNN) [42], have been proposed and have outperformed traditional hidden Markov models (HMMs) combined with Gaussian mixture models (GMMs) on various speech recognition tasks. However, despite this evolution of models and paradigms, the acoustic feature representation has remained almost the same.…”
Section: Introduction
confidence: 99%
“…We designed the XNE around a lean hardware engine focused on executing the feature loops of Listing 1. We execute these as hardwired inner loops, operating in principle on fixed-size input tiles in a fixed number of cycles. A design-time throughput parameter (TP) defines the size of each tile, which is also the number of simultaneous XNOR operations the datapath can execute per cycle; every TP cycles, the accelerator consumes one set of TP binary input pixels and TP sets of TP binary weights to produce one set of TP output pixels.…”
Section: Accelerator Architecture
confidence: 99%
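
To make the TP-wide datapath concrete, here is a minimal C sketch of the XNOR-and-popcount arithmetic the excerpt describes. This is not the XNE's RTL or its Listing 1 (which is not reproduced in this excerpt); TP = 32 and the function names are assumptions made for illustration, with bits encoding +1/−1 values as is conventional for binarized networks.

```c
#include <stdint.h>

/* TP is a design-time throughput parameter; 32 is a hypothetical value. */
#define TP 32

/* One XNOR "multiply" word: with +1/-1 values packed as bits,
 * binary multiplication is XNOR, and popcount counts the +1 products. */
static inline int xnor_popcount(uint32_t in, uint32_t w) {
    return __builtin_popcount(~(in ^ w));
}

/* Hypothetical model of one tile: over TP cycles, consume one set of TP
 * binary input pixels and TP sets of TP binary weights, producing one
 * set of TP (integer) output accumulations, one per cycle. */
void xne_tile(uint32_t in, const uint32_t w[TP], int32_t acc[TP]) {
    for (int cycle = 0; cycle < TP; cycle++) {
        int pc = xnor_popcount(in, w[cycle]);
        acc[cycle] += 2 * pc - TP;  /* map match count to signed +/-1 sum */
    }
}
```

In hardware, each loop iteration corresponds to one cycle of TP parallel XNOR gates feeding a popcount tree, which is why the tile completes in a fixed number of cycles.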
“…Once an output feature vector has been produced by the XNE datapath, it is fully computed and is never used again. With the microcoding strategy proposed in Listing 3, a single input feature vector has to be reloaded fs² times, after which it is completely consumed.…”
Section: Microcode Processor
confidence: 99%
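
The reuse pattern described here can be illustrated with a simple loop nest. The sketch below is not the paper's Listing 3 (not reproduced in this excerpt) but a plain output-stationary convolution, assuming one scalar per feature-map element and a hypothetical filter size FS = 3; it shows why each output is written exactly once while each interior input element is re-read fs² times, once per filter tap.

```c
#include <stdint.h>

#define FS 3  /* hypothetical filter size fs */

/* Output-stationary fs x fs convolution over an H x W feature map.
 * Each output is fully computed, written once, and never read back;
 * each interior input element is reloaded FS*FS times, once for each
 * of the FS*FS output positions it contributes to. */
void conv_fs2_reuse(int H, int W,
                    const int32_t *in, const int32_t *w, int32_t *out)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int32_t acc = 0;
            for (int fy = 0; fy < FS; fy++)
                for (int fx = 0; fx < FS; fx++) {
                    int iy = y + fy - FS / 2;  /* zero-padded borders */
                    int ix = x + fx - FS / 2;
                    if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                        acc += in[iy * W + ix] * w[fy * FS + fx];
                }
            out[y * W + x] = acc;  /* produced once, never used again */
        }
}
```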