Abstract: This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Unlike conventional BWE methods, which predict spectral parameters for reconstructing wideband speech waveforms, this method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency wavefor…
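Sample-level autoregressive models such as SampleRNN typically operate on waveforms quantized to a small discrete alphabet, predicting a categorical distribution over the next sample. As an illustration of that shared preprocessing step only (not the HRNN architecture itself), a μ-law encode/decode pair can be sketched as:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Quantize waveform samples in [-1, 1] to mu+1 discrete levels,
    the companding step used by sample-level autoregressive models."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.rint((y + 1) / 2 * mu).astype(int)

def mu_law_decode(q, mu=255):
    """Map discrete levels back to approximate waveform amplitudes."""
    y = 2 * q.astype(float) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

# round trip: the error is bounded by the (non-uniform) quantization step
x = np.linspace(-1.0, 1.0, 101)
x_hat = mu_law_decode(mu_law_encode(x))
```

The companding concentrates quantization levels near zero, where speech amplitudes are most common, which is why 8-bit μ-law suffices for these models.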
“…Feng et al [6] used FFTNet [26] which resembles the classical FFT process. Ling et al [27] proposed a hierarchical RNN to utilize the waveform structures. Several other efforts incorporated time-frequency information while still operating in the time domain.…”
Speech generation and enhancement have seen recent breakthroughs in quality thanks to deep learning. These methods typically operate at a limited sampling rate of 16-22 kHz due to computational complexity and available datasets, which imposes a gap between their output and that of high-fidelity (≥44 kHz) real-world audio applications. This paper proposes a new bandwidth extension (BWE) method that expands 8-16 kHz speech signals to 48 kHz. The method is based on a feed-forward WaveNet architecture trained with a GAN-based deep feature loss. A mean-opinion-score (MOS) experiment shows a significant improvement in quality over state-of-the-art BWE methods, and an AB test reveals that our 16-to-48 kHz BWE achieves fidelity that is typically indistinguishable from real high-fidelity recordings. We use our method to enhance the output of recent speech generation and denoising methods, and experiments demonstrate significant improvement in sound quality over these baselines. We propose this as a general approach to narrowing the gap between generated speech and recorded speech, without the need to adapt such methods to higher sampling rates.
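For intuition on what a BWE model must supply: naively upsampling a 16 kHz signal to 48 kHz (here via ideal sinc interpolation, implemented as spectral zero-padding) preserves the low band but leaves everything above 8 kHz empty; the learned model's job is to fill that missing high band. A minimal sketch of the naive baseline, not of the paper's architecture:

```python
import numpy as np

def naive_upsample(x, factor):
    """Upsample by zero-padding the spectrum (ideal sinc interpolation).

    The output contains the same low-band content but an empty high
    band -- exactly the gap a learned BWE model is trained to fill.
    """
    n = len(x)
    X = np.fft.rfft(x)
    X_up = np.zeros(n * factor // 2 + 1, dtype=complex)
    X_up[: len(X)] = X
    # irfft normalizes by output length, so rescale by the factor
    return np.fft.irfft(X_up, n=n * factor) * factor

# a 1 kHz tone sampled at 16 kHz for 10 ms, upsampled to 48 kHz
sr = 16000
t = np.arange(int(sr * 0.01)) / sr
x = np.sin(2 * np.pi * 1000 * t)
y = naive_upsample(x, 3)
```

For a band-limited, frame-periodic input like this tone, every third output sample of `y` coincides with the original `x`, confirming the low band is untouched.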
“…Different approaches for extending the excitation signal are presented in [2], [3]. Different techniques for estimating the wideband (WB) spectral envelope are presented in [3][4][5][6][7]. However, traditional artificial bandwidth extension methods struggle to reconstruct WB speech with high quality under all conditions [8].…”
The limited narrowband frequency range, about 300-3400 Hz, used in telephone network channels results in less intelligible, poor-quality telephony speech. To address this drawback, a novel robust speech bandwidth extension using Discrete Wavelet Transform-Discrete Cosine Transform (DWT-DCT) based data hiding is proposed. In this technique, the missing speech information is embedded in the narrowband speech signal and reliably recovered at the receiver end to generate wideband speech of considerably better quality. The robustness of the proposed method to quantization and channel noise is confirmed by a mean-square-error test, and the improvement in the quality of the reconstructed wideband speech over conventional methods is confirmed by subjective listening and objective tests.
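The data-hiding idea can be sketched in miniature: transform the narrowband host, add a scaled payload to the detail coefficients, and invert the transform. The sketch below uses a one-level Haar DWT only (no DCT stage) and assumes non-blind extraction with access to the original host; both are simplifications of the proposed scheme, and all parameter values are illustrative:

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: approximation and detail coefficients."""
    x = x[: len(x) // 2 * 2]
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_idwt(a, d):
    """Inverse of haar_dwt: perfect reconstruction of the host."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def embed(host, payload, alpha=0.01):
    """Hide a scaled-down payload in the detail band of the host."""
    a, d = haar_dwt(host)
    return haar_idwt(a, d + alpha * payload[: len(d)])

def extract(marked, host, alpha=0.01):
    """Non-blind recovery: subtract the host's detail band and rescale."""
    _, d_m = haar_dwt(marked)
    _, d_h = haar_dwt(host)
    return (d_m - d_h) / alpha

rng = np.random.default_rng(0)
host = rng.standard_normal(64)       # stand-in for narrowband speech
payload = rng.standard_normal(32)    # stand-in for high-band information
recovered = extract(embed(host, payload), host)
```

A small `alpha` keeps the embedded payload perceptually insignificant in the host signal while still allowing exact arithmetic recovery in this noiseless setting.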
“…In [18] [19], Recurrent Neural Networks (RNNs) were introduced into the structure of the MPC because they can capture the system dynamics and provide long-range predictions [20]. It is well known that RNNs have issues with vanishing and exploding gradients, which can make their training difficult; therefore, we propose to use a special form of RNN, the Long Short-Term Memory (LSTM).…”
Reverse Osmosis (RO) desalination plants are highly nonlinear multi-input multi-output systems affected by uncertainties, constraints, and physical phenomena such as membrane fouling that are mathematically difficult to describe. Such systems require effective control strategies that take these effects into account; one such strategy is nonlinear model predictive control (NMPC). However, NMPC depends heavily on the accuracy of the internal model used for prediction in order to maintain feasible operating conditions of the RO desalination plant. Recurrent Neural Networks (RNNs), especially the Long Short-Term Memory (LSTM), can capture complex nonlinear dynamic behavior and provide long-range predictions even in the presence of disturbances. This paper therefore presents an NMPC for an RO desalination plant that uses an LSTM as the predictive model. The controller is tested on the task of maintaining a given permeate flow rate while keeping the permeate concentration under a certain limit by manipulating the feed pressure. Results show good performance of the system.
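The control loop described here can be sketched as a receding-horizon search over candidate feed pressures. The trained LSTM plant model is replaced below by a toy linear response with made-up coefficients, so everything in this sketch is an illustrative assumption rather than the paper's model:

```python
import numpy as np

def predict(flow, conc, pressure):
    """Stand-in predictive model (the paper uses a trained LSTM here):
    a toy linear response of permeate flow and concentration to feed
    pressure. All coefficients are illustrative, not plant data."""
    next_flow = 0.9 * flow + 0.5 * pressure
    next_conc = 0.95 * conc - 0.2 * pressure + 1.0
    return next_flow, next_conc

def nmpc_step(flow, conc, target_flow, conc_limit,
              candidates=np.linspace(0.0, 10.0, 101), horizon=5):
    """Receding-horizon step: pick the constant feed pressure that best
    tracks the flow setpoint while penalizing concentration violations."""
    best_p, best_cost = candidates[0], np.inf
    for p in candidates:
        f, c, cost = flow, conc, 0.0
        for _ in range(horizon):
            f, c = predict(f, c, p)
            cost += (f - target_flow) ** 2
            if c > conc_limit:  # soft penalty on the constraint
                cost += 1e3 * (c - conc_limit) ** 2
        if cost < best_cost:
            best_p, best_cost = p, cost
    return best_p

# one closed-loop step: apply only the first move, then re-optimize
p = nmpc_step(flow=4.0, conc=8.0, target_flow=5.0, conc_limit=20.0)
```

In a real NMPC loop only the first optimized input is applied before the horizon is re-solved at the next sample, which is what gives the scheme its robustness to model error and disturbances.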