The quality of speech codecs deteriorates at low bitrates due to high quantization noise. A post-filter is generally employed to enhance the quality of the coded speech. In this paper, a data-driven postfilter relying on masking in the time-frequency domain is proposed. A fully connected neural network (FCNN), a convolutional encoderdecoder (CED) network and a long short-term memory (LSTM) network are implemeted to estimate a real-valued mask per timefrequency bin. The proposed models were tested on the five lowest operating modes (6.65 kbps-15.85 kbps) of the Adaptive Multi-Rate Wideband codec (AMR-WB). Both objective and subjective evaluations confirm the enhancement of the coded speech and also show the superiority of the mask-based neural network system over a conventional heuristic post-filter used in the standard like ITU-T G.718.
Speech and audio codecs model the overall shape of the signal spectrum using envelope models. In speech coding the predominant approach is linear predictive coding, which offers high coding efficiency at the cost of computational complexity and a rigid systems design. Audio codecs are usually based on scale factor bands, whose calculation and coding is simple, but whose coding efficiency is lower than that of linear prediction. In the current work we propose an entropy coding approach for scale factor bands, with the objective of reaching the same coding efficiency as linear prediction, but simultaneously retaining a low computational complexity. The proposed method is based on quantizing the distribution of spectral mass using betadistributions. Our experiments show that the perceptual quality achieved with the proposed method is similar to that of linear predictive models with the same bit rate, while the design simultaneously allows variable bit-rate coding and can easily be scaled to different sampling rates. The algorithmic complexity of the proposed method is less than one third of traditional multi-stage vector quantization of linear predictive envelopes.
Spectral envelope modelling is a central part of speech and audio codecs and is traditionally based on either vector quantization or scalar quantization followed by entropy coding. To bridge the coding performance of vector quantization with the low complexity of the scalar case, we propose an iterative approach for entropy coding the spectral envelope parameters. For each parameter, a univariate probability distribution is derived from a Gaussian mixture model of the joint distribution and the previously quantized parameters used as a-priori information. Parameters are then iteratively and individually scalar quantized and entropy coded. Unlike vector quantization, the complexity of proposed method does not increase exponentially with dimension and bitrate. Moreover, the coding resolution and dimension can be adaptively modified without retraining the model. Experimental results show that these important advantages do not impair coding efficiency compared to a state-of-art vector quantization scheme.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.