We describe a new approach to speech recognition in which all Hidden Markov Model (HMM) states share the same Gaussian Mixture Model (GMM) structure, with the same number of Gaussians in each state. The model is defined by a low-dimensional vector (of, say, dimension 50) associated with each state, together with a global mapping from this vector space to the space of GMM parameters. This model appears to give better results than a conventional model, and the extra structure offers many new opportunities for modeling innovations while maintaining compatibility with most standard techniques.
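The parameterization described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: all dimensions, variable names, and the use of a softmax for mixture weights are assumptions made for the sketch. Each state owns only a small vector; the globally shared projections expand it into full GMM means and weights.

```python
import numpy as np

# Hedged sketch of a subspace-GMM-style parameterization.
# Dimensions and names are illustrative, not taken from the paper.
rng = np.random.default_rng(0)

S = 40   # subspace ("state vector") dimension, e.g. ~50 in the abstract
D = 13   # acoustic feature dimension
I = 8    # number of Gaussians shared by every state
J = 3    # number of HMM states

# Globally shared parameters: one projection and one weight vector per Gaussian.
M = rng.standard_normal((I, D, S))   # maps state vectors to Gaussian means
w = rng.standard_normal((I, S))      # maps state vectors to mixture-weight logits

# Per-state parameters: just one low-dimensional vector per state.
v = rng.standard_normal((J, S))

def state_gmm_params(v_j):
    """Derive a state's GMM means and mixture weights from its subspace vector."""
    means = M @ v_j                        # shape (I, D): mean_i = M_i @ v_j
    logits = w @ v_j                       # shape (I,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()               # softmax -> valid mixture weights
    return means, weights

means, weights = state_gmm_params(v[0])
assert means.shape == (I, D)
assert np.isclose(weights.sum(), 1.0)
```

Note how the per-state cost is only `S` numbers, while the shared projections `M` and `w` carry most of the parameters; this is also what makes cross-lingual sharing of the subspace (as in the next abstract) possible.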
Although research has previously been done on multilingual speech recognition, it has proved very difficult to improve over separately trained systems. The usual approach has been to use some kind of "universal phone set" covering multiple languages. We report experiments on a different approach to multilingual speech recognition, in which the phone sets are entirely distinct but the parameters not tied to specific states are shared across languages. We use a model called the "Subspace Gaussian Mixture Model," in which each state's distribution is a Gaussian Mixture Model with a common structure, constrained to lie in a subspace of the total parameter space. The parameters that define this subspace can be shared across languages. We obtain substantial word error rate (WER) improvements with this approach, especially with very small amounts of in-language training data.
The growing requirements for broadcasting and streaming of high-quality video continue to drive demand for codecs with higher compression efficiency. AV1 is the most recent open, royalty-free video coding specification, developed by the Alliance for Open Media (AOMedia) with the declared ambition of becoming the most popular next-generation video coding standard. The primary alternatives to AV1 are VP9 and HEVC/H.265, which are currently among the most popular and widespread video codecs. VP9 is, like AV1, a royalty-free open specification, while HEVC/H.265 requires specific licensing terms for use in commercial products and services. In this paper, we compare AV1 to VP9 and HEVC/H.265 from a rate-distortion point of view in a broadcasting use-case scenario. The comparison is performed by means of subjective evaluations carried out in a controlled environment using HD video content, with typical bitrates ranging from low to high, corresponding to quality from very low up to completely transparent. We then proceed with an in-depth analysis of the advantages and drawbacks of each codec for specific types of content, and compare our subjective results and conclusions to those reported in the state of the art, as well as to those measured by objective metrics such as PSNR.
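For reference, the PSNR metric mentioned above is straightforward to compute between a reference frame and its decoded counterpart. A minimal sketch for 8-bit frames (function and variable names are illustrative):

```python
import numpy as np

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = reference.astype(np.float64)
    dist = distorted.astype(np.float64)
    mse = np.mean((ref - dist) ** 2)
    if mse == 0:
        return float("inf")  # identical images: distortion is zero
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy usage: an 8-bit frame and a lightly perturbed copy standing in for
# an encoded/decoded frame.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(frame.astype(int) + rng.integers(-5, 6, size=frame.shape),
                0, 255).astype(np.uint8)
print(round(psnr(frame, noisy), 1))
```

In a codec comparison, this value would be averaged over the frames of each decoded sequence at each target bitrate to trace out the rate-distortion curve.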
Recently, deep-learning-based image compression has made rapid advances, with promising results according to objective quality metrics. However, rigorous subjective quality evaluation of such compression schemes has rarely been reported. This paper presents a perceptual quality study of learned compression. First, we build a general learned compression approach and optimize the model. In total, six compression algorithms are considered in this study. We then perform subjective quality tests in a controlled environment using high-resolution images. Results demonstrate that learned compression optimized for MS-SSIM yields competitive results that approach the efficiency of state-of-the-art compression. The results obtained can provide a useful benchmark for future developments in learned image compression.

Index Terms—Subjective and objective quality evaluation, learned image compression, compression standards.
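Optimizing a learned codec "for MS-SSIM," as described above, typically means training with a rate-distortion Lagrangian where the distortion term is derived from the structural similarity score. The sketch below is a simplification under stated assumptions: it uses a single-scale, whole-image SSIM as a stand-in for the full multi-scale MS-SSIM, and all function names and the trade-off weight `lam` are illustrative.

```python
import numpy as np

def global_ssim(x, y, peak=1.0):
    """Single-scale SSIM computed over the whole image (simplified stand-in
    for MS-SSIM), using the usual default constants."""
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def rd_loss(rate_bpp, original, reconstruction, lam=0.1):
    """Rate-distortion Lagrangian R + lambda * D, with D = 1 - SSIM."""
    return rate_bpp + lam * (1.0 - global_ssim(original, reconstruction))

rng = np.random.default_rng(0)
img = rng.random((32, 32))
assert np.isclose(global_ssim(img, img), 1.0)  # perfect reconstruction
```

During training, `rate_bpp` would come from the codec's entropy model and the loss would be minimized by gradient descent; varying `lam` traces out different operating points on the rate-distortion curve.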
Preparing a lexicon for a speech recognition system can be a significant effort in languages whose written form is not exactly phonetic. On the other hand, in languages where the written form is quite phonetic, some common words are often mispronounced. In this paper, we use a combination of lexicon learning techniques to explore whether a lexicon can be learned when only a small lexicon is available for bootstrapping. We find that for a phonetic language such as Spanish, it is possible to learn pronunciations that outperform those produced by generic rules or hand-crafted pronunciations. For a more complex language such as English, we find that it is still possible, but with some loss of accuracy.
Privacy protection is drawing increased attention with advances in image processing and in visual and social media. Photo sharing is a popular activity, which also raises the concern of regulating permissions associated with shared content. This paper presents a method for protecting user privacy in omnidirectional media by removing parts of the content selected by the user, in a reversible manner. Object removal is carried out using three different state-of-the-art inpainting methods, applied over a mask drawn in the viewport domain so that geometric distortions are minimized. The perceived quality of the scene is assessed via subjective tests, comparing the proposed method against inpainting applied directly to the equirectangular image. Results on distinct contents indicate that our viewport-based object removal method enhances perceived quality, thereby improving privacy protection, as the user is able to hide objects with less distortion in the overall image.