Abstract: It is well known that visual cues of lip movement contain important speech-relevant information. This paper presents an automatic lipreading system for small-vocabulary speech recognition tasks. Using the lip segmentation and modeling techniques we developed earlier, we obtain a visual feature vector composed of outer and inner mouth features from the lip image sequence for recognition. A spline representation is employed to transform the discrete-time sampled features from the video frames into the continuous…
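The abstract above describes converting per-frame visual features into a continuous spline representation. As an illustration only, the following sketch resamples a discrete feature trajectory with a cubic spline; the six-dimensional feature layout and the 25 fps frame rate are assumptions for the example, not details taken from the paper.

```python
# Hypothetical sketch: resampling discrete per-frame lip features with a cubic
# spline, in the spirit of the spline representation the abstract describes.
import numpy as np
from scipy.interpolate import CubicSpline

fps = 25.0                                  # assumed video frame rate
frames = np.arange(10)                      # 10 sampled video frames
t = frames / fps                            # discrete sample times (seconds)
feats = np.random.rand(10, 6)               # stand-in outer/inner mouth features

spline = CubicSpline(t, feats, axis=0)      # one spline per feature dimension

# Evaluate the continuous representation on a denser, uniform time grid,
# e.g. to normalize utterances of different lengths before recognition.
t_dense = np.linspace(t[0], t[-1], 50)
feats_dense = spline(t_dense)               # shape (50, 6)
```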
“…Lip image processing has attracted widespread research interest in recent years for its wide application in automatic visual speech recognition [1][2], visual speaker authentication [3][4][5], lip synchronization for facial animation [6], etc. Lip region segmentation, which is also referred to as lip segmentation, is the first and most crucial step in various lip-related applications [7].…”
Research has shown that the human lip and its movements are a rich source of information about speech content and speaker identity. Lip image segmentation, as a fundamental step in many lipreading and visual speaker authentication systems, is therefore of vital importance. Because of variations in lip color and lighting conditions, and especially the complex appearance of an open mouth, accurate lip region segmentation remains a challenging task. To address this problem, this paper proposes a new fuzzy deep neural network whose architecture integrates fuzzy units with traditional convolutional units. The convolutional units extract discriminative features at different scales to provide comprehensive information for pixel-level lip segmentation. The fuzzy logic modules handle various kinds of uncertainty and provide a more robust segmentation result. An end-to-end training scheme is then used to learn the optimal parameters of both the fuzzy and the convolutional units. A dataset containing more than 48,000 images of various speakers under different lighting conditions was used to evaluate lip segmentation performance. According to the experimental results, the proposed method achieves state-of-the-art performance compared with other algorithms.
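The abstract describes an architecture that fuses fuzzy units with convolutional units and trains both end-to-end. The sketch below shows one plausible reading of that idea in PyTorch; the Gaussian membership functions, the fusion by channel concatenation, and all layer sizes are illustrative assumptions rather than the paper's actual design.

```python
# A minimal sketch of combining learnable fuzzy membership units with ordinary
# convolutional units for pixel-level segmentation. All design choices here
# are assumptions; the paper's architecture may differ.
import torch
import torch.nn as nn

class FuzzyUnit(nn.Module):
    """Maps each input channel to fuzzy membership degrees via learnable
    Gaussian membership functions (one set per channel)."""
    def __init__(self, channels, n_members=3):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(channels, n_members))
        self.log_sigma = nn.Parameter(torch.zeros(channels, n_members))

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = x.unsqueeze(2)                      # (B, C, 1, H, W)
        mu = self.mu.view(1, C, -1, 1, 1)
        sigma = self.log_sigma.exp().view(1, C, -1, 1, 1)
        m = torch.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        return m.flatten(1, 2)                  # (B, C*n_members, H, W)

class FuzzyConvSeg(nn.Module):
    def __init__(self, in_ch=3, width=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU())
        self.fuzzy = FuzzyUnit(in_ch, n_members=3)
        # Fuse convolutional and fuzzy feature maps, predict lip/non-lip.
        self.head = nn.Conv2d(width + in_ch * 3, 1, 1)

    def forward(self, x):
        fused = torch.cat([self.conv(x), self.fuzzy(x)], dim=1)
        return torch.sigmoid(self.head(fused))  # per-pixel lip probability

# Both the fuzzy and convolutional parameters are ordinary tensors, so the
# whole network can be trained end-to-end with, e.g., binary cross-entropy.
model = FuzzyConvSeg()
out = model(torch.rand(1, 3, 64, 64))           # (1, 1, 64, 64)
```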
“…In recent years, image processing techniques have been extensively developed for human lip recognition, which can automatically detect and analyse the unstable shape of human lips and distinguish in real time whether the user is speaking or not. Examples include audiovisual speech recognition (AVSR) [1], visual speech recognition (VSR) [2,3], speaker recognition [4][5][6], intelligent human-computer interaction (IHCI) [7], vision-based voice activity detection (VVAD), etc. Research in the field of speech technology has achieved remarkable results both domestically and internationally.…”
INTRODUCTION: Image processing technology is widely used in lip recognition; it can automatically detect and analyse the unstable shape of human lips. OBJECTIVES: In this paper, we propose a new algorithm using wavelet entropy (WE) and K-nearest neighbor (KNN) to improve the accuracy of lip recognition. METHODS: At present, the two most commonly used techniques are the wavelet transform and the K-nearest neighbor algorithm. The wavelet transform yields a set of image descriptors, and the K-nearest neighbor algorithm has high accuracy. After a large number of experiments, we propose a lip recognition method based on wavelet entropy and K-nearest neighbor, which combines wavelet entropy, K-nearest neighbor and K-fold cross-validation. RESULTS: This method reduces the calculation time and improves the training speed. The best experimental result improves the accuracy to 80.08%. CONCLUSION: Our algorithm is therefore superior to other state-of-the-art approaches to lip recognition.
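A hedged sketch of the pipeline this abstract outlines: wavelet-entropy features classified by K-nearest neighbor and scored with K-fold cross-validation. The wavelet family (db4), the decomposition level, the Shannon-entropy definition, and the neighbor count are illustrative assumptions, not the paper's reported settings.

```python
# Illustrative pipeline: per-sub-band wavelet entropy -> KNN -> K-fold CV.
import numpy as np
import pywt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wavelet_entropy(img, wavelet="db4", level=3):
    """One Shannon-entropy value per wavelet sub-band (assumed definition)."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    feats = []
    # Approximation band, then every detail band at every level.
    for c in coeffs[:1] + [d for trip in coeffs[1:] for d in trip]:
        e = np.abs(c).ravel() ** 2
        p = e / (e.sum() + 1e-12)               # normalized energy distribution
        feats.append(-(p * np.log2(p + 1e-12)).sum())
    return np.array(feats)

# Stand-in data: 40 grayscale "lip images" of two classes.
rng = np.random.default_rng(0)
X = np.stack([wavelet_entropy(rng.random((64, 64))) for _ in range(40)])
y = np.repeat([0, 1], 20)

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10)      # 10-fold cross-validation
print(scores.mean())
```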
“…There have been 23 ALR architectures targeting digit or alphabet recognition since 2007. Looking at Tables 4, 5 and 6 we observe that most traditional systems use feature techniques based on image transforms [108,9,66,109,110] or shape and appearance models [56,111,112,7,113]. In Figure 4 we show i) the number of times that each feature technique has been integrated into ALR systems addressing digit or letter recognition; ii) the same for each classification method.…”
In the last few years, there has been increasing interest in developing systems for Automatic LipReading (ALR). As in other computer vision applications, methods based on Deep Learning (DL) have become very popular and have made it possible to push the achievable performance substantially forward. In this survey, we review ALR research over the last decade, highlighting the progression from approaches that predate DL (which we refer to as traditional) toward end-to-end DL architectures. We provide a comprehensive list of the audiovisual databases available for lipreading, describing the tasks they can be used for, their popularity and their most important characteristics, such as the number of speakers, vocabulary size, recording settings and total duration. In correspondence with the shift toward DL, we show a clear tendency toward large-scale datasets targeting realistic application settings and large numbers of samples per class. We also summarize, discuss and compare the different ALR systems proposed in the last decade, considering traditional and DL approaches separately. We present a quantitative analysis of the different systems by organizing them by target task (e.g. recognition of letters or digits, words or sentences) and comparing their reported performance on the most commonly used datasets. We find that DL architectures perform similarly to traditional ones for simpler tasks but yield significant improvements on more complex tasks, such as word or sentence recognition, with up to 40% improvement in word recognition rates. Hence, we provide a detailed description of the available ALR systems based on end-to-end DL architectures and identify a tendency to focus on the modeling of temporal context as the key to advancing the field. Such modeling is dominated by recurrent neural networks due to their ability to retain context at multiple scales (e.g. short- and long-term information). In this sense, current efforts tend toward techniques that allow more comprehensive modeling and interpretability of the retained context.
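The survey's closing point, that end-to-end DL systems pair a visual frontend with recurrent temporal modeling, can be made concrete with a small sketch. The CNN-plus-bidirectional-LSTM layout and all layer sizes below are generic assumptions for illustration, not any specific system reviewed in the survey.

```python
# Generic end-to-end lipreading pattern: per-frame CNN frontend followed by a
# recurrent backend that models temporal context across the frame sequence.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, n_words=10):
        super().__init__()
        # Per-frame visual frontend (weights shared across time).
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())    # -> (B*T, 32)
        # Recurrent backend retains short- and long-term context.
        self.rnn = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, n_words)

    def forward(self, x):                             # x: (B, T, 1, H, W)
        B, T = x.shape[:2]
        f = self.frontend(x.flatten(0, 1)).view(B, T, -1)
        h, _ = self.rnn(f)
        return self.classifier(h.mean(dim=1))         # word logits

logits = LipReader()(torch.rand(2, 20, 1, 64, 64))    # (2, 10)
```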