Abstract: It is well known that visual cues of lip movement contain important speech-relevant information. This paper presents an automatic lipreading system for small-vocabulary speech recognition tasks. Using the lip segmentation and modeling techniques we developed earlier, we obtain a visual feature vector composed of outer and inner mouth features from the lip image sequence for recognition. A spline representation is employed to transform the discrete-time sampled features from the video frames into the continuous…
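The abstract above describes converting per-frame visual features into a continuous spline representation. As an illustration only, the following sketch resamples a discrete feature trajectory with a cubic spline; the six-dimensional feature layout and the 25 fps frame rate are assumptions for the example, not details taken from the paper.

```python
# Hypothetical sketch: resampling discrete per-frame lip features with a cubic
# spline, in the spirit of the spline representation the abstract describes.
import numpy as np
from scipy.interpolate import CubicSpline

fps = 25.0                                  # assumed video frame rate
frames = np.arange(10)                      # 10 sampled video frames
t = frames / fps                            # discrete sample times (seconds)
feats = np.random.rand(10, 6)               # stand-in outer/inner mouth features

spline = CubicSpline(t, feats, axis=0)      # one spline per feature dimension

# Evaluate the continuous representation on a denser, uniform time grid,
# e.g. to normalize utterances of different lengths before recognition.
t_dense = np.linspace(t[0], t[-1], 50)
feats_dense = spline(t_dense)               # shape (50, 6)
```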
“…Lip image processing has attracted widespread research interest in recent years for its wide application in automatic visual speech recognition [1][2], visual speaker authentication [3][4][5], lip synchronization for facial animation [6], etc. Lip region segmentation, which is also referred to as lip segmentation, is the first and most crucial step in various lip-related applications [7].…”
Research has shown that the human lip and its movements are a rich source of information about speech content and speaker identity. Lip image segmentation, as a fundamental step in many lipreading and visual speaker authentication systems, is therefore of vital importance. Because of variations in lip color and lighting conditions, and especially the complex appearance of an open mouth, accurate lip region segmentation remains a challenging task. To address this problem, this paper proposes a new fuzzy deep neural network whose architecture integrates fuzzy units with traditional convolutional units. The convolutional units extract discriminative features at different scales to provide comprehensive information for pixel-level lip segmentation. The fuzzy logic modules handle various kinds of uncertainty and provide a more robust segmentation result. An end-to-end training scheme is then used to learn the optimal parameters of both the fuzzy and the convolutional units. A dataset containing more than 48,000 images of various speakers under different lighting conditions was used to evaluate lip segmentation performance. According to the experimental results, the proposed method achieves state-of-the-art performance compared with other algorithms.
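The abstract describes an architecture that fuses fuzzy units with convolutional units and trains both end-to-end. The sketch below shows one plausible reading of that idea in PyTorch; the Gaussian membership functions, the fusion by channel concatenation, and all layer sizes are illustrative assumptions rather than the paper's actual design.

```python
# A minimal sketch of combining learnable fuzzy membership units with ordinary
# convolutional units for pixel-level segmentation. All design choices here
# are assumptions; the paper's architecture may differ.
import torch
import torch.nn as nn

class FuzzyUnit(nn.Module):
    """Maps each input channel to fuzzy membership degrees via learnable
    Gaussian membership functions (one set per channel)."""
    def __init__(self, channels, n_members=3):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(channels, n_members))
        self.log_sigma = nn.Parameter(torch.zeros(channels, n_members))

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = x.unsqueeze(2)                      # (B, C, 1, H, W)
        mu = self.mu.view(1, C, -1, 1, 1)
        sigma = self.log_sigma.exp().view(1, C, -1, 1, 1)
        m = torch.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        return m.flatten(1, 2)                  # (B, C*n_members, H, W)

class FuzzyConvSeg(nn.Module):
    def __init__(self, in_ch=3, width=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU())
        self.fuzzy = FuzzyUnit(in_ch, n_members=3)
        # Fuse convolutional and fuzzy feature maps, predict lip/non-lip.
        self.head = nn.Conv2d(width + in_ch * 3, 1, 1)

    def forward(self, x):
        fused = torch.cat([self.conv(x), self.fuzzy(x)], dim=1)
        return torch.sigmoid(self.head(fused))  # per-pixel lip probability

# Both the fuzzy and convolutional parameters are ordinary tensors, so the
# whole network can be trained end-to-end with, e.g., binary cross-entropy.
model = FuzzyConvSeg()
out = model(torch.rand(1, 3, 64, 64))           # (1, 1, 64, 64)
```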
“…In recent years, image processing techniques have been extensively developed for human lip recognition, which can automatically detect and analyse the unstable shape of human lips and distinguish in real time whether the user is speaking or not. Examples include audiovisual speech recognition (AVSR) [1], visual speech recognition (VSR) [2,3], speaker recognition [4][5][6], intelligent human-computer interaction (IHCI) [7], vision-based voice activity detection (VVAD), etc. Research in the field of speech technology has achieved remarkable results both domestically and internationally.…”
INTRODUCTION: Image processing technology is widely used in lip recognition; it can automatically detect and analyse the unstable shape of human lips. OBJECTIVES: In this paper, we propose a new algorithm using wavelet entropy (WE) and K-nearest neighbor (KNN) to improve the accuracy of lip recognition. METHODS: At present, the two most commonly used techniques are the wavelet transform and the K-nearest neighbor algorithm. The wavelet transform yields a set of image descriptors, and the K-nearest neighbor algorithm has high accuracy. After a large number of experiments, we propose a lip recognition method based on wavelet entropy and K-nearest neighbor, which combines wavelet entropy, K-nearest neighbor and K-fold cross-validation. RESULTS: This method reduces the calculation time and improves the training speed. The best experimental result improves the accuracy to 80.08%. CONCLUSION: Our algorithm is therefore superior to other state-of-the-art approaches to lip recognition.
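A hedged sketch of the pipeline this abstract outlines: wavelet-entropy features classified by K-nearest neighbor and scored with K-fold cross-validation. The wavelet family (db4), the decomposition level, the Shannon-entropy definition, and the neighbor count are illustrative assumptions, not the paper's reported settings.

```python
# Illustrative pipeline: per-sub-band wavelet entropy -> KNN -> K-fold CV.
import numpy as np
import pywt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wavelet_entropy(img, wavelet="db4", level=3):
    """One Shannon-entropy value per wavelet sub-band (assumed definition)."""
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    feats = []
    # Approximation band, then every detail band at every level.
    for c in coeffs[:1] + [d for trip in coeffs[1:] for d in trip]:
        e = np.abs(c).ravel() ** 2
        p = e / (e.sum() + 1e-12)               # normalized energy distribution
        feats.append(-(p * np.log2(p + 1e-12)).sum())
    return np.array(feats)

# Stand-in data: 40 grayscale "lip images" of two classes.
rng = np.random.default_rng(0)
X = np.stack([wavelet_entropy(rng.random((64, 64))) for _ in range(40)])
y = np.repeat([0, 1], 20)

knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10)      # 10-fold cross-validation
print(scores.mean())
```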
“…There have been 23 ALR architectures targeting digit or alphabet recognition since 2007. Looking at Tables 4, 5 and 6 we observe that most traditional systems use feature techniques based on image transforms [108,9,66,109,110] or shape and appearance models [56,111,112,7,113]. In Figure 4 we show i) the number of times that each feature technique has been integrated into ALR systems addressing digit or letter recognition; ii) the same for each classification method.…”
In the last few years, there has been increasing interest in developing systems for Automatic LipReading (ALR). As in other computer vision applications, methods based on Deep Learning (DL) have become very popular and have made it possible to push the achievable performance substantially forward. In this survey, we review ALR research over the last decade, highlighting the progression from approaches that predate DL (which we refer to as traditional) toward end-to-end DL architectures. We provide a comprehensive list of the audiovisual databases available for lipreading, describing the tasks they can be used for, their popularity and their most important characteristics, such as the number of speakers, vocabulary size, recording settings and total duration. In correspondence with the shift toward DL, we show a clear tendency toward large-scale datasets targeting realistic application settings and large numbers of samples per class. We also summarize, discuss and compare the different ALR systems proposed in the last decade, considering traditional and DL approaches separately. We present a quantitative analysis of the different systems by organizing them by target task (e.g. recognition of letters or digits, words or sentences) and comparing their reported performance on the most commonly used datasets. We find that DL architectures perform similarly to traditional ones for simpler tasks but yield significant improvements on more complex tasks, such as word or sentence recognition, with up to 40% improvement in word recognition rates. Hence, we provide a detailed description of the available ALR systems based on end-to-end DL architectures and identify a tendency to focus on the modeling of temporal context as the key to advancing the field. Such modeling is dominated by recurrent neural networks due to their ability to retain context at multiple scales (e.g. short- and long-term information). In this sense, current efforts tend toward techniques that allow more comprehensive modeling and interpretability of the retained context.
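The survey's closing point, that end-to-end DL systems pair a visual frontend with recurrent temporal modeling, can be made concrete with a small sketch. The CNN-plus-bidirectional-LSTM layout and all layer sizes below are generic assumptions for illustration, not any specific system reviewed in the survey.

```python
# Generic end-to-end lipreading pattern: per-frame CNN frontend followed by a
# recurrent backend that models temporal context across the frame sequence.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, n_words=10):
        super().__init__()
        # Per-frame visual frontend (weights shared across time).
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())    # -> (B*T, 32)
        # Recurrent backend retains short- and long-term context.
        self.rnn = nn.LSTM(32, 64, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, n_words)

    def forward(self, x):                             # x: (B, T, 1, H, W)
        B, T = x.shape[:2]
        f = self.frontend(x.flatten(0, 1)).view(B, T, -1)
        h, _ = self.rnn(f)
        return self.classifier(h.mean(dim=1))         # word logits

logits = LipReader()(torch.rand(2, 20, 1, 64, 64))    # (2, 10)
```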