A Lip Reading Model Using CNN with Batch Normalization

Bone age is an index used by pediatric radiology and endocrinology departments worldwide to define skeletal maturity for medical and non-medical purposes. In general, the clinical method for bone age assessment (BAA) is based on visually examining the ossification of individual bones in the left hand and comparing it with a standard radiographic atlas of the hand. However, this method depends heavily on the experience and conditions of the forensic expert. This paper proposes a new approach to human bone age estimation based on the carpal bones of the hand, using a residual network architecture. The classification layer was modified with batch normalization to optimize the training process. Before training, we applied image augmentation to make the dataset more varied. The following augmentation techniques were used: resizing; random affine transformation; horizontal flipping; adjusting brightness, contrast, saturation, and hue; and image inversion. The output is a classification of bone age in the range of 1 to 19 years. The VGG16 model yielded an MAE of 5.19 and an R2 of 0.56, while the newly developed ResNeXt50(32x4d) model yielded an MAE of 4.75 and an R2 of 0.63. These results indicate that the proposed modification of the residual model improved classification compared with the VGG16 model.
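The abstract above compares models by MAE (mean absolute error) and R2 (coefficient of determination). As a rough illustration of how these two metrics are computed, here is a minimal sketch in plain Python. The function names and the sample ages are hypothetical and chosen only for the example; they are not taken from the paper, and the paper's own evaluation code may differ.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical bone ages (illustrative values only, not data from the paper).
true_ages = [2, 5, 9, 13, 17]
pred_ages = [3, 5, 8, 15, 16]

mae = mean_absolute_error(true_ages, pred_ages)
r2 = r2_score(true_ages, pred_ages)
print(f"MAE = {mae:.2f}, R2 = {r2:.2f}")
```

A lower MAE and an R2 closer to 1 both indicate predictions that track the targets more closely, which is the sense in which the abstract reports the ResNeXt50 variant as the better model.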

“…Likewise, the derived gradient was used for updating the weights, following the Adam optimizer, as expressed in Eq. (5) [22].…”

Section: F Optimized Hyperparameters (mentioning)

confidence: 99%

Human Bone Age Estimation of Carpal Bone X-Ray Using Residual Network with Batch Normalization Classification

Nabilah

Sigit

Fariza

et al. 2023

JOIV : Int. J. Inform. Visualization
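The citation statement for this entry refers to updating weights from the gradient with the Adam optimizer. The citing paper's Eq. (5) is not reproduced on this page, so the following is only a sketch of the standard Adam update rule for a single scalar weight. The function name `adam_step`, the toy objective, and the hyperparameter values are assumptions made for illustration.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a scalar weight (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy objective (an assumption for the example): f(w) = (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)                          # derivative of the toy objective
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(w)
```

Because of the bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale, which is one reason Adam is often considered less sensitive to gradient scaling than plain gradient descent.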

“…H. Gupta et al [158] proposed a lip-reading model using CNN batch normalization for audio-less video data. The Haar Cascade algorithm is employed to extract the lip region from each individual frontal facial image in the video sequence and combine them into a single image.…”

Section: Sequence Of Video Frames (mentioning)

confidence: 99%

Deep Learning-Based Holistic Speaker Independent Visual Speech Recognition

Nemani

Krishna

Ramisetty

et al. 2023

IEEE Trans. Artif. Intell.

Speaker-independent visual speech recognition (VSR) is a complex task that involves identifying spoken words or phrases from video recordings of a speaker's facial movements. Decoding the intricate visual dynamics of a speaker's mouth in a high-dimensional space is a significant challenge in this field. To address this challenge, researchers have employed advanced techniques that enable machines to recognize human speech automatically through visual cues. Over the years, a considerable amount of VSR research has applied different algorithms and datasets to evaluate system performance. These efforts have resulted in significant progress in developing effective VSR models, creating new opportunities for further research in this area. This survey provides a detailed examination of the progression of VSR over the past three decades, with a particular emphasis on the transition from speaker-dependent to speaker-independent systems. We also provide a comprehensive overview of the various datasets used in VSR research and the preprocessing techniques employed to achieve speaker independence. The survey covers works published from 1990 to 2023, thoroughly analyzing each work and comparing them on various parameters. It outlines the development of VSR systems over time and highlights the need to develop end-to-end pipelines for speaker-independent VSR. The pictorial representation offers a clear and concise overview of the techniques used in speaker-independent VSR, thereby aiding in the comprehension and analysis of the various methodologies. The survey also highlights the strengths and limitations of each technique and provides insights into developing novel approaches for analyzing visual speech cues. Overall, this comprehensive review provides insights into the current state of the art in speaker-independent VSR and highlights potential areas for future research.

“…For instance, IAN [143] utilizes 3D-ResNet [138] for visual representation. DNF [149] subtly designs 2D-CNN with the 1D temporal convolution, which has become one of the mainstream baseline methods. Although CNN-based methods can effectively capture spatial features in gesture images, they are limited in handling the temporal dynamics of gestures directly, and 3D-CNN-based methods involve significant computational overhead.…”

Section: Sign Language Recognition (mentioning)

confidence: 99%

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning

Liu

Wang

et al. 2021

Front. Neurorobot.

Because of redundant information and noisy data generated during single-modal feature extraction, traditional learning algorithms struggle to achieve ideal recognition performance. A multi-modal fusion emotion recognition method for speech expressions based on deep learning is proposed. First, a feature extraction method is set up for each single modality: speech uses a convolutional neural network-long short-term memory (CNN-LSTM) network, and facial expressions in the video use the Inception-ResNet-v2 network to extract feature data. Then, long short-term memory (LSTM) is used to capture the correlations between and within modalities. After feature selection with the chi-square test, the single modalities are concatenated to obtain a unified fusion feature. Finally, the fused features output by the LSTM are used as input to the LIBSVM classifier to perform the final emotion recognition. The experimental results show that the recognition accuracy of the proposed method on the MOSI and MELD datasets is 87.56% and 90.06%, respectively, which is better than the other compared methods. The work lays a theoretical foundation for the application of multimodal fusion in emotion recognition.

Cited by 22 publications

References 8 publications

Human Bone Age Estimation of Carpal Bone X-Ray Using Residual Network with Batch Normalization Classification

Deep Learning-Based Holistic Speaker Independent Visual Speech Recognition

Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning
