CN-CVS: A Mandarin Audio-Visual Dataset for Large Vocabulary Continuous Visual to Speech Synthesis

Chen, Chen; Wang, Dong; Zheng, Thomas Fang

doi:10.1109/icassp49357.2023.10095796

Cited by 4 publications (4 citation statements)

References 18 publications


“…The CN-CVS [107] dataset is a Mandarin Chinese AV dataset consisting of short snippets of human speech extracted from news broadcasts, TV shows, and web-based speech or conversation programs. It contains recordings from over 2500 speakers of different professions and ages.…”

Section: Research Datasets (mentioning)

confidence: 99%

“…It contains recordings from over 2500 speakers of different professions and ages. The dataset [107] is recorded in natural, uncontrolled environments where environmental factors such as lighting conditions may vary between programs or locations. The camera angle and distance also vary within the same video clips [107].…”

Section: Research Datasets (mentioning)

confidence: 99%

“…The dataset [107] is recorded in natural, uncontrolled environments where environmental factors such as lighting conditions may vary between programs or locations. The camera angle and distance also vary within the same video clips [107]. CN-CVS includes both audio and video components, with an average segment length of 6 s. The dataset [107] includes a wide range of content covering different subject areas.…”

Section: Research Datasets (mentioning)

confidence: 99%

“…The camera angle and distance also vary within the same video clips [107]. CN-CVS includes both audio and video components, with an average segment length of 6 s. The dataset [107] includes a wide range of content covering different subject areas.…”

Section: Research Datasets (mentioning)

confidence: 99%


A Review of Recent Advances on Deep Learning Methods for Audio-Visual Speech Recognition

2023


This article provides a detailed review of recent advances in audio-visual speech recognition (AVSR) methods that have been developed over the last decade (2013–2023). Despite the recent success of audio speech recognition systems, the problem of audio-visual (AV) speech decoding remains challenging. In comparison to the previous surveys, we mainly focus on the important progress brought with the introduction of deep learning (DL) to the field and skip the description of long-known traditional “hand-crafted” methods. In addition, we also discuss the recent application of DL toward AV speech fusion and recognition. We first discuss the main AV datasets used in the literature for AVSR experiments since we consider it a data-driven machine learning (ML) task. We then consider the methodology used for visual speech recognition (VSR). Subsequently, we also consider recent AV methodology advances. We then separately discuss the evolution of the core AVSR methods, pre-processing and augmentation techniques, and modality fusion strategies. We conclude the article with a discussion on the current state of AVSR and provide our vision for future research.


Integrated visual transformer and flash attention for lip-to-speech generation GAN

Yang, Bai, Liu et al. 2024

Sci Rep


Lip-to-Speech (LTS) generation is an emerging technology that is highly visible, widely supported, and rapidly evolving. LTS has a wide range of promising applications, including assisting speech impairment and improving speech interaction in virtual assistants and robots. However, the technique faces the following challenges: (1) Chinese lip-to-speech generation is poorly recognized. (2) The wide range of variation in lip-speaking is poorly aligned with lip movements. Addressing these challenges will contribute to advancing Lip-to-Speech (LTS) technology, enhancing the communication abilities, and improving the quality of life for individuals with disabilities. Currently, lip-to-speech generation techniques usually employ the GAN architecture but suffer from the following problems: The primary issue lies in the insufficient joint modeling of local and global lip movements, resulting in visual ambiguities and inadequate image representations. To solve these problems, we design Flash Attention GAN (FA-GAN) with the following features: (1) Vision and audio are separately coded, and lip motion is jointly modelled to improve speech recognition accuracy. (2) A multilevel Swin-transformer is introduced to improve image representation. (3) A hierarchical iterative generator is introduced to improve speech generation. (4) A flash attention mechanism is introduced to improve computational efficiency. Many experiments have indicated that FA-GAN can recognize Chinese and English datasets better than existing architectures, especially the recognition error rate of Chinese, which is only 43.19%, the lowest among the same type.


A Comprehensive Review of Recent Advances in Deep Neural Networks for Lipreading With Sign Language Recognition

Rathipriya, Maheswari 2024

IEEE Access
