2019
DOI: 10.5391/ijfis.2019.19.1.1
Visual Speech Recognition of Korean Words Using Convolutional Neural Network

Abstract: In recent studies, speech recognition performance has been greatly improved by using HMMs and CNNs: the HMM performs statistical modeling of the voice to construct an acoustic model, while the CNN reduces the error rate by predicting speech from images of the mouth region. In this paper, we propose visual speech recognition (VSR) using lip images. To implement VSR, we repeatedly recorded three subjects speaking 53 words chosen from an emergency medical service vocabulary book. To extract images of consonants, vowels, and fin…
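The pipeline the abstract describes (mouth-region images classified by a CNN into one of 53 words) can be illustrated with a minimal sketch. This is an assumption-laden placeholder, not the authors' published network: the `LipCNN` name, the 64x64 grayscale input size, and all layer widths below are invented for illustration; only the 53-class output comes from the paper.

```python
import torch
import torch.nn as nn

class LipCNN(nn.Module):
    """Minimal CNN word classifier over grayscale mouth-region crops.

    Hypothetical architecture for illustration only; the paper's actual
    network layout is not reproduced here.
    """
    def __init__(self, num_classes: int = 53):  # 53 words per the abstract
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 1-channel lip crop
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),                 # one logit per word
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = LipCNN()
dummy = torch.randn(8, 1, 64, 64)  # batch of 8 hypothetical 64x64 lip crops
logits = model(dummy)              # shape: (8, 53), one score per word class
```

Trained with a standard cross-entropy loss over the word labels, such a classifier would map each lip image (or stacked frames, if the channel dimension is widened) to a vocabulary entry; the recorded 53-word emergency-vocabulary dataset would supply the training pairs.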

Cited by 3 publications (2 citation statements)
References 14 publications
“…Jo et al [16] collected predefined syllables of a single speaker with 7 views. Also, Lee and Park [22] and Lee et al [23] collected predefined word utterances, such as digits and city names, of 56 and 9 speakers, respectively. Unfortunately, the size of all the datasets is too minuscule to support deep learning-driven models; moreover, some are not publicly available.…”
Section: Related Work
confidence: 99%
“…1) Korean speech recognition: Not only the size of OLKAVS greatly outplays the previous audio-visual speech datasets [16,22,23], but also comparable to the audio-only Korean speech dataset [25]. Also, our pre-trained audio-visual speech recognition model can be useful when fine-tuned to other languages [10].…”
Section: Additional Use Cases
confidence: 99%