Yong-Hyeok Lee scite author profile

Yong-Hyeok Lee

3Publications

4Citation Statements Received

22Citation Statements Given

How they've been cited

How they cite others

Affiliations

Sogang University

Publications

Order By: Most citations

Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

et al. 2020

View full text Add to dashboard Cite

Since attention mechanism was introduced in neural machine translation, attention has been combined with the long short-term memory (LSTM) or replaced the LSTM in a transformer model to overcome the sequence-to-sequence (seq2seq) problems with the LSTM. In contrast to the neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between audio and visual modalities. As a result that the audio has richer information than the video related to lips, AVSR is hard to train attentions with balanced modalities. In order to increase the role of visual modality to a level of audio modality by fully exploiting input information in learning attentions, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector using video query and a video context vector using audio query. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force monotonic alignments required in AVSR. Recognition experiments on LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved at least a relative improvement of 7.3% on average in the word error rate (WER) compared to competing methods based on the transformer model.

show abstract

Overcoming label noise in audio event detection using sequential labeling

Kim¹,

Mun²,

Oh³

et al. 2020

Preprint

View full text Add to dashboard Cite

This paper addresses the noisy label issue in audio event detection (AED) by refining strong labels as sequential labels with inaccurate timestamps removed. In AED, strong labels contain the occurrence of a specific event and its timestamps corresponding to the start and end of the event in an audio clip. The timestamps depend on subjectivity of each annotator, and their label noise is inevitable. Contrary to the strong labels, weak labels indicate only the occurrence of a specific event. They do not have the label noise caused by the timestamps, but the time information is excluded. To fully exploit information from available strong and weak labels, we propose an AED scheme to train with sequential labels in addition to the given strong and weak labels after converting the strong labels into the sequential labels. Using sequential labels consistently improved the performance particularly with the segment-based F-score by focusing on occurrences of events. In the meanteacher-based approach for semi-supervised learning, including an early step with sequential prediction in addition to supervised learning with sequential labels mitigated label noise and inaccurate prediction of the teacher model and improved the segmentbased F-score significantly while maintaining the event-based F-score.

show abstract

PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords

Lee¹,

Cho²

2023

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yong-Hyeok Lee

Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model

Overcoming label noise in audio event detection using sequential labeling

PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords

Contact Info

Product

Resources

About