2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9004036
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

Cited by 84 publications (65 citation statements). References 14 publications.
“…Training. For training, we use over 50k hours of transcribed short YouTube video segments extracted with the semi-supervised procedure originally proposed in [17] and extended in [1,18] to include video. We extract short segments where the force-aligned user-uploaded transcription matches the transcription from a production-quality ASR system.…”
Section: Datasets
confidence: 99%
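The semi-supervised selection procedure quoted above keeps only segments where the force-aligned, user-uploaded transcript agrees with the output of a production-quality ASR system. A minimal sketch of that filtering step, assuming a simple `Segment` record and exact-match-after-normalization as the agreement criterion (both are illustrative assumptions, not the paper's actual pipeline):

```python
# Hypothetical sketch of the transcript-agreement filter described in the
# excerpt above. The Segment type, field names, and normalization rules are
# illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # segment start time (seconds)
    end: float            # segment end time (seconds)
    user_transcript: str  # force-aligned, user-uploaded transcript
    asr_transcript: str   # production-quality ASR hypothesis

def normalize(text: str) -> list[str]:
    """Case-fold, drop punctuation, and tokenize so cosmetic differences
    (casing, commas) do not discard an otherwise agreeing segment."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return kept.split()

def select_training_segments(segments: list[Segment]) -> list[Segment]:
    """Keep only segments whose two transcripts agree after normalization."""
    return [s for s in segments
            if normalize(s.user_transcript) == normalize(s.asr_transcript)]

segments = [
    Segment(0.0, 2.1, "Hello, world!", "hello world"),  # agrees -> kept
    Segment(2.1, 4.0, "foo bar", "foo baz"),            # disagrees -> dropped
]
kept = select_training_segments(segments)
```

Exact string match is a deliberately strict criterion: at the scale of YouTube-sized corpora it trades recall for label quality, which matches the excerpt's goal of harvesting reliably transcribed training data.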
“…Up until recently, audio-visual ASR had been studied only under the ideal scenario where the video of a single speaking face matches the audio [1,2,3,4,5,6]. However, at inference time when multiple people are simultaneously visible on screen we need to decide which face to feed to the model, and this issue of active speaker selection has traditionally been considered a separate problem [7,8].…”
Section: Introduction
confidence: 99%
“…Extensive studies of audio-visual speech recognition technologies have been conducted in recent years and have demonstrated their efficacy in improving speech recognition performance under both clean and adverse conditions [32], [35], [52]-[57]. Following [49], in this work, the convolutional long short-term memory fully connected neural network (CLDNN) [58] is adopted as the recognition back-end system architecture.…”
Section: A. Audio-Visual Speech Recognition Back-End
confidence: 99%
“…Several audio-visual approaches have recently been presented in which pre-computed visual or audio features are used [1,19,25,29,34]. Afouras et al. developed a transformer-based sequence-to-sequence model using pre-computed visual features and log-Mel filter-bank features as inputs.…”
Section: Introduction
confidence: 99%