HAVRUS Corpus: High-Speed Recordings of Audio-Visual Russian Speech

Verkhodanova, Vasilisa; Ronzhin, Andrey; Kipyatkova, Irina; Ivanko, Denis; Karpov, Alexey; Železný, Miloš

doi:10.1007/978-3-319-43958-7_40

Cited by 26 publications

(13 citation statements)

References 10 publications

“…Among them we find the single speaker RM-3000 corpus [54] which contains a vocabulary of 1,000 different words and 3,000 utterances. In contrast, we find several multi-speaker databases, namely OuluVS2 [82], TCD-TIMIT [84], HAVRUS [85], IBM AV-ASR [83], VLRF [37] and AV Digits [86], which contain 53, 62, 20, 262, 24 and 53 subjects, respectively. OuluVS2 contains recordings of speakers uttering phrases and sentences; each speaker repeated three times a set of 10 daily-use phrases (similar to OuluVS) and read 10 TIMIT sentences randomly chosen from a total of 530 sentences.…”

Section: Word and Sentence Recognition

mentioning

confidence: 99%

“…OuluVS2 contains recordings of speakers uttering phrases and sentences; each speaker repeated three times a set of 10 daily-use phrases (similar to OuluVS) and read 10 TIMIT sentences randomly chosen from a total of 530 sentences. On the other hand, the TCD-TIMIT dataset contains more than 6,900 different sentences and nearly 14,000 utterances while the HAVRUS database [85], in Russian, provides 4,000 utterances from 20 speakers. The IBM AV-ASR database is a large corpus whose sentences contain more than 10,000 words, but unfortunately it is not publicly available.…”

Section: Word and Sentence Recognition

mentioning

confidence: 99%

“…For these reasons, many authors have proposed different phoneme-to-viseme mappings, with various definitions and numbers of visemes [28,29,30,31,32,33,18]. In contrast, other authors dispute the existence of visemes and defend that visual ambiguities can be completely resolved using context from neighboring characters, words or a language model [16,34,19,25]. They argue that working through visemes to understand speech is an irrecoverable loss of information.…”

mentioning

confidence: 99%

Survey on automatic lip-reading in the era of deep learning

Fernandez-Lopez

Sukno

2018

Image and Vision Computing

In the last few years, there has been an increasing interest in developing systems for Automatic LipReading (ALR). Similarly to other computer vision applications, methods based on Deep Learning (DL) have become very popular and have permitted to substantially push forward the achievable performance. In this survey, we review ALR research during the last decade, highlighting the progression from approaches previous to DL (which we refer to as traditional) toward end-to-end DL architectures. We provide a comprehensive list of the audiovisual databases available for lipreading, describing what tasks they can be used for, their popularity and their most important characteristics, such as the number of speakers, vocabulary size, recording settings and total duration. In correspondence with the shift toward DL, we show that there is a clear tendency toward large-scale datasets targeting realistic application settings and large numbers of samples per class. On the other hand, we summarize, discuss and compare the different ALR systems proposed in the last decade, separately considering traditional and DL approaches. We address a quantitative analysis of the different systems by organizing them in terms of the task that they target (e.g. recognition of letters or digits and words or sentences) and comparing their reported performance in the most commonly used datasets. As a result, we find that DL architectures perform similarly to traditional ones for simpler tasks but report significant improvements in more complex tasks, such as word or sentence recognition, with up to 40% improvement in word recognition rates. Hence, we provide a detailed description of the available ALR systems based on end-to-end DL architectures and identify a tendency to focus on the modeling of temporal context as the key to advance the field. Such modeling is dominated by recurrent neural networks due to their ability to retain context at multiple scales (e.g. short- and long-term information). In this sense, current efforts tend toward techniques that allow a more comprehensive modeling and interpretability of the retained context.

“…IV2 [192] database was a sentence level database based on French, with 300 people participating in the recording, each speaking 15 French sentences. There are also Czech databases UWB-05-HSAVC [187] and UWB-07-ICAV [193], NDUTAVSC [168] database for German, HAVRUS [199] database for Russian, BL [196] database for French, and VLRF [202] database for Spanish.…”

Section: Word Phrase and Sentence Recognition

mentioning

confidence: 99%

A Survey of Research on Lipreading Technology

et al. 2020

“…The database used was HAVRUS [3], a corpus of audio-visual Russian speech with high-speed video recordings. The corpus consists of recordings of 20 Russian speakers (10 men and 10 women), each of whom uttered 200 selected phrases: 130 training phrases were taken from two phonetically representative texts and were the same for all speakers, while 70 test phrases were telephone numbers and differed for every speaker.…”

unclassified

Accuracy increase for automatic visual Russian speech recognition: viseme classes optimization

Викторович¹,

Валерьевич²,

Анатольевич³

2018

Naučno-teh. vestn. inf. tehnol. meh. opt.

Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2018, vol. 18, no. 2, pp. 346-349 (in Russian). doi: 10.17586/2226-1494-2018

Abstract: Nowadays there are many ongoing studies on the correct viseme classes to be used for the most effective automatic lipreading. The paper proposes a structured approach for the development of speaker-dependent classes of visemes. This method
