A review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings are provided. The new corpus, containing 31 hours of recordings, was created specifically to assist the development of audio-visual speech recognition (AVSR) systems. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded using both a microphone array and a microphone built into a mobile computer. To support the training of AVSR systems, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Owing to the inclusion of recordings made in noisy conditions, the corpus can also be used for testing the robustness of speech recognition systems in the presence of acoustic background noise. The process of building the corpus, including the recording, labeling and post-processing phases, is described in the paper. Results achieved with the developed audio-visual automatic speech recognition (ASR) engine, trained and tested with the material contained in the corpus, are presented and discussed together with comparative test results employing a state-of-the-art/commercial ASR engine. To demonstrate its practical use, the corpus is made publicly available.
An experimental system comprising dynamic handwritten signature verification, face recognition, bank client voice recognition and hand vein distribution verification was engineered and deployed in 100 copies in a real banking environment. The main purpose of the presented research was to analyze questionnaire responses reflecting user opinions on comfort, ergonomics, intuitiveness and other aspects of the biometric enrollment process. The analytical studies and experimental work conducted in the course of this research are intended to lead towards methodologies and solutions for multimodal biometric technology, which is planned for further development. Before that stage is reached, a study was conducted on the usefulness of data acquired from a variety of biometric sensors and from survey questionnaires filled in by bank tellers and clients. The decision-related sets were approximated by the Rough Set method, which offers efficient algorithms and tools for finding hidden patterns in data. The quality of the evaluated biometric data was predicted with the developed method, based on enrollment samples and on subjective user opinions. After an introduction to the principles of the applied biometric identity verification methods, the knowledge modelling approach is presented together with the achieved results and conclusions.
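To make the Rough Set step above concrete, the following is a minimal sketch of the standard lower and upper approximations of a decision set. It is not the authors' implementation: the attribute names (`comfort`, `intuitive`), the record ids and the `good_quality` set are hypothetical toy values chosen only for illustration, and the helper `approximations` is an assumed name.

```python
from collections import defaultdict

def approximations(records, condition_attrs, target):
    """Return (lower, upper) rough-set approximations of `target`.

    records: dict mapping record id -> dict of attribute values
    condition_attrs: attributes used to decide indiscernibility
    target: set of record ids forming the decision set to approximate
    """
    # Group records into indiscernibility classes: records that share
    # the same values on every condition attribute cannot be told apart.
    classes = defaultdict(set)
    for rec_id, attrs in records.items():
        key = tuple(attrs[a] for a in condition_attrs)
        classes[key].add(rec_id)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # class lies entirely inside the target
            lower |= members
        if members & target:    # class overlaps the target
            upper |= members
    return lower, upper

# Hypothetical questionnaire-style data (not taken from the study).
records = {
    "u1": {"comfort": "high", "intuitive": "yes"},
    "u2": {"comfort": "high", "intuitive": "yes"},
    "u3": {"comfort": "low", "intuitive": "yes"},
    "u4": {"comfort": "low", "intuitive": "no"},
    "u5": {"comfort": "high", "intuitive": "no"},
}
good_quality = {"u1", "u3", "u5"}  # assumed decision set

lower, upper = approximations(records, ["comfort", "intuitive"], good_quality)
boundary = upper - lower
```

In this toy case `u1` and `u2` give identical answers but only `u1` belongs to the decision set, so both fall into the boundary region: the condition attributes alone cannot decide their membership.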
A method for visual detection of lip contours in frontal recordings of speakers is described and evaluated. The purpose of the method is to facilitate speech recognition with visual features extracted from the mouth region. Different Active Appearance Models (AAMs) are employed for finding lips in video frames and for the statistical description of lip shape and texture. A search initialization procedure is proposed, and error measure values are monitored in order to prevent the matching process from converging to a false local minimum. AAM-based visual features are applied in an experiment devoted to the static recognition of English vowels with an SVM. The studies are based on a database of recordings of 5 speakers of different skin colors. Results are thoroughly discussed and illustrated with figures.