Abstract—Discriminative language modeling (DLM) is a feature-based approach that is used as an error-correcting step after hypothesis generation in automatic speech recognition (ASR). We formulate this both as a classification and a ranking problem and employ the perceptron, the margin infused relaxed algorithm (MIRA) and the support vector machine (SVM). To decrease training complexity, we try count-based thresholding for feature selection and data sampling from the list of hypotheses. On a Turkish morphology-based feature set we examine the use of first and higher order n-grams and present an extensive analysis of the complexity and accuracy of the models with an emphasis on statistical significance. We find that we can save significantly on computation by feature selection and data sampling, without significant loss in accuracy. Using MIRA or the SVM does not lead to any further improvement over the perceptron, but the use of ranking as opposed to classification leads to a 0.4% reduction in word error rate (WER), which is statistically significant. Index Terms—Discriminative language modeling (DLM), feature selection, data sampling, language modeling, ranking perceptron, ranking support vector machine (SVM), margin infused relaxed algorithm (MIRA), ranking MIRA, speech recognition.
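The reranking setup described above can be sketched as a single perceptron update over an ASR N-best list: score each hypothesis with the current feature weights, and if the top-scoring hypothesis is not the oracle (lowest-WER) one, move the weights toward the oracle's features and away from the prediction's. This is a minimal illustrative sketch, not the paper's implementation; the function names and the toy unigram featurizer are our own (the paper uses Turkish morphology-based n-gram features).

```python
def perceptron_rerank_update(weights, featurize, hyps, oracle_idx, lr=1.0):
    """One update of a reranking perceptron over an N-best list.

    weights: dict mapping feature -> weight (updated in place)
    featurize: function mapping a hypothesis to a dict of feature counts
    hyps: list of candidate hypotheses (e.g. lists of words)
    oracle_idx: index of the lowest-WER hypothesis in hyps
    """
    def score(h):
        return sum(weights.get(f, 0.0) * v for f, v in featurize(h).items())

    # Pick the hypothesis the current model prefers.
    pred_idx = max(range(len(hyps)), key=lambda i: score(hyps[i]))
    if pred_idx != oracle_idx:
        # Promote the oracle's features, demote the prediction's.
        for f, v in featurize(hyps[oracle_idx]).items():
            weights[f] = weights.get(f, 0.0) + lr * v
        for f, v in featurize(hyps[pred_idx]).items():
            weights[f] = weights.get(f, 0.0) - lr * v
    return weights

# Toy unigram featurizer, purely for illustration.
def unigram_feats(words):
    feats = {}
    for w in words:
        feats[w] = feats.get(w, 0) + 1
    return feats

hyps = [["good", "sentence"], ["bad", "sentence"]]
weights = perceptron_rerank_update({}, unigram_feats, hyps, oracle_idx=1)
```

After this single update the model prefers the oracle hypothesis; shared features (here "sentence") cancel out, so only discriminating features receive nonzero weight.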
The objective of this study is to automatically extract annotated sign data from broadcast news recordings for the hearing impaired. These recordings are an excellent source for automatically generating annotated data: in news for the hearing impaired, the speaker also signs with the hands as she talks. In addition, corresponding sliding text is superimposed on the video. The video of the signer can be segmented with the help of either the speech alone or both the speech and the text, generating segmented and annotated sign videos. We call this application Signiary, and aim to use it as a sign dictionary where users enter a word as text and retrieve videos of the related sign. The application can also be used to automatically create annotated sign databases for training recognizers.
We present our work on semi-supervised learning of discriminative language models, where the negative examples for sentences in a text corpus are generated using confusion models for Turkish at various granularities, specifically the word, subword, syllable and phone levels. We experiment with different language models and various sampling strategies to select competing hypotheses for training with a variant of the perceptron algorithm. We find that morph-based confusion models with a sample selection strategy that aims to match the error distribution of the baseline ASR system give the best performance. We also observe that substituting half of the supervised training examples with those obtained in a semi-supervised manner gives similar results.
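The idea of generating negative examples from clean text with a confusion model can be sketched as follows. This is a simplified word-level illustration under our own assumptions (the function name, substitution probability, and confusion table are hypothetical; the paper's best models operate at the morph level and tune sampling to match the baseline ASR error distribution).

```python
import random

def corrupt_sentence(words, confusions, sub_prob=0.2, rng=None):
    """Generate an ASR-like negative example from a clean sentence.

    words: list of words from the text corpus
    confusions: dict mapping a word to a list of confusable alternatives
    sub_prob: probability of substituting a word that has alternatives
    """
    rng = rng or random.Random(0)
    out = []
    for w in words:
        alts = confusions.get(w)
        if alts and rng.random() < sub_prob:
            out.append(rng.choice(alts))  # substitute a confusable word
        else:
            out.append(w)                 # keep the original word
    return out

# With sub_prob=1.0 every confusable word is substituted.
neg = corrupt_sentence(["kedi", "geldi"], {"kedi": ["keti"]}, sub_prob=1.0)
```

Pairs of the clean sentence (positive) and its corrupted variants (negatives) can then be fed to a perceptron-style discriminative trainer without running the recognizer on acoustic data.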
The aim of this paper is to help two people, one hearing-impaired and one visually-impaired, communicate by converting speech to fingerspelling and fingerspelling to speech. Fingerspelling is a subset of sign language that uses finger signs to spell letters of the spoken or written language. We aim to convert fingerspelled words to speech and vice versa. Different spoken and sign languages, such as English, Russian, Turkish and Czech, are considered.
Abstract—GMM supervectors are among the most popular feature sets used in SVM-based text-independent speaker verification systems. Most studies use only a single supervector to represent speaker characteristics against a set of background samples. An alternative is to divide the total training duration into smaller pieces to increase the number of supervectors for training the minority (speaker) class. Similarly, the total test duration can also be partitioned, with the final verification made by majority voting over decisions on the smaller durations. We explore the performance of speaker verification systems in terms of equal error rate (EER) and minimum detection cost function (minDCF) by breaking the input sequence into durations of 4 minutes, 1 minute and 10 seconds. We try different training/test data amounts to investigate the generalizability of this approach. Working on the CSLU Speaker Recognition Dataset, we show that the lowest error rates are obtained when the representative duration of the training supervectors is set equal to that of the test samples.
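The majority-voting step over partitioned test durations can be sketched in a few lines. This is an illustrative sketch only; the function name and the zero decision threshold are our assumptions (in practice each segment's score would come from the trained SVM and the threshold would be calibrated).

```python
def verify_by_voting(segment_scores, threshold=0.0):
    """Accept the claimed speaker if a majority of test segments
    score above the decision threshold.

    segment_scores: one verification score per test-duration segment
    (e.g. SVM decision values for 10-second chunks).
    """
    votes = sum(1 for s in segment_scores if s > threshold)
    return votes > len(segment_scores) / 2

# Three segments, two above threshold: accepted.
accept = verify_by_voting([0.7, -0.1, 0.4])
# Three segments, only one above threshold: rejected.
reject = verify_by_voting([-0.5, -0.9, 0.2])
```

Voting over short segments trades per-segment reliability for more decisions; the abstract's finding is that matching the training supervector duration to the test duration is what keeps the per-segment decisions accurate.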