Martin Jansche scite author profile

A Support Vector Approach to Censored Targets

Censored targets, such as the time to events in survival analysis, can generally be represented by intervals on the real line. In this paper, we propose a novel support vector technique (named SVCR) for regression on censored targets. Interestingly, this approach provides a general formulation for both standard regression and binary classification tasks. SVCR inherits the strengths of support vector methods, such as a globally optimal solution by convex programming, fast training speed and strong generalization capacity. In contrast to ranking approaches to survival analysis, our approach is able not only to achieve superior ordering performance but also to predict the survival time very well. Controlled experiments show a significant performance improvement when the majority of the training data is censored. Experimental results on several survival analysis datasets verify that SVCR is very competitive against classical survival analysis models.
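To illustrate how an interval representation can cover regression, censoring, and classification at once, here is a minimal sketch. It is not the SVCR formulation from the paper: the function name `interval_loss`, the linear penalty, and the specific interval encodings are assumptions chosen for illustration. The loss is zero when the prediction falls inside the target interval and grows linearly outside it.

```python
import math


def interval_loss(prediction, lower, upper):
    """Hypothetical interval loss: zero inside [lower, upper], linear outside.

    Either bound may be infinite to express a one-sided (censored) target.
    """
    below = max(0.0, lower - prediction)  # penalty for undershooting the interval
    above = max(0.0, prediction - upper)  # penalty for overshooting the interval
    return below + above


# Exact target (standard regression): the interval collapses to a point.
print(interval_loss(3.5, 3.0, 3.0))

# Right-censored target: the event is only known to occur after time 5.
print(interval_loss(2.0, 5.0, math.inf))
print(interval_loss(7.0, 5.0, math.inf))

# Binary classification (assumed encoding): positives map to [1, inf),
# negatives to (-inf, -1], which makes this loss behave like a hinge loss.
print(interval_loss(0.25, 1.0, math.inf))
print(interval_loss(0.25, -math.inf, -1.0))
```

Under these assumed encodings, a point interval gives an absolute-error regression loss and a half-open interval gives a hinge-style classification loss, which is one way to read the abstract's claim of a general formulation.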


Restoring punctuation and capitalization in transcribed speech

Gravano

Jansche

Bacchiani

2009

106


Adding punctuation and capitalization greatly improves the readability of automatic speech transcripts. We discuss an approach for performing both tasks in a single pass using a purely text-based n-gram language model. We study the effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount of training data (from 58 million to 55 billion tokens). Our results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.


Maximum expected F-measure training of logistic regression models

Jansche

2005


We consider the problem of training logistic regression models for binary classification in information extraction and information retrieval tasks. Fitting probabilistic models for use with such tasks should take into account the demands of the task-specific utility function, in this case the well-known F-measure, which combines recall and precision into a global measure of utility. We develop a training procedure based on empirical risk minimization / utility maximization and evaluate it on a simple extraction task.
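For readers unfamiliar with the F-measure mentioned above, the sketch below shows the standard weighted definition, F = (1 + β²)·P·R / (β²·P + R), together with one way it could be evaluated on predicted probabilities rather than hard decisions. The soft-count variant, and the names `f_measure` and `soft_f_measure`, are illustrative assumptions and not necessarily the training procedure developed in the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-measure: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def soft_f_measure(probs, labels, beta=1.0):
    """Assumed soft variant: replace hard counts with expected counts.

    probs  -- predicted probabilities of the positive class
    labels -- gold labels, 1 for positive and 0 for negative
    """
    true_pos = sum(p * y for p, y in zip(probs, labels))  # expected true positives
    predicted_pos = sum(probs)                            # expected predicted positives
    actual_pos = sum(labels)                              # gold positives
    if predicted_pos == 0 or actual_pos == 0:
        return 0.0
    return f_measure(true_pos / predicted_pos, true_pos / actual_pos, beta)


print(f_measure(0.5, 0.5))
print(soft_f_measure([0.9, 0.2, 0.6], [1, 0, 1]))
```

Because the soft version is a smooth function of the predicted probabilities, it can in principle be optimized directly with gradient-based methods, which is the general motivation for utility-driven training.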


A Step-by-Step Process for Building TTS Voices Using Open Source Data and Frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese

Sodimana

Pipatsrisawat

et al.

2018


Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali

Kjartansson

Sarin

Pipatsrisawat

et al.

2018


We present speech corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali. Each corpus consists of an average of approximately 200k recorded utterances that were provided by native-speaker volunteers in the respective region. Recordings were made using portable consumer electronics in reasonably quiet environments. For each recorded utterance the textual prompt and an anonymized hexadecimal identifier of the speaker are available. Biographical information of the speakers is unavailable. In particular, the speakers come from an unspecified mix of genders. The recordings are suitable for research on acoustic modeling for speech recognition, for example. To validate the integrity of the corpora and their suitability for speech recognition research, we provide simple recipes that illustrate how they can be used with the open-source Kaldi speech recognition toolkit. The corpora are being made available under a Creative Commons license in the hope that they will stimulate further research on these languages.

