In this paper we present an algorithm that produces pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems. These features give large performance improvements on tonal languages for ASR systems, and even substantial improvements for non-tonal languages. Our method, which we are calling the Kaldi pitch tracker (because we are adding it to the Kaldi ASR toolkit), is a highly modified version of the getf0 (RAPT) algorithm. Unlike the original getf0 we do not make a hard decision whether any given frame is voiced or unvoiced; instead, we assign a pitch even to unvoiced frames while constraining the pitch trajectory to be continuous. Our algorithm also produces a quantity that can be used as a probability of voicing measure; it is based on the normalized autocorrelation measure that our pitch extractor uses. We present results on data from various languages in the BABEL project, and show a large improvement over systems without tonal features and systems where pitch and POV information was obtained from SAcC or getf0.
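As a rough illustration of the normalized autocorrelation measure mentioned above, the following sketch computes a plain normalized cross-correlation for one frame and picks the lag that maximizes it. It is a simplified sketch and not the Kaldi pitch tracker: the function names, the frame length, and the lag range are assumptions made for the example, and choosing the best lag independently for each frame ignores the continuity constraint on the pitch trajectory that the abstract describes.

```python
import math

def nccf(frame, lag):
    """Normalized cross-correlation between a frame and itself shifted by `lag` samples."""
    n = len(frame) - lag
    a = frame[:n]
    b = frame[lag:lag + n]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # A silent frame has zero energy; report zero correlation instead of dividing by zero.
    return num / den if den > 0 else 0.0

def best_lag(frame, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the highest correlation (no trajectory smoothing)."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: nccf(frame, lag))

# Hypothetical test signal: a pure 200 Hz tone sampled at 8 kHz (period of 40 samples).
sample_rate = 8000
f0 = 200.0
frame = [math.sin(2 * math.pi * f0 * t / sample_rate) for t in range(400)]

lag = best_lag(frame, 20, 60)
pitch_hz = sample_rate / lag
voicing_score = nccf(frame, lag)  # close to 1 for a strongly periodic frame
```

A value of the correlation near 1 indicates a strongly periodic frame, which is why a quantity of this kind can serve as the basis of a voicing measure.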
We describe a lattice generation method that is exact, i.e., it satisfies all the natural properties we would want from a lattice of alternative transcriptions of an utterance. This method does not introduce substantial overhead above one-best decoding. Our method is most directly applicable when using WFST decoders where the WFST is "fully expanded", i.e., where the arcs correspond to HMM transitions. It outputs lattices that include HMM-state-level alignments as well as word labels. The general idea is to create a state-level lattice during decoding, and to do a special form of determinization that retains only the best-scoring path for each word sequence. This special determinization algorithm is a solution to the following problem: Given a WFST A, compute a WFST B that, for each input-symbol sequence of A, contains just the lowest-cost path through A.
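The problem statement above can be made concrete with a small sketch. The code below enumerates every path of a tiny acyclic graph and keeps the cheapest path for each symbol sequence. This brute-force enumeration only illustrates what the output should contain; it is not the determinization algorithm of the paper, and the graph encoding (a dictionary of arcs, `None` for epsilon) is an assumption made for the example.

```python
def best_path_per_sequence(arcs, start, finals):
    """Return {symbol_sequence: (cost, path)} keeping only the cheapest path per sequence.

    arcs maps a state to a list of (next_state, symbol, cost) tuples; a symbol of
    None stands for epsilon. The graph must be acyclic for the enumeration to end.
    """
    best = {}
    stack = [(start, (), 0.0, ())]
    while stack:
        state, symbols, cost, path = stack.pop()
        if state in finals:
            if symbols not in best or cost < best[symbols][0]:
                best[symbols] = (cost, path)
        for arc_id, (nxt, sym, arc_cost) in enumerate(arcs.get(state, [])):
            new_symbols = symbols if sym is None else symbols + (sym,)
            # The path records (state, arc index) pairs, a stand-in for an alignment.
            stack.append((nxt, new_symbols, cost + arc_cost, path + ((state, arc_id),)))
    return best

# Hypothetical toy graph: two competing paths for "a b", one path for "a c".
arcs = {
    0: [(1, "a", 1.0), (2, "a", 0.5)],
    1: [(3, "b", 1.0)],
    2: [(3, "b", 2.0), (3, "c", 0.25)],
}
result = best_path_per_sequence(arcs, start=0, finals={3})
```

In this toy graph the sequence `("a", "b")` has two paths, with costs 2.0 and 2.5, and only the cheaper one survives. Enumeration is exponential in general, which is why a dedicated determinization algorithm is needed for real lattices.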
We analyze and compare two different methods for unsupervised extractive spontaneous speech summarization in the meeting domain. Based on utterance comparison, we introduce an optimal formulation for the widely used greedy maximum marginal relevance (MMR) algorithm. Following the idea that information is spread over the utterances in the form of concepts, we describe a system which finds an optimal selection of utterances covering as many unique important concepts as possible. Both optimization problems are formulated as an integer linear program (ILP) and solved using public domain software. We analyze and discuss the performance of both approaches using various evaluation setups on two well-studied meeting corpora. We conclude on the benefits and drawbacks of the presented models and give an outlook on future aspects to improve extractive meeting summarization.
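For readers unfamiliar with the greedy MMR baseline named above, here is a minimal sketch of it. It shows only the greedy variant, not the optimal ILP formulation the abstract introduces. The Jaccard similarity, the use of the whole meeting as the relevance reference, the trade-off weight `lam`, and the word budget are all assumptions chosen for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_mmr(utterances, max_words, lam=0.7):
    """Greedily select utterance indices, trading relevance against redundancy."""
    document = [word for utt in utterances for word in utt]
    selected, length = [], 0
    remaining = list(range(len(utterances)))

    def score(i):
        relevance = jaccard(utterances[i], document)
        redundancy = max(
            (jaccard(utterances[i], utterances[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining:
        # Only utterances that still fit in the length budget are candidates.
        candidates = [i for i in remaining if length + len(utterances[i]) <= max_words]
        if not candidates:
            break
        best = max(candidates, key=score)
        selected.append(best)
        length += len(utterances[best])
        remaining.remove(best)
    return selected

# Hypothetical tokenized utterances; the second repeats the first.
utterances = [
    ["budget", "deadline", "review"],
    ["budget", "deadline", "review"],
    ["hire", "designer"],
]
summary = greedy_mmr(utterances, max_words=6)
```

With these inputs the repeated utterance is penalized for redundancy, so the selection is the first and third utterances. Because each choice is made locally, the result is not guaranteed to be the best selection overall, which is the motivation for an optimal formulation.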
We introduce a model for extractive meeting summarization based on the hypothesis that utterances convey bits of information, or concepts. Using keyphrases as concepts weighted by frequency, and an integer linear program to determine the best set of utterances, that is, covering as many concepts as possible while satisfying a length constraint, we achieve ROUGE scores at least as good as a ROUGE-based oracle derived from human summaries. This brings us to a critical discussion of ROUGE and the future of extractive meeting summarization.
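The selection objective described above (cover as much concept weight as possible within a length budget) can be sketched on a toy input. The code below finds the optimum by exhaustive search over subsets instead of an integer linear program, so it is only workable for a handful of utterances. The utterance lengths, concepts, and weights are invented for the example.

```python
from itertools import combinations

def best_summary(utterances, weights, max_words):
    """Exhaustively find the subset maximizing the total weight of covered concepts.

    utterances is a list of (length_in_words, set_of_concepts) pairs. Each concept
    counts once no matter how many selected utterances contain it.
    """
    best_value, best_subset = 0, ()
    indices = range(len(utterances))
    for k in range(1, len(utterances) + 1):
        for subset in combinations(indices, k):
            if sum(utterances[i][0] for i in subset) > max_words:
                continue  # violates the length constraint
            covered = set().union(*(utterances[i][1] for i in subset))
            value = sum(weights[c] for c in covered)
            if value > best_value:
                best_value, best_subset = value, subset
    return best_value, best_subset

# Hypothetical data: (length in words, concepts conveyed) and concept weights.
utterances = [
    (5, {"budget", "deadline"}),
    (4, {"budget"}),
    (3, {"designer"}),
    (6, {"deadline", "designer", "launch"}),
]
weights = {"budget": 3, "deadline": 2, "designer": 2, "launch": 1}
value, chosen = best_summary(utterances, weights, max_words=10)
```

Here the best selection under a 10-word budget is utterances 1 and 3, covering all four concepts for a total weight of 8. Selecting the individually best utterance (index 0) first would cap the reachable weight at 7, which illustrates why a global optimization can beat a greedy choice.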
The CALO Meeting Assistant (MA) provides for distributed meeting capture, annotation, automatic transcription and semantic analysis of multiparty meetings, and is part of the larger CALO personal assistant system. This paper presents the CALO-MA architecture and its speech recognition and understanding components, which include real-time and offline speech transcription, dialog act segmentation and tagging, topic identification and segmentation, question-answer pair identification, action item recognition, decision extraction, and summarization.
The CALO Meeting Assistant provides for distributed meeting capture, annotation, automatic transcription and semantic analysis of multiparty meetings, and is part of the larger CALO personal assistant system. This paper summarizes the CALO-MA architecture and its speech recognition and understanding components, which include real-time and offline speech transcription, dialog act segmentation and tagging, question-answer pair identification, action item recognition, decision extraction, and summarization.
With the advent of smart-home devices providing voice-based interfaces, such as Amazon Alexa or Apple Siri, voice data is constantly transferred to cloud services for automated speech recognition or speaker verification. While this development enables intriguing new applications, it also poses significant risks: voice data is highly sensitive since it contains biometric information of the speaker as well as the spoken words. This data may be abused if not protected properly, so the security and privacy of billions of end-users are at stake. We tackle this challenge by proposing an architecture, dubbed VoiceGuard, that efficiently protects the speech processing task inside a trusted execution environment (TEE). Our solution preserves the privacy of users while not requiring the service provider to reveal model parameters. Our architecture can be extended to enable user-specific models, such as feature transformations (including fMLLR), i-vectors, or model transformations (e.g., custom output layers). It also generalizes to secure on-premise solutions, allowing vendors to securely ship their models to customers. We provide a proof-of-concept implementation and evaluate it on the Resource Management and WSJ speech recognition tasks isolated with Intel SGX, a widely available TEE implementation, demonstrating even real-time processing capabilities.