This paper describes our work on the development of an audio segmentation, classification and clustering system applied to a Broadcast News task for the European Portuguese language. We developed a new audio segmentation algorithm that is both accurate and less computationally demanding than other approaches. Our speaker clustering module uses a modified BIC algorithm that performs substantially better than the standard KL2 metric and is much faster than the full BIC. Finally, we developed a scheme for tagging certain speaker clusters (anchors) using trained cluster models. A series of tests shows the advantage of the new algorithms. This system is part of a prototype that processes the main news show of the national Portuguese broadcaster every day.
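The BIC-based clustering mentioned above rests on a standard change/merge criterion: model two segments separately and jointly as full-covariance Gaussians and compare penalized likelihoods. The sketch below shows the conventional ΔBIC score (it is not the paper's modified variant, whose details are not given in the abstract); a positive value suggests the two segments come from different speakers.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Conventional ΔBIC between two feature segments (rows = frames).

    Each segment and their union is modeled as a single full-covariance
    Gaussian; a positive score suggests different speakers (keep the
    split), a negative score favors merging. `lam` is the usual
    penalty weight, tuned on development data in practice.
    """
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(z)
    d = z.shape[1]
    # log-determinant of the sample covariance of a segment
    ld = lambda s: np.linalg.slogdet(np.cov(s, rowvar=False))[1]
    # model-complexity penalty: d mean + d(d+1)/2 covariance parameters
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * ld(z) - n1 * ld(x) - n2 * ld(y)) - penalty
```

In agglomerative clustering the same score is evaluated for every candidate cluster pair, and the pair with the lowest (most negative) ΔBIC is merged until no negative score remains.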
The subtitling of broadcast news programs is becoming a very interesting application thanks to technological advances in Automatic Speech Recognition and associated technologies. However, building this kind of system requires advances both in the technological components and in the integration of the main blocks. In this paper, we present the overall architecture of a subtitling system running daily at RTP (the Portuguese public broadcast company). The goal is to integrate our components into a system for the subtitling of RTP programs. The global system covers the subtitling of both recorded and live programs.
This paper describes our work on the development of a large vocabulary continuous speech recognition system applied to a Broadcast News task for the European Portuguese language in the scope of the ALERT project. We start by presenting the baseline recogniser AUDIMUS, which was originally developed with a corpus of read newspaper text. This is a hybrid system that combines phone probabilities generated by several MLPs trained on distinct feature sets. The paper details the modifications introduced in this system, namely the development of a new language model, the vocabulary and pronunciation lexicon, and the training with the new data currently available from the ALERT BN corpus. The system trained with this BN corpus achieved 18.4% WER when tested in the F0 focus condition (studio, planned, native, clean) and 35.2% when tested across all focus conditions.
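The hybrid architecture described above combines per-frame phone posteriors from several MLPs, each fed a different feature stream. The abstract does not state the exact fusion rule used in AUDIMUS; a common choice for such hybrid systems, sketched below, is frame-level averaging of log-probabilities (a product rule) followed by renormalization.

```python
import numpy as np

def combine_posteriors(streams):
    """Merge per-frame phone posteriors from several MLPs.

    `streams` is a list of (frames, phones) arrays, one per feature
    stream. Log-averaging (geometric mean) is one common fusion rule
    for hybrid HMM/MLP systems; it is an assumption here, not the
    documented AUDIMUS rule.
    """
    # clip to avoid log(0), then average in the log domain
    logp = np.mean([np.log(np.clip(s, 1e-12, None)) for s in streams], axis=0)
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)  # renormalize each frame
```

The combined posteriors are then converted to scaled likelihoods (division by phone priors) before Viterbi decoding, as is standard in hybrid recognisers.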
In this work the problem of automatically decomposing video into elementary semantic units, known in the literature as scenes, is addressed. Two multi-modal automatic scene segmentation techniques are proposed, both building upon the Scene Transition Graph (STG). In the first of the proposed approaches, speaker diarization results are used to introduce a post-processing step into the STG construction algorithm, with the objective of discarding scene boundaries erroneously identified on the basis of visual-only dissimilarity. In the second approach, speaker diarization and additional audio analysis results are employed and a separate audio-based STG is constructed, in parallel to the original STG based on visual information. The two STGs are subsequently combined. Preliminary results from the application of the proposed techniques to broadcast videos reveal their improved performance over previous approaches.
This work deals with the problem of automatic temporal segmentation of a video into elementary semantic units known as scenes. Its novelty lies in the use of high-level audio information, in the form of audio events, to improve scene segmentation performance. More specifically, the proposed technique builds upon a recently proposed audio-visual scene segmentation approach that constructs multiple scene transition graphs (STGs), each separately exploiting information from a different modality. In the extension of the latter approach presented in this work, audio event detection results are introduced into the definition of an audio-based scene transition graph, while a visual-based scene transition graph is defined independently. The results of these two types of STGs are subsequently combined. The application of the proposed technique to broadcast videos demonstrates the usefulness of audio events for scene segmentation.
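The final step in both multi-STG approaches is fusing the boundary sets produced by the audio-based and visual-based graphs. The abstracts do not specify the fusion rule, so the sketch below shows one simple hypothetical scheme: keep a visual scene boundary only when the audio STG places a boundary within a small temporal window of it, which discards visually triggered boundaries unsupported by the audio modality.

```python
def combine_boundaries(visual, audio, window=2.0):
    """Hypothetical fusion of scene-boundary timestamps (seconds)
    from a visual-based and an audio-based STG.

    A visual boundary survives only if the audio STG agrees within
    `window` seconds. This is an illustrative rule, not the fusion
    scheme of the cited papers.
    """
    return [t for t in visual
            if any(abs(t - u) <= window for u in audio)]
```

Tightening `window` makes the fusion stricter (higher precision, lower recall); the reverse rule, anchored on audio boundaries, would favor the audio modality instead.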
This paper presents an innovative virtual assistant system that aims to address older adults' needs in a professional environment by proposing promising and innovative virtual assistance mechanisms. The system, named CogniWin, is expected to alleviate possible age-related memory degradation and the gradual decline of other cognitive capabilities (e.g., speed of processing new information, concentration level), while at the same time helping older adults increase their learning abilities through personalized learning assistance and well-being guidance. In this paper we describe the overall system concept, the technological approach, and the methodology used to elicit user needs, and report the first pre-trial evaluation.