Topic segmentation methods are mostly based on the idea of lexical cohesion: lexical distributions are analysed across the document, and segment boundaries are marked in areas of low cohesion. We propose a novel approach to topic segmentation in speech recognition transcripts that measures lexical cohesion using bidirectional Recurrent Neural Networks (RNNs). The bidirectional RNNs capture the context of the preceding and the following words, and the two contexts are compared to detect topic changes. In contrast to existing work based on sequence and discriminative models for topic segmentation, our approach uses neither a segmented corpus nor (pseudo) topic labels for training. Our model is trained on news articles obtained from the internet. Evaluation on ASR transcripts of French TV broadcast news programs demonstrates the effectiveness of the proposed approach.

Index Terms: topic segmentation, recurrent neural networks

INTRODUCTION

Topic segmentation, the task of automatically breaking a text document down into topically coherent segments, has been studied for a long time. With the increase in multimedia content on the internet, there has been interest in extending topic segmentation to audio-video documents. Multimedia documents such as broadcast news programs, meeting recordings, telephone conversations and lectures commonly contain information on more than one topic. For example, broadcast news programs present events related to politics, the economy, sports, weather and so on. Automatic segmentation of such documents into coherent segments is required by several downstream tasks, such as topic detection and tracking [1], summarisation, named entity extraction, and multimedia indexing and organisation [2].

* This work was performed while the author was a member of the Multispeech team
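The idea stated in the abstract, comparing the context before and after each word position and marking boundaries where cohesion is low, can be illustrated with a minimal sketch. This is not the proposed model: the bidirectional RNN representations are replaced here by plain averages of word vectors over a fixed window, and the names `cohesion_scores` and `detect_boundaries`, the `window` size and the `threshold` are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cohesion_scores(word_vectors, window):
    """Map each candidate position t to the similarity between the
    `window` words before t and the `window` words starting at t.

    Assumption: window averages stand in for the past and following
    context representations that the paper obtains from bidirectional RNNs.
    """
    scores = {}
    for t in range(window, len(word_vectors) - window + 1):
        past = mean_vector(word_vectors[t - window:t])
        following = mean_vector(word_vectors[t:t + window])
        scores[t] = cosine(past, following)
    return scores


def detect_boundaries(scores, threshold):
    """Return positions whose score is below `threshold` and no higher
    than the scores of the neighbouring positions (a local minimum)."""
    positions = sorted(scores)
    boundaries = []
    for i, t in enumerate(positions):
        s = scores[t]
        left = scores[positions[i - 1]] if i > 0 else float("inf")
        right = scores[positions[i + 1]] if i + 1 < len(positions) else float("inf")
        if s < threshold and s <= left and s <= right:
            boundaries.append(t)
    return boundaries


# Toy input: four words about one "topic" followed by four about another.
toy_vectors = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
toy_scores = cohesion_scores(toy_vectors, window=2)
toy_boundaries = detect_boundaries(toy_scores, threshold=0.5)
```

On the toy input, the similarity drops to zero at the position where the second group of words begins, so that position is the only one returned as a boundary. In practice the window size and threshold would need tuning, and a learned context representation would replace the window average.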