Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020

DOI: 10.18653/v1/2020.emnlp-main.672

Chapter Captor: Text Segmentation in Novels

Charuta Pethe, Steven Skiena

Abstract: Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after rem…

Cited by 11 publications (17 citation statements). References 24 publications.

Citation types: supporting 0, mentioning 14, contrasting 0.

“…Text segmentation (Pethe, Kim & Skiena, 2020; Haruechaiyasak, Kongyoung & Dailey, 2008; Koshorek et al, 2018; Li et al, 2020; Lukasik et al, 2020; Nguyen et al, 2021) is a method that is typically used to segment chapters by separating text into multiple segments or boundaries. It has also been used in many natural language processing tasks, such as word tokenization, text summarization (Hulliyah & Kusuma, 2010; Awasthi et al, 2021), question answering prediction (Wang, Ling & Hu, 2019), and machine translation (Kong, Zhang & Hovy, 2020; Gupta et al, 2021; Budiwati & Aritsugi, 2022).…”

Section: Introduction (mentioning)

confidence: 99%

“…Recent research has proposed the building of a deep-learning system to automatically identify chapter boundaries. For example, Pethe, Kim & Skiena (2020) proposed a Chapter Captor used to correctly recognize chapter breakpoints in novels. They proposed using Bidirectional Encoder Representations from Transformers (Devlin et al, 2018) to learn the semantic features and generate token-wise softmax probabilities.…”

Section: Introduction (mentioning)

confidence: 99%

“…Considering these limitations of BERT (Pethe, Kim & Skiena, 2020), we instead propose the use of the XLNet model (Yang et al, 2019). The XLNet model is a generalized auto-regressive model that uses a permutation language model that helps the model learn a bidirectional context.…”

Section: Introduction (mentioning)

confidence: 99%

“…The paragraph-level attention model had an F1 score of 0.8084, while the ensemble method improved its F1 score to 0.8856. Our results were then compared with the best methods found in Pethe, Kim & Skiena (2020) (BERT Break Point Prediction Model) and the F1 confusion matrix, as Their’s methods were considered to be the best practice. Pethe, Kim & Skiena (2020) used a pre-trained BERT model for the Next Sentence Prediction task combining with the dynamic programming algorithm.…”

Section: Introduction (mentioning)

confidence: 99%

“…Our results were then compared with the best methods found in Pethe, Kim & Skiena (2020) (BERT Break Point Prediction Model) and the F1 confusion matrix, as Their’s methods were considered to be the best practice. Pethe, Kim & Skiena (2020) used a pre-trained BERT model for the Next Sentence Prediction task combining with the dynamic programming algorithm. The BERT Break Point Prediction model (Pethe, Kim & Skiena, 2020) successfully competed with all baseline models, including the C99 algorithm from Choi (2000), the three-layer baseline perceptron model with 300 neurons in each layer (Badjatiya et al, 2018) and trained word2vec embeddings from Mikolov et al (2013), and the neural model described by Badjatiya et al (2018) that used long short-term memory (LSTM).…”

Section: Introduction (mentioning)

confidence: 99%
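The citation statements above mention break point scores being combined with a dynamic programming algorithm, but they do not spell out the objective. The following is a minimal sketch under stated assumptions, not the authors' implementation: it supposes each candidate position already has a break score (for example, from a next-sentence-prediction model), and it picks exactly `k` positions with the highest total score while keeping picks at least `min_gap` positions apart. The function name `select_breaks` and the `min_gap` constraint are hypothetical.

```python
from typing import List


def select_breaks(scores: List[float], k: int, min_gap: int = 1) -> List[int]:
    """Choose k break positions maximizing the summed score.

    Assumption (not from the source): consecutive chosen positions must be
    at least `min_gap` indices apart. Raises ValueError if no valid choice exists.
    """
    if k < 0 or min_gap < 1:
        raise ValueError("k must be >= 0 and min_gap must be >= 1")
    n = len(scores)
    neg_inf = float("-inf")
    # best[j][i]: best total using exactly j breaks among the first i positions.
    best = [[neg_inf] * (n + 1) for _ in range(k + 1)]
    take = [[False] * (n + 1) for _ in range(k + 1)]
    for i in range(n + 1):
        best[0][i] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            skip = best[j - 1 + 1][i - 1]  # do not use position i-1
            prev = max(i - min_gap, 0)     # positions still allowed before i-1
            pick = best[j - 1][prev] + scores[i - 1]
            if pick > skip:
                best[j][i], take[j][i] = pick, True
            else:
                best[j][i] = skip
    if best[k][n] == neg_inf:
        raise ValueError("cannot place k breaks with the given min_gap")
    # Walk the table backwards to recover the chosen positions.
    chosen: List[int] = []
    i, j = n, k
    while j > 0 and i > 0:
        if take[j][i]:
            chosen.append(i - 1)
            i = max(i - min_gap, 0)
            j -= 1
        else:
            i -= 1
    return chosen[::-1]


if __name__ == "__main__":
    toy_scores = [0.1, 0.9, 0.8, 0.2, 0.7]  # made-up scores for illustration
    print(select_breaks(toy_scores, k=2, min_gap=2))
```

Without the spacing constraint the problem reduces to taking the top `k` scores, so the constraint is what makes a dynamic program worthwhile in this toy setting.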

Paragraph-level attention based deep model for chapter segmentation

2022

PeerJ Computer Science

Books are usually divided into chapters and sections. Correctly and automatically recognizing chapter boundaries can work as a proxy when segmenting long texts (a more general task). Book chapters can be easily segmented by humans, but automatic segregation is more challenging because the data is semi-structured. Since the concept of language is prone to ambiguity, it is essential to identify the relationship between the words in each paragraph and classify each consecutive paragraph based on their respective relationships with one another. Although researchers have designed deep learning-based models to solve this problem, these approaches have not considered the paragraph-level semantics among the consecutive paragraphs. In this article, we propose a novel deep learning-based method to segment book chapters that uses paragraph-level semantics and an attention mechanism. We first utilized a pre-trained XLNet model connected to a convolutional neural network (CNN) to extract the semantic meaning of each paragraph. Then, we measured the similarities in the semantics of each paragraph and designed an attention mechanism to inject the similarity information in order to better predict the chapter boundaries. The experimental results indicated that the performance of our proposed method can surpass those of other state-of-the-art (SOTA) methods for chapter segmentation on public datasets (the proposed model achieved an F1 score of 0.8856, outperforming the Bidirectional Encoder Representations from Transformers (BERT) model’s F1 score of 0.6640). The ablation study also illustrated that the paragraph-level attention mechanism could produce a significant increase in performance.
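The abstract above describes measuring semantic similarity between consecutive paragraphs to help predict chapter boundaries. As a rough illustration of that one idea only, the sketch below computes cosine similarity between neighbouring paragraph vectors and flags low-similarity transitions. Everything here is an assumption for illustration: the vectors are hand-made stand-ins for the XLNet and CNN features, the fixed `threshold` replaces the paper's learned attention mechanism, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consecutive_similarities(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Similarity between paragraph i and paragraph i+1 for every adjacent pair."""
    return [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]


def candidate_boundaries(
    vectors: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Indices of paragraphs that start a new segment under a simple threshold rule.

    Hypothetical rule: paragraph i+1 starts a segment when its similarity to
    paragraph i falls below `threshold`.
    """
    sims = consecutive_similarities(vectors)
    return [i + 1 for i, s in enumerate(sims) if s < threshold]


if __name__ == "__main__":
    # Toy 2-D "paragraph embeddings": two paragraphs on one topic, then two on another.
    toy = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(candidate_boundaries(toy))
```

A thresholded similarity drop is a much cruder signal than a trained model; it is shown only to make the "similarity between consecutive paragraphs" step concrete.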

Collaborative Multi-agent System for Automatic Linear Text Segmentation

2022

PRIMA 2022: Principles and Practice of Multi-Agent Systems

Text Segmentation Algorithm Focused on Corpus Mining for Oilfield Exploration and Development

Gong, Zhang, Wang et al. 2024

2024 9th International Conference on Computer and Communication Systems (ICCCS)
