Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.672

Chapter Captor: Text Segmentation in Novels

Abstract: Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after rem…

Citations: cited by 11 publications (17 citation statements)
References: 24 publications
“…Text segmentation ( Pethe, Kim & Skiena, 2020 ; Haruechaiyasak, Kongyoung & Dailey, 2008 ; Koshorek et al, 2018 ; Li et al, 2020 ; Lukasik et al, 2020 ; Nguyen et al, 2021 ) is a method that is typically used to segment chapters by separating text into multiple segments or boundaries. It has also been used in many natural language processing tasks, such as word tokenization, text summarization ( Hulliyah & Kusuma, 2010 ; Awasthi et al, 2021 ), question answering prediction ( Wang, Ling & Hu, 2019 ), and machine translation ( Kong, Zhang & Hovy, 2020 ; Gupta et al, 2021 ; Budiwati & Aritsugi, 2022 ).…”
Section: Introduction
confidence: 99%
“…Recent research has proposed the building of a deep-learning system to automatically identify chapter boundaries. For example, Pethe, Kim & Skiena (2020) proposed a Chapter Captor used to correctly recognize chapter breakpoints in novels. They proposed using Bidirectional Encoder Representations from Transformers ( Devlin et al, 2018 ) to learn the semantic features and generate token-wise softmax probabilities.…”
Section: Introduction
confidence: 99%
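
The "token-wise softmax probabilities" mentioned in the citation statement above can be illustrated with a minimal sketch. This is not the cited authors' implementation: the two-class layout (no break, break), the 0.5 threshold, the function names, and the logit values are all assumptions made for illustration, standing in for the per-token output of a model such as BERT.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def break_probabilities(token_logits):
    """Return P(break) for each token.

    Assumes each token has a [no_break, break] logit pair
    (a hypothetical layout, not taken from the cited paper).
    """
    return [softmax(pair)[1] for pair in token_logits]


def predict_breaks(token_logits, threshold=0.5):
    """Indices of tokens whose break probability exceeds the threshold."""
    probs = break_probabilities(token_logits)
    return [i for i, p in enumerate(probs) if p > threshold]


# Made-up logits for four tokens; a real system would obtain
# these from a trained model rather than hard-coding them.
logits = [[2.0, -1.0], [0.0, 0.0], [-1.5, 3.0], [1.0, 0.5]]
print(predict_breaks(logits))
```

Thresholding each token independently is the simplest possible decision rule; it is used here only to show how per-token probabilities turn into candidate breakpoints.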