Text Segmentation as a Supervised Learning Task

Koshorek, Omri; Cohen, Adir; Mor, Noam; Rotman, Michael; Berant, Jonathan

doi:10.18653/v1/n18-2075

Cited by 106 publications

(191 citation statements)

References 10 publications

Supporting

Mentioning

155

Contrasting

“…More recent approaches (Alemi and Ginsparg, 2015; Glavaš et al, 2016) involve the use of semantic representations of words to compute sentence similarities. Koshorek et al (2018) and Badjatiya et al (2018) propose neural models to identify break points within the text. Sims et al (2019) address the slightly different, but relevant task of event prediction using a neural model, on a human-annotated dataset of short events.…”

Section: Previous Work (mentioning)

confidence: 99%

“…Our models outperform the baselines on all metrics, with the BERT (full window) model for break prediction model giving the best results. The approaches by Reynar (1994) and Utiyama and Isahara (2001), and the neural models proposed by Badjatiya et al (2018) and Koshorek et al (2018) are global models, and are prohibitively expensive on long documents.…”

Section: Algorithm (mentioning)

confidence: 99%

Chapter Captor: Text Segmentation in Novels

Pethe, Allen, Skiena

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
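The F1-scores quoted in this abstract are boundary-level scores. As a rough illustration only, the sketch below computes precision, recall, and F1 when a predicted break counts as correct only if it matches a gold break position exactly. The function name `boundary_f1` and the sample positions are hypothetical, and the abstract does not specify the paper's evaluation protocol, so this should be read as the standard definition under that exact-match assumption rather than the authors' procedure.

```python
def boundary_f1(predicted, gold):
    """Return (precision, recall, F1) over exact boundary positions.

    Assumption: a predicted break is a true positive only if the same
    position appears in the gold set; near misses earn no credit.
    """
    pred, ref = set(predicted), set(gold)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical break positions (e.g. paragraph indices), not data from the paper.
gold_breaks = [12, 40, 77, 103]
predicted_breaks = [12, 41, 77, 120]

precision, recall, f1 = boundary_f1(predicted_breaks, gold_breaks)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, the prediction at position 41 gets no credit for being one unit away from the gold break at 40, which is one reason exact break prediction over book-length documents is a demanding target.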

“…This task is often referred to as document segmentation or sometimes simply text segmentation. In Figure 1 we show one example of document segmentation from Wikipedia, on which the task is typically evaluated (Koshorek et al, 2018; Badjatiya et al, 2018).…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, document segmentation has been shown to improve information retrieval by indexing subdocument units instead of full documents (Llopis et al, 2002; Shtekh et al, 2018). Other applications such as summarization and information extraction can also benefit from text segmentation (Koshorek et al, 2018). The aim of document segmentation is breaking the raw text into a sequence of logically coherent sections (e.g., "Early life and marriage" and "Legacy" in our example).…”

Section: Introduction (mentioning)

confidence: 99%

“…Multiple neural approaches have been recently proposed for document and discourse segmentation. Koshorek et al (2018) proposed the use of [figure text interleaved here: "Sentence 1: Annuities are rarely a good idea at the age 35 because of withdrawal restrictions"; "Sentence 2: Wanted: An investment that's as simple and secure as a certificate of deposit but offers a return worth getting excited about."]…”

Section: Introduction (mentioning)

confidence: 99%

Text Segmentation by Cross Segment Attention

Łukasik, Dadachev, Papineni et al.

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Document and discourse segmentation are two fundamental NLP tasks pertaining to breaking up text into constituents, which are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we propose three transformer-based architectures and provide comprehensive comparisons with previously proposed approaches on three standard datasets. We establish a new state-of-the-art, reducing in particular the error rates by a large margin in all cases. We further analyze model sizes and find that we can build models with many fewer parameters while keeping good performance, thus facilitating real-world applications.

Figure text (Wikipedia document segmentation example): Early life and marriage: Franklin Delano Roosevelt was born on January 30, 1882, in the Hudson Valley town of Hyde Park, New York, to businessman James Roosevelt I and his second wife, Sara Ann Delano. (...) Aides began to refer to her at the time as "the president's girlfriend", and gossip linking the two romantically appeared in the newspapers. (...) Legacy: Roosevelt is widely considered to be one of the most important figures in the history of the United States, as well as one of the most influential figures of the 20th century. (...) Roosevelt has also appeared on several U.S. postage stamps.

DRIP: Segmenting individual requirements from software requirement documents

Zhao, Zhang, Lian et al.

2023

Softw Pract Exp

Numerous academic research projects and industrial tasks related to software engineering require individual requirements as input. Unfortunately, according to our observation, several requirements may be packed in one paragraph without explicit boundaries in specification documents. To understand this problem's prevalence, we performed a preliminary study on the open requirement documents widely used in the academic community over the last 10 years, and found that 26% of them include this phenomenon. Several text segmentation approaches have been reported; however, they tend to identify topically coherent units which may contain more than one requirement. What is more, they do not take the constitutions of semantic units of requirements into consideration. Here we report a two‐phase learning‐based approach named DRIP to segment individual requirements from paragraphs. To be specific, we first propose a Requirement Segmentation Siamese framework, which models the similarity of sentences and their conjunction relations, and then detects the initial boundaries between individual requirements. Then, we optimize the boundaries heuristically based on the semantic completeness validation of the segments. Experiments with 1132 paragraphs and 6826 sentences show that DRIP outperforms the popular unsupervised and supervised text segmentation algorithms with respect to processing different documents (with accuracy gains of 57.65%–187.53%) and processing paragraphs of different complexity (with average accuracy gains of 54.46%–158.68%). We also show the importance of each component of DRIP to the segmentation.
