Text Segmentation as a Supervised Learning Task

Koshorek, Omri; Cohen, Adir; Mor, Noam; Rotman, Michael; Berant, Jonathan

doi:10.48550/arxiv.1803.09337

Cited by 5 publications

(9 citation statements)

References 0 publications

“…The experimental results for the segmentation experiments are in Table 1a. Our baseline system that does not use any pretraining of representations is similar to the one proposed in Koshorek et al (2018) with the difference that our system uses character information to generate the word embeddings. Note that pretraining results in substantial improvements over the baseline systems in all settings.…”

Section: Results (mentioning)

confidence: 99%

“…We choose LSTM-based architectures with a standard hierarchical structure that has been useful for capturing long-term context in document-level tasks (Serban et al, 2016;Koshorek et al, 2018). We experiment with two related document-level representations.…”

Section: Hierarchical Document-level Representations (mentioning)

confidence: 99%

“…In the higher level, the contextual sentence representation is formed with respect to the document. Such structures enable the models to integrate long-distance context from the documents and have been used for labeling sentence sentiment (Ruder et al, 2016), document summarization (Cheng & Lapata, 2016), text segmentation (Koshorek et al, 2018), and text classification (Yang et al, 2016), inter alia. These hierarchical neural representations have been largely learned based on task-specific labeled data, posing a challenge for applications with a limited number of annotated examples.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this paper, we focus on tasks that require document-level understanding. We build upon two existing separate lines of work that are useful for these tasks: (1) document-level models with hierarchical architectures, which include sentence representations contextualized with respect to entire documents (Ruder et al, 2016;Cheng & Lapata, 2016;Koshorek et al, 2018;Yang et al, 2016), and (2) contextual representation learning with language-model pretraining (Peters et al, 2018;Radford et al, 2018).…”

Section: Introduction (mentioning)

confidence: 99%

Language Model Pre-training for Hierarchical Document Representations

Chang, Toutanova, Lee et al. 2019

Preprint

Hierarchical neural architectures are often used to capture long-distance dependencies and have been applied to many document-level tasks such as summarization, document segmentation, and sentiment analysis. However, effective usage of such a large context can be difficult to learn, especially in the case where there is limited labeled data available. Building on the recent success of language model pretraining methods for learning flat representations of text, we propose algorithms for pre-training hierarchical document representations from unlabeled data. Unlike prior work, which has focused on pre-training contextual token representations or context-independent sentence/paragraph representations, our hierarchical document representations include fixed-length sentence/paragraph representations which integrate contextual information from the entire documents. Experiments on document segmentation, document-level question answering, and extractive document summarization demonstrate the effectiveness of the proposed pre-training algorithms.

“…However, ancient document images suffer from critical challenges including varying noise conditions, interfering annotations, typical ancient record artifacts like fading and vanishing texts, and variations in handwriting making it difficult to transcribe [27]. Over the past decade, various approaches have been proposed to solve document analysis and recognition such as optical character recognition (OCR) [26], layout analysis [28], text segmentation [19] and handwriting recognition [34,10,9,13]. Although OCR models have been very successful in recognizing machine print text, they stumble upon handwriting recognition due to aforementioned challenges and connecting characters in the text as compared to machine print ones where the characters are easily separable.…”

Section: Introduction (mentioning)

confidence: 99%

Illegible Text to Readable Text: An Image-to-Image Transformation using Conditional Sliced Wasserstein Adversarial Networks

Karimi, Veni, Yu 2019

Preprint

Automatic text recognition from ancient handwritten record images is an important problem in the genealogy domain. However, critical challenges such as varying noise conditions, vanishing texts, and variations in handwriting make the recognition task difficult. We tackle this problem by developing a handwritten-to-machine-print conditional Generative Adversarial network (HW2MP-GAN) model that formulates handwritten recognition as a text-Image-to-text-Image translation problem where a given image, typically in an illegible form, is converted into another image, close to its machine-print form. The proposed model consists of three components: a generator, a word-level discriminator, and a character-level discriminator. The model incorporates Sliced Wasserstein distance (SWD) and U-Net architectures in HW2MP-GAN for better quality image-to-image transformation. Our experiments reveal that HW2MP-GAN outperforms state-of-the-art baseline cGAN models by almost 30 in Frechet Handwritten Distance (FHD), 0.6 in average Levenshtein distance and 39% in word accuracy for image-to-image translation on the IAM database. Further, HW2MP-GAN improves handwritten recognition word accuracy by 1.3% compared to baseline handwritten recognition models on the IAM database.

Auxiliary Loss for BERT-Based Paragraph Segmentation

Zhuo, Murata 2023

IEICE Trans. Inf. & Syst.

Paragraph segmentation is a text segmentation task. Iikura et al. achieved excellent results on paragraph segmentation by introducing focal loss to Bidirectional Encoder Representations from Transformers. In this study, we investigated paragraph segmentation on Daily News and Novel datasets. Based on the approach proposed by Iikura et al., we used auxiliary loss to train the model to improve paragraph segmentation performance. Consequently, the average F1-score obtained by the approach of Iikura et al. was 0.6704 on the Daily News dataset, whereas that of our approach was 0.6801. Our approach thus improved the performance by approximately 1%. The performance improvement was also confirmed on the Novel dataset. Furthermore, the results of two-tailed paired t-tests indicated that the difference between the performance of the two approaches was statistically significant.
