Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Allen, Kim; Pethe, Charuta; Inoue, Nozomu; Skiena, Steve

doi:10.18653/v1/2021.findings-emnlp.356

Cited by 2 publications

(2 citation statements)

References 27 publications


“…There is a growing interest in computational narrative analysis, ranging from analyzing the structure of narratives (Kim et al, , 2021, identifying important events in stories Keller, 2020, 2021; Papalampidi et al, 2020; Otake et al, 2020) to analyzing the relationship between characters in novels (Iyyer et al, 2016; Xanthos et al, 2016; Skorinkin, 2017; Azab et al, 2019; Labatut and Bost, 2019; Kubis, 2021; Brahman et al, 2021). The most relevant work to ours is Azab et al (2019), who apply word2vec (Mikolov et al, 2013) to learn character embeddings from movie scripts.…”

Section: Related Work (mentioning)

confidence: 99%

Learning and Evaluating Character Representations in Novels

Inoue, Pethe, Allen et al. 2022

Findings of the Association for Computational Linguistics: ACL 2022

Self Cite


We address the problem of learning fixed-length vector representations of characters in novels. Recent advances in word embeddings have proven successful in learning entity representations from short texts, but fall short on longer documents because they do not capture full book-level information. To overcome the weakness of such text-based embeddings, we propose two novel methods for representing characters: (i) graph neural network-based embeddings from a full corpus-based character network; and (ii) low-dimensional embeddings constructed from the occurrence pattern of characters in each novel. We test the quality of these character embeddings using a new benchmark suite to evaluate character representations, encompassing 12 different tasks. We show that our representation techniques combined with text-based embeddings lead to the best character representations, outperforming text-based embeddings in four tasks. Our dataset is made publicly available to stimulate additional work in this area.



“…Analysis of books have been streamlined through pipelines like BookNLP (Bamman et al, 2014) as well as datasets of entities (Bamman et al, 2019). We also find other book-related works that improve the quality of books as well as understanding aspects such as time (Kim et al, , 2021. In particular, we focus on chapters and require proper chapter segmentation.…”

Section: Related Work (mentioning)

confidence: 99%

Chapter Ordering in Novels

Kim, Skiena 2022

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing


Understanding narrative flow and text coherence in long-form documents (novels) remains an open problem in NLP. To gain insight, we explore the task of chapter ordering, reconstructing the original order of chapters in a novel given a random permutation of the text. This can be seen as extending the well-known sentence ordering task to vastly larger documents: our task deals with over 9,000 novels with an average of twenty chapters each, versus standard sentence ordering datasets averaging only 5-8 sentences. We formulate the task of reconstructing order as a constraint solving problem, using minimum feedback arc set and traveling salesman problem optimization criteria, where the weights of the graph are generated based on models for character occurrences and chapter boundary detection, using relational chapter scores derived from RoBERTa. Our best methods yield a Spearman correlation of 0.59 on this novel and challenging task, substantially above baseline.
